* [PATCH 0/2] printk: replace ringbuffer
@ 2020-01-28 16:19 John Ogness
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
                   ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: John Ogness @ 2020-01-28 16:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Hello,

After several RFC series [0][1][2][3][4], here is the first set of
patches to rework the printk subsystem. This first set of patches
only replaces the existing ringbuffer implementation. No locking is
removed. No printk semantics or behavior are changed.

The VMCOREINFO is updated, which will require changes to the
external crash [5] tool. I will be preparing a patch to add support
for the new VMCOREINFO.

This series is in line with the agreements [6] made at the meeting
during LPC2019 in Lisbon, with one exception: support for dictionaries
will _not_ be discontinued [7]. Dictionaries are stored in a separate
buffer so that they cannot interfere with the human-readable buffer.

John Ogness

[0] https://lkml.kernel.org/r/20190212143003.48446-1-john.ogness@linutronix.de
[1] https://lkml.kernel.org/r/20190607162349.18199-1-john.ogness@linutronix.de
[2] https://lkml.kernel.org/r/20190727013333.11260-1-john.ogness@linutronix.de
[3] https://lkml.kernel.org/r/20190807222634.1723-1-john.ogness@linutronix.de
[4] https://lkml.kernel.org/r/20191128015235.12940-1-john.ogness@linutronix.de
[5] https://github.com/crash-utility/crash
[6] https://lkml.kernel.org/r/87k1acz5rx.fsf@linutronix.de
[7] https://lkml.kernel.org/r/20191007120134.ciywr3wale4gxa6v@pathway.suse.cz

John Ogness (2):
  printk: add lockless buffer
  printk: use the lockless ringbuffer

 include/linux/kmsg_dump.h         |    2 -
 kernel/printk/Makefile            |    1 +
 kernel/printk/printk.c            |  836 +++++++++---------
 kernel/printk/printk_ringbuffer.c | 1370 +++++++++++++++++++++++++++++
 kernel/printk/printk_ringbuffer.h |  328 +++++++
 5 files changed, 2114 insertions(+), 423 deletions(-)
 create mode 100644 kernel/printk/printk_ringbuffer.c
 create mode 100644 kernel/printk/printk_ringbuffer.h

-- 
2.20.1



* [PATCH 1/2] printk: add lockless buffer
  2020-01-28 16:19 [PATCH 0/2] printk: replace ringbuffer John Ogness
@ 2020-01-28 16:19 ` John Ogness
  2020-01-29  3:53   ` Steven Rostedt
                     ` (2 more replies)
  2020-01-28 16:19 ` [PATCH 2/2] printk: use the lockless ringbuffer John Ogness
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 58+ messages in thread
From: John Ogness @ 2020-01-28 16:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Introduce a multi-reader multi-writer lockless ringbuffer for storing
the kernel log messages. Readers and writers may use the API from
any context (including scheduler and NMI). This ringbuffer will make
it possible to decouple printk() callers from any context, locking,
or console constraints. It also makes it possible for readers to have
full access to the ringbuffer contents at any time and from any
context (for example, in any panic situation).

The printk_ringbuffer is made up of 3 internal ringbuffers::

desc_ring:      A ring of descriptors. A descriptor contains all record
                meta data (sequence number, timestamp, loglevel, etc.)
                as well as internal state information about the record
                and logical positions specifying where in the other
                ringbuffers the text and dictionary strings are
                located.

text_data_ring: A ring of data blocks. A data block consists of an
                unsigned long integer (ID) that maps to a desc_ring
                index followed by the text string of the record.

dict_data_ring: A ring of data blocks. A data block consists of an
                unsigned long integer (ID) that maps to a desc_ring
                index followed by the dictionary string of the record.

Descriptor state information is the key element that allows readers
and writers to locklessly synchronize access to the data.

Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 kernel/printk/printk_ringbuffer.c | 1370 +++++++++++++++++++++++++++++
 kernel/printk/printk_ringbuffer.h |  328 +++++++
 2 files changed, 1698 insertions(+)
 create mode 100644 kernel/printk/printk_ringbuffer.c
 create mode 100644 kernel/printk/printk_ringbuffer.h

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
new file mode 100644
index 000000000000..796257f226ee
--- /dev/null
+++ b/kernel/printk/printk_ringbuffer.c
@@ -0,0 +1,1370 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/irqflags.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/bug.h>
+#include "printk_ringbuffer.h"
+
+/**
+ * DOC: printk_ringbuffer overview
+ *
+ * Data Structure
+ * --------------
+ * The printk_ringbuffer is made up of 3 internal ringbuffers::
+ *
+ *   * desc_ring:      A ring of descriptors. A descriptor contains all record
+ *                     meta data (sequence number, timestamp, loglevel, etc.)
+ *                     as well as internal state information about the record
+ *                     and logical positions specifying where in the other
+ *                     ringbuffers the text and dictionary strings are
+ *                     located.
+ *
+ *   * text_data_ring: A ring of data blocks. A data block consists of an
+ *                     unsigned long integer (ID) that maps to a desc_ring
+ *                     index followed by the text string of the record.
+ *
+ *   * dict_data_ring: A ring of data blocks. A data block consists of an
+ *                     unsigned long integer (ID) that maps to a desc_ring
+ *                     index followed by the dictionary string of the record.
+ *
+ * Implementation
+ * --------------
+ *
+ * ABA Issues
+ * ~~~~~~~~~~
+ * To help avoid ABA issues, descriptors are referenced by IDs (index values
+ * with tagged states) and data blocks are referenced by logical positions
+ * (index values with tagged states). However, on 32-bit systems the number
+ * of tagged states is relatively small such that an ABA incident is (at
+ * least theoretically) possible. For example, if 4 million maximally sized
+ * printk messages were to occur in NMI context on a 32-bit system, the
+ * interrupted task would not be able to recognize that the 32-bit integer
+ * wrapped and thus represents a different data block than the one the
+ * interrupted task expects.
+ *
+ * To help combat this possibility, additional state checking is performed
+ * (such as using cmpxchg() even though set() would suffice). These extra
+ * checks will hopefully catch any ABA issue that a 32-bit system might
+ * experience.
+ *
+ * Memory Barriers
+ * ~~~~~~~~~~~~~~~
+ * Several memory barriers are used. To simplify proving correctness and
+ * generating litmus tests, lines of code using memory barriers (loads,
+ * stores and the associated memory barriers) are labeled:
+ *
+ *	LMM(function:letter)
+ *
+ * Comments reference them using only the function:letter part.
+ *
+ * Descriptor Ring
+ * ~~~~~~~~~~~~~~~
+ * The descriptor ring is an array of descriptors. A descriptor contains all
+ * the meta data of a printk record as well as blk_lpos structs pointing to
+ * associated text and dictionary data blocks (see "Data Rings" below). Each
+ * descriptor is assigned an ID that maps directly to index values of the
+ * descriptor array and has a state. The ID and the state are bitwise combined
+ * into a single descriptor field named @state_var, allowing ID and state to
+ * be synchronously and atomically updated.
+ *
+ * Descriptors have three states:
+ *
+ *   * reserved:  A writer is modifying the record.
+ *
+ *   * committed: The record and all its data are complete and available
+ *                for reading.
+ *
+ *   * reusable:  The record exists, but its text and/or dictionary data
+ *                may no longer be available.
+ *
+ * Querying the @state_var of a record requires providing the ID of the
+ * descriptor to query. This can yield a possible fourth (pseudo) state:
+ *
+ *   * miss:      The descriptor being queried has an unexpected ID.
+ *
+ * The descriptor ring has a @tail_id that contains the ID of the oldest
+ * descriptor and a @head_id that contains the ID of the newest descriptor.
+ *
+ * When a new descriptor should be created (and the ring is full), the tail
+ * descriptor is invalidated by first transitioning to the reusable state and
+ * then invalidating all tail data blocks up to and including the data blocks
+ * associated with the tail descriptor (for text and dictionary rings). Then
+ * @tail_id is advanced, followed by advancing @head_id. And finally the
+ * @state_var of the new descriptor is initialized to the new ID and reserved
+ * state.
+ *
+ * The @tail_id can only be advanced if the new @tail_id would be in the
+ * committed or reusable queried state. This ensures that a valid sequence
+ * number for the tail is always available.
+ *
+ * Data Rings
+ * ~~~~~~~~~~
+ * The two data rings (text and dictionary) function identically. They exist
+ * separately so that their buffer sizes can be individually set and they do
+ * not affect one another.
+ *
+ * Data rings are byte arrays composed of data blocks, referenced by blk_lpos
+ * structs that point to the logical position of the beginning of a data block
+ * and the beginning of the next adjacent data block. Logical positions are
+ * mapped directly to index values of the byte array ringbuffer.
+ *
+ * Each data block consists of an ID followed by the raw data. The ID is the
+ * identifier of a descriptor that is associated with the data block. A data
+ * block is considered valid if all of the following conditions are met:
+ *
+ *   1) The descriptor associated with the data block is in the committed
+ *      or reusable queried state.
+ *
+ *   2) The descriptor associated with the data block points back to the
+ *      same data block.
+ *
+ *   3) The data block is within the head/tail logical position range.
+ *
+ * If the raw data of a data block would extend beyond the end of the byte
+ * array, only the ID of the data block is stored at the logical position
+ * and the full data block (ID and raw data) is stored at the beginning of
+ * the byte array. The referencing blk_lpos will point to the ID before the
+ * wrap and the next will point to the logical position adjacent to the full
+ * data block.
+ *
+ * Data rings have a @tail_lpos that points to the beginning of the oldest
+ * data block and a @head_lpos that points to the logical position of the
+ * next (not yet existing) data block.
+ *
+ * When a new data block should be created (and the ring is full), tail data
+ * blocks will first be invalidated by putting their associated descriptors
+ * into the reusable state and then pushing the @tail_lpos forward beyond
+ * them. Then the @head_lpos is pushed forward and is associated with a new
+ * descriptor. If a data block is not valid, the @tail_lpos cannot be
+ * advanced beyond it.
+ *
+ * Usage
+ * -----
+ * Here are some simple examples demonstrating writers and readers. For the
+ * examples a global ringbuffer (test_rb) is available (which is not the
+ * actual ringbuffer used by printk)::
+ *
+ *	DECLARE_PRINTKRB(test_rb, 15, 5, 3);
+ *
+ * This ringbuffer allows up to 32768 records (2 ^ 15) and has a size of
+ * 1 MiB (2 ^ 20) for text data and 256 KiB (2 ^ 18) for dictionary data.
+ *
+ * Sample writer code::
+ *
+ *	struct prb_reserved_entry e;
+ *	struct printk_record r;
+ *
+ *	// specify how much to allocate
+ *	r.text_buf_size = strlen(textstr) + 1;
+ *	r.dict_buf_size = strlen(dictstr) + 1;
+ *
+ *	if (prb_reserve(&e, &test_rb, &r)) {
+ *		snprintf(r.text_buf, r.text_buf_size, "%s", textstr);
+ *
+ *		// dictionary allocation may have failed
+ *		if (r.dict_buf)
+ *			snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr);
+ *
+ *		r.info->ts_nsec = local_clock();
+ *
+ *		prb_commit(&e);
+ *	}
+ *
+ * Sample reader code::
+ *
+ *	struct printk_info info;
+ *	char text_buf[32];
+ *	char dict_buf[32];
+ *	struct printk_record r = {
+ *		.info		= &info,
+ *		.text_buf	= &text_buf[0],
+ *		.dict_buf	= &dict_buf[0],
+ *		.text_buf_size	= sizeof(text_buf),
+ *		.dict_buf_size	= sizeof(dict_buf),
+ *	};
+ *	u64 seq;
+ *
+ *	prb_for_each_record(0, &test_rb, &seq, &r) {
+ *		if (info.seq != seq)
+ *			pr_warn("lost %llu records\n", info.seq - seq);
+ *
+ *		if (info.text_len > r.text_buf_size) {
+ *			pr_warn("record %llu text truncated\n", info.seq);
+ *			text_buf[sizeof(text_buf) - 1] = 0;
+ *		}
+ *
+ *		if (info.dict_len > r.dict_buf_size) {
+ *			pr_warn("record %llu dict truncated\n", info.seq);
+ *			dict_buf[sizeof(dict_buf) - 1] = 0;
+ *		}
+ *
+ *		pr_info("%llu: %llu: %s;%s\n", info.seq, info.ts_nsec,
+ *			&text_buf[0], info.dict_len ? &dict_buf[0] : "");
+ *	}
+ */
+
+#define DATA_SIZE(data_ring)		_DATA_SIZE((data_ring)->size_bits)
+#define DATA_SIZE_MASK(data_ring)	(DATA_SIZE(data_ring) - 1)
+
+#define DESCS_COUNT(desc_ring)		_DESCS_COUNT((desc_ring)->count_bits)
+#define DESCS_COUNT_MASK(desc_ring)	(DESCS_COUNT(desc_ring) - 1)
+
+/* Determine the data array index from a logical position. */
+#define DATA_INDEX(data_ring, lpos)	((lpos) & DATA_SIZE_MASK(data_ring))
+
+/* Determine the desc array index from an ID or sequence number. */
+#define DESC_INDEX(desc_ring, n)	((n) & DESCS_COUNT_MASK(desc_ring))
+
+/* Determine how many times the data array has wrapped. */
+#define DATA_WRAPS(data_ring, lpos)	((lpos) >> (data_ring)->size_bits)
+
+/* Get the logical position at index 0 of the current wrap. */
+#define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
+	((lpos) & ~DATA_SIZE_MASK(data_ring))
+
+/* Get the ID for the same index of the previous wrap as the given ID. */
+#define DESC_ID_PREV_WRAP(desc_ring, id) \
+	DESC_ID((id) - DESCS_COUNT(desc_ring))
+
+/* A data block: maps to the raw data within the data ring. */
+struct prb_data_block {
+	unsigned long	id;
+	char		data[0];
+};
+
+static struct prb_desc *to_desc(struct prb_desc_ring *desc_ring, u64 n)
+{
+	return &desc_ring->descs[DESC_INDEX(desc_ring, n)];
+}
+
+static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
+				       unsigned long begin_lpos)
+{
+	char *data = &data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
+
+	return (struct prb_data_block *)data;
+}
+
+/* Increase the data size to account for data block meta data. */
+static unsigned long to_blk_size(unsigned long size)
+{
+	struct prb_data_block *db = NULL;
+
+	size += sizeof(*db);
+	size = ALIGN(size, sizeof(db->id));
+	return size;
+}
+
+/*
+ * Sanity checker for reserve size. The ringbuffer code assumes that a data
+ * block does not exceed the maximum possible size that could fit within the
+ * ringbuffer. This function provides that basic size check so that the
+ * assumption is safe.
+ *
+ * Writers are also not allowed to write 0-sized (data-less) records. Such
+ * records are used only internally by the ringbuffer.
+ */
+static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size)
+{
+	struct prb_data_block *db = NULL;
+
+	/*
+	 * Writers are not allowed to write data-less records. Such records
+	 * are used only internally by the ringbuffer to denote records where
+	 * their data failed to allocate or have been lost.
+	 */
+	if (size == 0)
+		return false;
+
+	/*
+	 * Ensure the alignment padded size could possibly fit in the data
+	 * array. The largest possible data block must still leave room for
+	 * at least the ID of the next block.
+	 */
+	size = to_blk_size(size);
+	if (size > DATA_SIZE(data_ring) - sizeof(db->id))
+		return false;
+
+	return true;
+}
+
+/* The possible responses of a descriptor state-query. */
+enum desc_state {
+	desc_miss,	/* ID mismatch */
+	desc_reserved,	/* reserved, but still in use by writer */
+	desc_committed, /* committed, writer is done */
+	desc_reusable,	/* free, not used by any writer */
+};
+
+/* Query the state of a descriptor. */
+static enum desc_state get_desc_state(unsigned long id,
+				      unsigned long state_val)
+{
+	if (id != DESC_ID(state_val))
+		return desc_miss;
+
+	if (state_val & DESC_REUSE_MASK)
+		return desc_reusable;
+
+	if (state_val & DESC_COMMITTED_MASK)
+		return desc_committed;
+
+	return desc_reserved;
+}
+
+/* Get a copy of a specified descriptor and its state. */
+static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
+				 unsigned long id, struct prb_desc *desc_out)
+{
+	struct prb_desc *desc = to_desc(desc_ring, id);
+	atomic_long_t *state_var = &desc->state_var;
+	enum desc_state d_state;
+	unsigned long state_val;
+
+	/*
+	 * Check the state before copying the data. Only descriptors in the
+	 * committed or reusable state are copied because a descriptor in any
+	 * other state is in use and must be considered garbage by the reader.
+	 */
+	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
+	d_state = get_desc_state(id, state_val);
+	if (d_state != desc_committed && d_state != desc_reusable)
+		return d_state;
+
+	/*
+	 * Guarantee the state is loaded before loading/copying the
+	 * descriptor. This pairs with prb_commit:B.
+	 */
+	smp_rmb(); /* LMM(desc_read:B) */
+
+	/*
+	 * Copy the descriptor.
+	 *
+	 * Memory barrier involvement:
+	 *
+	 * 1. No possibility of reading old/obsolete descriptor data.
+	 *    If desc_read:A reads from prb_commit:C, then desc_read:C reads
+	 *    from prb_commit:A.
+	 *
+	 *    Relies on:
+	 *
+	 *    WMB from prb_commit:A to prb_commit:C
+	 *       matching
+	 *    RMB from desc_read:A to desc_read:C
+	 *
+	 * 2. No possibility of reading old/obsolete descriptor state.
+	 *    If desc_read:C reads from desc_reserve:D, then desc_read:E
+	 *    reads from desc_reserve:B.
+	 *
+	 *    Relies on:
+	 *
+	 *    WMB from desc_reserve:B to desc_reserve:D
+	 *       matching
+	 *    RMB from desc_read:C to desc_read:E
+	 */
+	*desc_out = READ_ONCE(*desc); /* LMM(desc_read:C) */
+
+	/*
+	 * Guarantee the descriptor is loaded before re-checking the
+	 * state. This pairs with desc_reserve:C.
+	 */
+	smp_rmb(); /* LMM(desc_read:D) */
+
+	/*
+	 * Re-check the state after copying the data. If the state is no
+	 * longer committed or reusable, the caller must consider the copied
+	 * descriptor as garbage.
+	 */
+	state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
+	return get_desc_state(id, state_val);
+}
+
+/*
+ * Take a given descriptor out of the committed state by attempting
+ * the transition from committed to reusable. Either this task or some
+ * other task will have been successful.
+ */
+static void desc_make_reusable(struct prb_desc_ring *desc_ring,
+			       unsigned long id)
+{
+	struct prb_desc *desc = to_desc(desc_ring, id);
+	atomic_long_t *state_var = &desc->state_var;
+	unsigned long val_committed = id | DESC_COMMITTED_MASK;
+	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
+
+	atomic_long_cmpxchg_relaxed(state_var, val_committed, val_reusable);
+}
+
+/*
+ * For a given data ring (text or dict) and its current tail lpos:
+ * for each data block up until @lpos, make the associated descriptor
+ * reusable.
+ *
+ * If there is any problem making the associated descriptor reusable,
+ * either the descriptor has not yet been committed or another writer
+ * task has already pushed the tail lpos past the problematic data
+ * block. Regardless, on error the caller can re-load the tail lpos
+ * to determine the situation.
+ */
+static bool data_make_reusable(struct printk_ringbuffer *rb,
+			       struct prb_data_ring *data_ring,
+			       unsigned long tail_lpos, unsigned long lpos,
+			       unsigned long *lpos_out)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	struct prb_data_blk_lpos *blk_lpos;
+	struct prb_data_block *blk;
+	enum desc_state d_state;
+	struct prb_desc desc;
+	unsigned long id;
+
+	/*
+	 * Using the provided @data_ring, point @blk_lpos to the correct
+	 * blk_lpos within the local copy of the descriptor.
+	 */
+	if (data_ring == &rb->text_data_ring)
+		blk_lpos = &desc.text_blk_lpos;
+	else
+		blk_lpos = &desc.dict_blk_lpos;
+
+	/* Loop until @tail_lpos has advanced to or beyond @lpos. */
+	while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
+		blk = to_block(data_ring, tail_lpos);
+		id = READ_ONCE(blk->id);
+
+		d_state = desc_read(desc_ring, id,
+				    &desc); /* LMM(data_make_reusable:A) */
+
+		switch (d_state) {
+		case desc_miss:
+			return false;
+		case desc_reserved:
+			return false;
+		case desc_committed:
+			/*
+			 * This data block is invalid if the descriptor
+			 * does not point back to it.
+			 */
+			if (blk_lpos->begin != tail_lpos)
+				return false;
+			desc_make_reusable(desc_ring, id);
+			break;
+		case desc_reusable:
+			/*
+			 * This data block is invalid if the descriptor
+			 * does not point back to it.
+			 */
+			if (blk_lpos->begin != tail_lpos)
+				return false;
+			break;
+		}
+
+		/* Advance @tail_lpos to the next data block. */
+		tail_lpos = blk_lpos->next;
+	}
+
+	*lpos_out = tail_lpos;
+
+	return true;
+}
+
+/*
+ * Advance the data ring tail to at least @lpos. This function puts all
+ * descriptors into the reusable state if the tail will be pushed beyond
+ * their associated data block.
+ */
+static bool data_push_tail(struct printk_ringbuffer *rb,
+			   struct prb_data_ring *data_ring,
+			   unsigned long lpos)
+{
+	unsigned long tail_lpos;
+	unsigned long next_lpos;
+
+	/* If @lpos is not valid, there is nothing to do. */
+	if (lpos == INVALID_LPOS)
+		return true;
+
+	tail_lpos = atomic_long_read(&data_ring->tail_lpos);
+
+	do {
+		/* If @lpos is no longer valid, there is nothing to do. */
+		if (lpos - tail_lpos >= DATA_SIZE(data_ring))
+			break;
+
+		/*
+		 * Make all descriptors reusable that are associated with
+		 * data blocks before @lpos.
+		 */
+		if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
+					&next_lpos)) {
+			/*
+			 * data_make_reusable() performed state loads. Make
+			 * sure they are loaded before reloading the tail lpos
+			 * in order to see a new tail in the case that the
+			 * descriptor has been recycled. This pairs with
+			 * desc_reserve:A.
+			 */
+			smp_rmb(); /* LMM(data_push_tail:A) */
+
+			/*
+			 * Reload the tail lpos.
+			 *
+			 * Memory barrier involvement:
+			 *
+			 * No possibility of missing a recycled descriptor.
+			 * If data_make_reusable:A reads from desc_reserve:B,
+			 * then data_push_tail:B reads from desc_push_tail:A.
+			 *
+			 * Relies on:
+			 *
+			 * MB from desc_push_tail:A to desc_reserve:B
+			 *    matching
+			 * RMB from data_make_reusable:A to data_push_tail:B
+			 */
+			next_lpos = atomic_long_read(&data_ring->tail_lpos
+						); /* LMM(data_push_tail:B) */
+			if (next_lpos == tail_lpos)
+				return false;
+
+			/* Another task pushed the tail. Try again. */
+			tail_lpos = next_lpos;
+		}
+	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->tail_lpos,
+			&tail_lpos, next_lpos)); /* can be relaxed? */
+
+	return true;
+}
+
+/*
+ * Advance the desc ring tail. This function advances the tail by one
+ * descriptor, thus invalidating the oldest descriptor. Before advancing
+ * the tail, the tail descriptor is made reusable and all data blocks up to
+ * and including the descriptor's data block are invalidated (i.e. the data
+ * ring tail is pushed past the data block of the descriptor being made
+ * reusable).
+ */
+static bool desc_push_tail(struct printk_ringbuffer *rb,
+			   unsigned long tail_id)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	enum desc_state d_state;
+	struct prb_desc desc;
+
+	d_state = desc_read(desc_ring, tail_id, &desc);
+
+	switch (d_state) {
+	case desc_miss:
+		/*
+		 * If the ID is exactly 1 wrap behind the expected, it is
+		 * in the process of being reserved by another writer and
+		 * must be considered reserved.
+		 */
+		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
+		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
+			return false;
+		}
+		return true;
+	case desc_reserved:
+		return false;
+	case desc_committed:
+		desc_make_reusable(desc_ring, tail_id);
+		break;
+	case desc_reusable:
+		break;
+	}
+
+	/*
+	 * Data blocks must be invalidated before their associated
+	 * descriptor can be made available for recycling. Invalidating
+	 * them later is not possible because there is no way to trust
+	 * data blocks once their associated descriptor is gone.
+	 */
+
+	if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
+		return false;
+	if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
+		return false;
+
+	/* The data ring tail(s) were pushed: LMM(desc_push_tail:A) */
+
+	/*
+	 * Check the next descriptor after @tail_id before pushing the tail to
+	 * it because the tail must always be in a committed or reusable
+	 * state. The implementation of prb_first_seq() relies on this.
+	 *
+	 * A successful read implies that the next descriptor is less than or
+	 * equal to @head_id so there is no risk of pushing the tail past the
+	 * head.
+	 */
+	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1),
+			    &desc); /* LMM(desc_push_tail:B) */
+	if (d_state == desc_committed || d_state == desc_reusable) {
+		atomic_long_cmpxchg_relaxed(&desc_ring->tail_id, tail_id,
+			DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:C) */
+	} else {
+		/*
+		 * Guarantee the last state load from desc_read() is before
+		 * reloading @tail_id in order to see a new tail in the case
+		 * that the descriptor has been recycled. This pairs with
+		 * desc_reserve:A.
+		 */
+		smp_rmb(); /* LMM(desc_push_tail:D) */
+
+		/*
+		 * Re-check the tail ID. The descriptor following @tail_id is
+		 * not in an allowed tail state. But if the tail has since
+		 * been moved by another task, then it does not matter.
+		 *
+		 * Memory barrier involvement:
+		 *
+		 * No possibility of missing a pushed tail.
+		 * If desc_push_tail:B reads from desc_reserve:B, then
+		 * desc_push_tail:E reads from desc_push_tail:C.
+		 *
+		 * Relies on:
+		 *
+		 * MB from desc_push_tail:C to desc_reserve:B
+		 *    matching
+		 * RMB from desc_push_tail:B to desc_push_tail:E
+		 */
+		if (atomic_long_read(&desc_ring->tail_id) ==
+					tail_id) { /* LMM(desc_push_tail:E) */
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/* Reserve a new descriptor, invalidating the oldest if necessary. */
+static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	unsigned long prev_state_val;
+	unsigned long id_prev_wrap;
+	struct prb_desc *desc;
+	unsigned long head_id;
+	unsigned long id;
+
+	head_id = atomic_long_read(&desc_ring->head_id);
+
+	do {
+		desc = to_desc(desc_ring, head_id);
+
+		id = DESC_ID(head_id + 1);
+		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
+
+		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
+			/*
+			 * Make space for the new descriptor by
+			 * advancing the tail.
+			 */
+			if (!desc_push_tail(rb, id_prev_wrap))
+				return false;
+		}
+	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
+						  &head_id, id));
+
+	/*
+	 * Guarantee any data ring tail changes are stored before recycling
+	 * the descriptor. A full memory barrier is needed since another
+	 * task may have pushed the data ring tails. This pairs with
+	 * data_push_tail:A.
+	 *
+	 * Guarantee a new tail ID is stored before recycling the descriptor.
+	 * A full memory barrier is needed since another task may have pushed
+	 * the tail ID. This pairs with desc_push_tail:D and prb_first_seq:C.
+	 */
+	smp_mb(); /* LMM(desc_reserve:A) */
+
+	desc = to_desc(desc_ring, id);
+
+	/* If the descriptor has been recycled, verify the old state val. */
+	prev_state_val = atomic_long_read(&desc->state_var);
+	if (prev_state_val && prev_state_val != (id_prev_wrap |
+						 DESC_COMMITTED_MASK |
+						 DESC_REUSE_MASK)) {
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	/* Assign the descriptor a new ID and set its state to reserved. */
+	if (!atomic_long_try_cmpxchg_relaxed(&desc->state_var,
+			&prev_state_val, id | 0)) { /* LMM(desc_reserve:B) */
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	/*
+	 * Guarantee the new descriptor ID and state is stored before making
+	 * any other changes. This pairs with desc_read:D.
+	 */
+	smp_wmb(); /* LMM(desc_reserve:C) */
+
+	/* Now data in @desc can be modified: LMM(desc_reserve:D) */
+
+	*id_out = id;
+	return true;
+}
+
+/* Determine the end of a data block. */
+static unsigned long get_next_lpos(struct prb_data_ring *data_ring,
+				   unsigned long lpos, unsigned int size)
+{
+	unsigned long begin_lpos;
+	unsigned long next_lpos;
+
+	begin_lpos = lpos;
+	next_lpos = lpos + size;
+
+	if (DATA_WRAPS(data_ring, begin_lpos) ==
+	    DATA_WRAPS(data_ring, next_lpos)) {
+		/* The data block does not wrap. */
+		return next_lpos;
+	}
+
+	/* Wrapping data blocks store their data at the beginning. */
+	return (DATA_THIS_WRAP_START_LPOS(data_ring, next_lpos) + size);
+}
+
+/*
+ * Allocate a new data block, invalidating the oldest data block(s)
+ * if necessary. This function also associates the data block with
+ * a specified descriptor.
+ */
+static char *data_alloc(struct printk_ringbuffer *rb,
+			struct prb_data_ring *data_ring, unsigned long size,
+			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
+{
+	struct prb_data_block *blk;
+	unsigned long begin_lpos;
+	unsigned long next_lpos;
+
+	if (!data_ring->data || size == 0) {
+		/* Specify a data-less block. */
+		blk_lpos->begin = INVALID_LPOS;
+		blk_lpos->next = INVALID_LPOS;
+		return NULL;
+	}
+
+	size = to_blk_size(size);
+
+	begin_lpos = atomic_long_read(&data_ring->head_lpos);
+
+	do {
+		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
+
+		if (!data_push_tail(rb, data_ring,
+				    next_lpos - DATA_SIZE(data_ring))) {
+			/* Failed to allocate, specify a data-less block. */
+			blk_lpos->begin = INVALID_LPOS;
+			blk_lpos->next = INVALID_LPOS;
+			return NULL;
+		}
+	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
+						  &begin_lpos, next_lpos));
+
+	blk = to_block(data_ring, begin_lpos);
+	blk->id = id;
+
+	if (DATA_WRAPS(data_ring, begin_lpos) !=
+	    DATA_WRAPS(data_ring, next_lpos)) {
+		/* Wrapping data blocks store their data at the beginning. */
+		blk = to_block(data_ring, 0);
+		blk->id = id;
+	}
+
+	blk_lpos->begin = begin_lpos;
+	blk_lpos->next = next_lpos;
+
+	return &blk->data[0];
+}
+
+static unsigned int space_used(struct prb_data_ring *data_ring,
+			       struct prb_data_blk_lpos *blk_lpos)
+{
+	if (DATA_WRAPS(data_ring, blk_lpos->begin) ==
+	    DATA_WRAPS(data_ring, blk_lpos->next)) {
+		return (DATA_INDEX(data_ring, blk_lpos->next) -
+			DATA_INDEX(data_ring, blk_lpos->begin));
+	}
+
+	return (DATA_INDEX(data_ring, blk_lpos->next) +
+		DATA_SIZE(data_ring) -
+		DATA_INDEX(data_ring, blk_lpos->begin));
+}
+
+/**
+ * prb_reserve() - Reserve space in the ringbuffer.
+ *
+ * @e:  The entry structure to setup.
+ * @rb: The ringbuffer to reserve data in.
+ * @r:  The record structure to allocate buffers for.
+ *
+ * This is the public function available to writers to reserve data.
+ *
+ * The writer specifies the text and dict sizes to reserve by setting the
+ * @text_buf_size and @dict_buf_size fields of @r, respectively. Dictionaries
+ * are optional, so @dict_buf_size is allowed to be 0.
+ *
+ * Context: Any context. Disables local interrupts on success.
+ * Return: true if at least text data could be allocated, otherwise false.
+ *
+ * On success, the fields @info, @text_buf, @dict_buf of @r will be set by
+ * this function and should be filled in by the writer before committing. Also
+ * on success, prb_record_text_space() can be used on @e to query the actual
+ * space used for the text data block.
+ *
+ * If the function fails to reserve dictionary space (but all else succeeded),
+ * it will still report success. In that case @dict_buf is set to NULL and
+ * @dict_buf_size is set to 0. Writers must check this before writing to
+ * dictionary space.
+ */
+bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+		 struct printk_record *r)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	struct prb_desc *d;
+	unsigned long id;
+
+	if (!data_check_size(&rb->text_data_ring, r->text_buf_size))
+		goto fail;
+
+	/* Records without dictionaries are allowed. */
+	if (r->dict_buf_size) {
+		if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size))
+			goto fail;
+	}
+
+	/* Disable interrupts during the reserve/commit window. */
+	local_irq_save(e->irqflags);
+
+	if (!desc_reserve(rb, &id)) {
+		/* Descriptor reservation failures are tracked. */
+		atomic_long_inc(&rb->fail);
+		local_irq_restore(e->irqflags);
+		goto fail;
+	}
+
+	d = to_desc(desc_ring, id);
+
+	/*
+	 * Set the @e fields here so that prb_commit() can be used if
+	 * text data allocation fails.
+	 */
+	e->rb = rb;
+	e->id = id;
+
+	/*
+	 * Initialize the sequence number if it has never been set.
+	 * Otherwise just increment it by a full wrap.
+	 *
+	 * @seq is considered "never been set" if it has a value of 0,
+	 * _except_ for descs[0], which was set by the ringbuffer initializer
+	 * and therefore is always considered as set.
+	 *
+	 * See the "Bootstrap" comment block in printk_ringbuffer.h for
+	 * details about how the initializer bootstraps the descriptors.
+	 */
+	if (d->info.seq == 0 && DESC_INDEX(desc_ring, id) != 0)
+		d->info.seq = DESC_INDEX(desc_ring, id);
+	else
+		d->info.seq += DESCS_COUNT(desc_ring);
+
+	r->text_buf = data_alloc(rb, &rb->text_data_ring, r->text_buf_size,
+				 &d->text_blk_lpos, id);
+	/* If text data allocation fails, a data-less record is committed. */
+	if (r->text_buf_size && !r->text_buf) {
+		d->info.text_len = 0;
+		d->info.dict_len = 0;
+		prb_commit(e);
+		goto fail;
+	}
+
+	r->dict_buf = data_alloc(rb, &rb->dict_data_ring, r->dict_buf_size,
+				 &d->dict_blk_lpos, id);
+	/*
+	 * If dict data allocation fails, the caller can still commit
+	 * text. But dictionary information will not be available.
+	 */
+	if (r->dict_buf_size && !r->dict_buf)
+		r->dict_buf_size = 0;
+
+	r->info = &d->info;
+	r->text_line_count = NULL;
+
+	/* Set default values for the sizes. */
+	d->info.text_len = r->text_buf_size;
+	d->info.dict_len = r->dict_buf_size;
+
+	/* Record full text space used by record. */
+	e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos);
+
+	return true;
+fail:
+	/* Make it clear to the caller that the reserve failed. */
+	memset(r, 0, sizeof(*r));
+	return false;
+}
+EXPORT_SYMBOL(prb_reserve);
+
+/**
+ * prb_commit() - Commit (previously reserved) data to the ringbuffer.
+ *
+ * @e: The entry containing the reserved data information.
+ *
+ * This is the public function available to writers to commit data.
+ *
+ * Context: Any context. Enables local interrupts.
+ */
+void prb_commit(struct prb_reserved_entry *e)
+{
+	struct prb_desc_ring *desc_ring = &e->rb->desc_ring;
+	struct prb_desc *d = to_desc(desc_ring, e->id);
+	unsigned long prev_state_val = e->id | 0; /* reserved state: ID, no flags */
+
+	/* Now the writer has finished all writing: LMM(prb_commit:A) */
+
+	/*
+	 * Guarantee that all record data is stored before the descriptor
+	 * state is stored as committed. This pairs with desc_read:B.
+	 */
+	smp_wmb(); /* LMM(prb_commit:B) */
+
+	/* Set the descriptor as committed. */
+	if (!atomic_long_try_cmpxchg_relaxed(&d->state_var, &prev_state_val,
+		e->id | DESC_COMMITTED_MASK)) { /* LMM(prb_commit:C) */
+		WARN_ON_ONCE(1);
+	}
+
+	/* Restore interrupts, the reserve/commit window is finished. */
+	local_irq_restore(e->irqflags);
+}
+EXPORT_SYMBOL(prb_commit);
+
+/*
+ * Given @blk_lpos, return a pointer to the raw data from the data block
+ * and calculate the size of the data part. A NULL pointer is returned
+ * if @blk_lpos specifies values that could never be legal.
+ *
+ * This function (used by readers) performs strict validation on the lpos
+ * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
+ * triggered if an internal error is detected.
+ */
+static char *get_data(struct prb_data_ring *data_ring,
+		      struct prb_data_blk_lpos *blk_lpos,
+		      unsigned long *data_size)
+{
+	struct prb_data_block *db;
+
+	/* Data-less data block description. */
+	if (blk_lpos->begin == INVALID_LPOS &&
+	    blk_lpos->next == INVALID_LPOS) {
+		return NULL;
+
+	/* Regular data block: @begin less than @next and in same wrap. */
+	} else if (DATA_WRAPS(data_ring, blk_lpos->begin) ==
+		   DATA_WRAPS(data_ring, blk_lpos->next) &&
+		   blk_lpos->begin < blk_lpos->next) {
+		db = to_block(data_ring, blk_lpos->begin);
+		*data_size = blk_lpos->next - blk_lpos->begin;
+
+	/* Wrapping data block: @begin is one wrap behind @next. */
+	} else if (DATA_WRAPS(data_ring,
+			      blk_lpos->begin + DATA_SIZE(data_ring)) ==
+		   DATA_WRAPS(data_ring, blk_lpos->next)) {
+		db = to_block(data_ring, 0);
+		*data_size = DATA_INDEX(data_ring, blk_lpos->next);
+
+	/* Illegal block description. */
+	} else {
+		WARN_ON_ONCE(1);
+		return NULL;
+	}
+
+	/* A valid data block will always be aligned to the ID size. */
+	if (WARN_ON_ONCE(blk_lpos->begin !=
+			 ALIGN(blk_lpos->begin, sizeof(db->id))) ||
+	    WARN_ON_ONCE(blk_lpos->next !=
+			 ALIGN(blk_lpos->next, sizeof(db->id)))) {
+		return NULL;
+	}
+
+	/* A valid data block will always have at least an ID. */
+	if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
+		return NULL;
+
+	/* Subtract descriptor ID space from size to reflect data size. */
+	*data_size -= sizeof(db->id);
+
+	return &db->data[0];
+}
+
+/*
+ * Given @blk_lpos, copy an expected @len of data into the provided buffer.
+ * If @line_count is provided, count the number of lines in the data.
+ *
+ * This function (used by readers) performs strict validation on the data
+ * size to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
+ * triggered if an internal error is detected.
+ */
+static bool copy_data(struct prb_data_ring *data_ring,
+		      struct prb_data_blk_lpos *blk_lpos, u16 len, char *buf,
+		      unsigned int buf_size, unsigned int *line_count)
+{
+	unsigned long data_size;
+	char *data;
+
+	/* Caller might not want any data. */
+	if ((!buf || !buf_size) && !line_count)
+		return true;
+
+	data = get_data(data_ring, blk_lpos, &data_size);
+	if (!data)
+		return false;
+
+	/* Actual cannot be less than expected. */
+	if (WARN_ON_ONCE(data_size < (unsigned long)len)) {
+		pr_warn_once(
+		    "wrong data size (%lu, expecting %hu) for data: %.*s\n",
+		    data_size, len, (int)data_size, data);
+		return false;
+	}
+
+	/* Caller interested in the line count? */
+	if (line_count) {
+		unsigned long next_size = data_size;
+		char *next = data;
+
+		*line_count = 0;
+
+		while (next_size) {
+			(*line_count)++;
+			next = memchr(next, '\n', next_size);
+			if (!next)
+				break;
+			next++;
+			next_size = data_size - (next - data);
+		}
+	}
+
+	/* Caller interested in the data content? */
+	if (!buf || !buf_size)
+		return true;
+
+	data_size = min_t(u16, buf_size, len);
+
+	if (!WARN_ON_ONCE(!data_size))
+		memcpy(&buf[0], data, data_size);
+	return true;
+}
+
+/*
+ * Read the record @id and verify that it is committed and has the sequence
+ * number @seq. On success, 0 is returned.
+ *
+ * Error return values:
+ * -EINVAL: A committed record @seq does not exist.
+ * -ENOENT: The record @seq exists, but its data is not available. This is a
+ *          valid record, so readers should continue with the next seq.
+ */
+static int desc_read_committed(struct prb_desc_ring *desc_ring,
+			       unsigned long id, u64 seq,
+			       struct prb_desc *desc)
+{
+	enum desc_state d_state;
+
+	d_state = desc_read(desc_ring, id, desc);
+	if (desc->info.seq != seq)
+		return -EINVAL;
+	else if (d_state == desc_reusable)
+		return -ENOENT;
+	else if (d_state != desc_committed)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * Copy the ringbuffer data from the record with @seq to the provided
+ * @r buffer. On success, 0 is returned.
+ *
+ * See desc_read_committed() for error return values.
+ */
+static int prb_read(struct printk_ringbuffer *rb, u64 seq,
+		    struct printk_record *r)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	struct prb_desc *rdesc = to_desc(desc_ring, seq);
+	atomic_long_t *state_var = &rdesc->state_var;
+	struct prb_desc desc;
+	unsigned long id;
+	int err;
+
+	/* Get a reliable local copy of the descriptor and check validity. */
+	id = DESC_ID(atomic_long_read(state_var));
+	err = desc_read_committed(desc_ring, id, seq, &desc);
+
+	/*
+	 * It is possible that no record was specified. In that case the
+	 * caller is only interested in the availability of the record.
+	 */
+	if (err || !r)
+		return err;
+
+	/* If requested, copy meta data. */
+	if (r->info)
+		memcpy(r->info, &desc.info, sizeof(*(r->info)));
+
+	/* Copy text data. If it fails, this is a data-less descriptor. */
+	if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos,
+		       desc.info.text_len, r->text_buf, r->text_buf_size,
+		       r->text_line_count)) {
+		return -ENOENT;
+	}
+
+	/*
+	 * Copy dict data. Although this should not fail, dict data is not
+	 * important. So if it fails, modify the copied meta data to report
+	 * that there is no dict data, thus silently dropping the dict data.
+	 */
+	if (!copy_data(&rb->dict_data_ring, &desc.dict_blk_lpos,
+		       desc.info.dict_len, r->dict_buf, r->dict_buf_size,
+		       NULL)) {
+		if (r->info)
+			r->info->dict_len = 0;
+	}
+
+	/* Re-check real descriptor validity. */
+	return desc_read_committed(desc_ring, id, seq, &desc);
+}
+
+/**
+ * prb_first_seq() - Get the sequence number of the tail descriptor.
+ *
+ * @rb:  The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what the
+ * first/oldest sequence number is. This provides readers a starting
+ * point to begin iterating the ringbuffer. Note that the returned sequence
+ * number might not belong to a valid record.
+ *
+ * Context: Any context.
+ * Return: The sequence number of the first/oldest record, or 0 if the
+ *         ringbuffer is empty.
+ */
+u64 prb_first_seq(struct printk_ringbuffer *rb)
+{
+	struct prb_desc_ring *desc_ring = &rb->desc_ring;
+	enum desc_state d_state;
+	struct prb_desc desc;
+	unsigned long id;
+
+	for (;;) {
+		id = atomic_long_read(
+			&rb->desc_ring.tail_id); /* LMM(prb_first_seq:A) */
+
+		d_state = desc_read(desc_ring, id,
+				    &desc); /* LMM(prb_first_seq:B) */
+
+		/*
+		 * This loop will not be infinite because the tail is
+		 * _always_ in the committed or reusable state.
+		 */
+		if (d_state == desc_committed || d_state == desc_reusable)
+			break;
+
+		/*
+		 * Guarantee the last state load from desc_read() is before
+		 * reloading @tail_id in order to see a new tail in the case
+		 * that the descriptor has been recycled. This pairs with
+		 * desc_reserve:A.
+		 */
+		smp_rmb(); /* LMM(prb_first_seq:C) */
+
+		/*
+		 * Reload the tail ID.
+		 *
+		 * Memory barrier involvement:
+		 *
+		 * No possibility of missing a pushed tail.
+		 * If prb_first_seq:B reads from desc_reserve:B, then
+		 * prb_first_seq:A reads from desc_push_tail:C.
+		 *
+		 * Relies on:
+		 *
+		 * MB from desc_push_tail:C to desc_reserve:B
+		 *    matching
+		 * RMB prb_first_seq:B to prb_first_seq:A
+		 */
+	}
+
+	return desc.info.seq;
+}
+EXPORT_SYMBOL(prb_first_seq);
+
+/*
+ * Non-blocking read of a record. Updates @seq to the last committed record
+ * (which may have no data).
+ *
+ * See the description of prb_read_valid() for details.
+ */
+bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
+		     struct printk_record *r)
+{
+	u64 tail_seq;
+	int err;
+
+	while ((err = prb_read(rb, *seq, r))) {
+		tail_seq = prb_first_seq(rb);
+
+		if (*seq < tail_seq) {
+			/*
+			 * Behind the tail. Catch up and try again. This
+			 * can happen for -ENOENT and -EINVAL cases.
+			 */
+			*seq = tail_seq;
+
+		} else if (err == -ENOENT) {
+			/* Record exists, but no data available. Skip. */
+			(*seq)++;
+
+		} else {
+			/* Non-existent/non-committed record. Must stop. */
+			return false;
+		}
+	}
+
+	return true;
+}
+
+/**
+ * prb_read_valid() - Non-blocking read of a requested record or (if gone)
+ *                    the next available record.
+ *
+ * @rb:  The ringbuffer to read from.
+ * @seq: The sequence number of the record to read.
+ * @r:   The record data buffer to store the read record to.
+ *
+ * This is the public function available to readers to read a record.
+ *
+ * The reader provides the @info, @text_buf, @dict_buf buffers of @r to be
+ * filled in.
+ *
+ * Context: Any context.
+ * Return: true if a record was read, otherwise false.
+ *
+ * On success, the reader must check r->info.seq to see which record was
+ * actually read. This allows the reader to detect dropped records.
+ *
+ * Failure means @seq refers to a not yet written record.
+ */
+bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
+		    struct printk_record *r)
+{
+	return _prb_read_valid(rb, &seq, r);
+}
+EXPORT_SYMBOL(prb_read_valid);
+
+/**
+ * prb_next_seq() - Get the sequence number after the last available record.
+ *
+ * @rb:  The ringbuffer to get the sequence number from.
+ *
+ * This is the public function available to readers to see what the next
+ * newest sequence number available to readers will be. This provides readers
+ * a sequence number to jump to if all available records should be skipped.
+ *
+ * Context: Any context.
+ * Return: The sequence number of the next newest (not yet available) record
+ *         for readers.
+ */
+u64 prb_next_seq(struct printk_ringbuffer *rb)
+{
+	u64 seq = 0;
+
+	do {
+		/* Search forward from the oldest descriptor. */
+		if (!_prb_read_valid(rb, &seq, NULL))
+			return seq;
+		seq++;
+	} while (seq);
+
+	return 0;
+}
+EXPORT_SYMBOL(prb_next_seq);
+
+/**
+ * prb_init() - Initialize a ringbuffer to use provided external buffers.
+ *
+ * @rb:       The ringbuffer to initialize.
+ * @text_buf: The data buffer for text data.
+ * @textbits: The size of @text_buf as a power-of-2 value.
+ * @dict_buf: The data buffer for dictionary data.
+ * @dictbits: The size of @dict_buf as a power-of-2 value.
+ * @descs:    The descriptor buffer for ringbuffer records.
+ * @descbits: The count of @descs items as a power-of-2 value.
+ *
+ * This is the public function available to writers to setup a ringbuffer
+ * during runtime using provided buffers.
+ *
+ * Context: Any context.
+ */
+void prb_init(struct printk_ringbuffer *rb,
+	      char *text_buf, unsigned int textbits,
+	      char *dict_buf, unsigned int dictbits,
+	      struct prb_desc *descs, unsigned int descbits)
+{
+	memset(descs, 0, _DESCS_COUNT(descbits) * sizeof(descs[0]));
+
+	rb->desc_ring.count_bits = descbits;
+	rb->desc_ring.descs = descs;
+	atomic_long_set(&rb->desc_ring.head_id, DESC0_ID(descbits));
+	atomic_long_set(&rb->desc_ring.tail_id, DESC0_ID(descbits));
+
+	rb->text_data_ring.size_bits = textbits;
+	rb->text_data_ring.data = text_buf;
+	atomic_long_set(&rb->text_data_ring.head_lpos, BLK0_LPOS(textbits));
+	atomic_long_set(&rb->text_data_ring.tail_lpos, BLK0_LPOS(textbits));
+
+	rb->dict_data_ring.size_bits = dictbits;
+	rb->dict_data_ring.data = dict_buf;
+	atomic_long_set(&rb->dict_data_ring.head_lpos, BLK0_LPOS(dictbits));
+	atomic_long_set(&rb->dict_data_ring.tail_lpos, BLK0_LPOS(dictbits));
+
+	atomic_long_set(&rb->fail, 0);
+
+	descs[0].info.seq = -(u64)_DESCS_COUNT(descbits);
+
+	descs[_DESCS_COUNT(descbits) - 1].info.seq = 0;
+	atomic_long_set(&(descs[_DESCS_COUNT(descbits) - 1].state_var),
+			DESC0_SV(descbits));
+	descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.begin = INVALID_LPOS;
+	descs[_DESCS_COUNT(descbits) - 1].text_blk_lpos.next = INVALID_LPOS;
+	descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.begin = INVALID_LPOS;
+	descs[_DESCS_COUNT(descbits) - 1].dict_blk_lpos.next = INVALID_LPOS;
+}
+EXPORT_SYMBOL(prb_init);
+
+/**
+ * prb_record_text_space() - Query the full actual used ringbuffer space for
+ *                           the text data of a reserved entry.
+ *
+ * @e: The successfully reserved entry to query.
+ *
+ * This is the public function available to writers to see how much actual
+ * space is used in the ringbuffer to store the specified entry.
+ *
+ * This function is only valid if an entry @e has been successfully reserved
+ * using prb_reserve().
+ *
+ * Context: Any context.
+ * Return: The size in bytes used by the associated record.
+ */
+unsigned int prb_record_text_space(struct prb_reserved_entry *e)
+{
+	return e->text_space;
+}
+EXPORT_SYMBOL(prb_record_text_space);
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
new file mode 100644
index 000000000000..4dc428427e7f
--- /dev/null
+++ b/kernel/printk/printk_ringbuffer.h
@@ -0,0 +1,328 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _KERNEL_PRINTK_RINGBUFFER_H
+#define _KERNEL_PRINTK_RINGBUFFER_H
+
+#include <linux/atomic.h>
+
+struct printk_info {
+	u64	seq;		/* sequence number */
+	u64	ts_nsec;	/* timestamp in nanoseconds */
+	u16	text_len;	/* length of text message */
+	u16	dict_len;	/* length of dictionary message */
+	u8	facility;	/* syslog facility */
+	u8	flags:5;	/* internal record flags */
+	u8	level:3;	/* syslog level */
+	u32	caller_id;	/* thread id or processor id */
+};
+
+/*
+ * A structure providing the buffers, used by writers and readers.
+ *
+ * Writers:
+ * The writer sets @text_buf_size and @dict_buf_size before calling
+ * prb_reserve(). On success, prb_reserve() sets @info, @text_buf, @dict_buf.
+ *
+ * Readers:
+ * The reader sets all fields before calling prb_read_valid(). Note that
+ * the reader provides the @info, @text_buf, @dict_buf buffers. On success,
+ * the struct pointed to by @info will be filled and the char arrays pointed
+ * to by @text_buf and @dict_buf will be filled with text and dict data.
+ * If @text_line_count is provided, the number of lines in @text_buf will
+ * be counted.
+ */
+struct printk_record {
+	struct printk_info	*info;
+	char			*text_buf;
+	char			*dict_buf;
+	unsigned int		text_buf_size;
+	unsigned int		dict_buf_size;
+	unsigned int		*text_line_count;
+};
+
+/* Specifies the position/span of a data block. */
+struct prb_data_blk_lpos {
+	unsigned long	begin;
+	unsigned long	next;
+};
+
+/* A descriptor: the complete meta-data for a record. */
+struct prb_desc {
+	struct printk_info		info;
+	atomic_long_t			state_var;
+	struct prb_data_blk_lpos	text_blk_lpos;
+	struct prb_data_blk_lpos	dict_blk_lpos;
+};
+
+/* A ringbuffer of "struct prb_data_block + data" elements. */
+struct prb_data_ring {
+	unsigned int	size_bits;
+	char		*data;
+	atomic_long_t	head_lpos;
+	atomic_long_t	tail_lpos;
+};
+
+/* A ringbuffer of "struct prb_desc" elements. */
+struct prb_desc_ring {
+	unsigned int		count_bits;
+	struct prb_desc		*descs;
+	atomic_long_t		head_id;
+	atomic_long_t		tail_id;
+};
+
+/* The high level structure representing the printk ringbuffer. */
+struct printk_ringbuffer {
+	struct prb_desc_ring	desc_ring;
+	struct prb_data_ring	text_data_ring;
+	struct prb_data_ring	dict_data_ring;
+	atomic_long_t		fail;
+};
+
+/* Used by writers as a reserve/commit handle. */
+struct prb_reserved_entry {
+	struct printk_ringbuffer	*rb;
+	unsigned long			irqflags;
+	unsigned long			id;
+	unsigned int			text_space;
+};
+
+#define _DATA_SIZE(sz_bits)		(1UL << (sz_bits))
+#define _DESCS_COUNT(ct_bits)		(1U << (ct_bits))
+#define DESC_SV_BITS			(sizeof(unsigned long) * 8)
+#define DESC_COMMITTED_MASK		(1UL << (DESC_SV_BITS - 1))
+#define DESC_REUSE_MASK			(1UL << (DESC_SV_BITS - 2))
+#define DESC_FLAGS_MASK			(DESC_COMMITTED_MASK | DESC_REUSE_MASK)
+#define DESC_ID_MASK			(~DESC_FLAGS_MASK)
+#define DESC_ID(sv)			((sv) & DESC_ID_MASK)
+#define INVALID_LPOS			1
+
+#define INVALID_BLK_LPOS	\
+	{			\
+		.begin	= INVALID_LPOS,	\
+		.next	= INVALID_LPOS,	\
+	}
+
+/*
+ * Descriptor Bootstrap
+ *
+ * The descriptor array is minimally initialized to allow immediate usage
+ * by readers and writers. The requirements that the descriptor array
+ * initialization must satisfy:
+ *
+ * Req1: The tail must point to an existing (committed or reusable)
+ *       descriptor. This is required by the implementation of
+ *       prb_first_seq().
+ *
+ * Req2: Readers must see that the ringbuffer is initially empty.
+ *
+ * Req3: The first record reserved by a writer is assigned sequence number 0.
+ *
+ * To satisfy Req1, the tail points to a descriptor that is minimally
+ * initialized (having no data block, i.e. data block's lpos @begin and @next
+ * values are set to INVALID_LPOS).
+ *
+ * To satisfy Req2, the tail descriptor is initialized to the reusable state.
+ * Readers recognize reusable descriptors as existing records, but skip over
+ * them.
+ *
+ * To satisfy Req3, the last descriptor in the array is used as the initial
+ * head (and tail) descriptor. This allows the first record reserved by a
+ * writer (head + 1) to be the first descriptor in the array. (Only the first
+ * descriptor in the array could have a valid sequence number of 0.)
+ *
+ * The first time a descriptor is reserved, it is assigned a sequence number
+ * with the value of the array index. A "first time reserved" descriptor can
+ * be recognized because it has a sequence number of 0 even though it does not
+ * have an index of 0. (Only the first descriptor in the array could have a
+ * valid sequence number of 0.) After the first reservation, all future
+ * reservations simply involve incrementing the sequence number by the array
+ * count.
+ *
+ * Hack #1:
+ * The first descriptor in the array is allowed to have a sequence number 0.
+ * In this case it is not possible to recognize if it is being reserved the
+ * first time (set to index value) or has been reserved previously (increment
+ * by the array count). This is handled by _always_ incrementing the
+ * sequence number when reserving the first descriptor in the array. So in
+ * order to satisfy Req3, the sequence number of the first descriptor in the
+ * array is initialized to minus the array count. Then, upon the first
+ * reservation, it is incremented to 0.
+ *
+ * Hack #2:
+ * prb_first_seq() can be called at any time by readers to retrieve the
+ * sequence number of the tail descriptor. However, due to Req2 and Req3,
+ * initially there are no records to report the sequence number of (sequence
+ * numbers are u64 and there is nothing less than 0). To handle this, the
+ * sequence number of the tail descriptor is initialized to 0. Technically
+ * this is incorrect, because there is no record with sequence number 0 (yet)
+ * and the tail descriptor is not the first descriptor in the array. But it
+ * allows prb_read_valid() to correctly report that the record is
+ * non-existent for any given sequence number. Bootstrapping is complete when
+ * the tail is pushed the first time, thus finally pointing to the first
+ * descriptor reserved by a writer, which has the assigned sequence number 0.
+ */
+
+/*
+ * Initiating Logical Value Overflows
+ *
+ * Both logical position (lpos) and ID values can be mapped to array indexes
+ * but may experience overflows during the lifetime of the system. To ensure
+ * that printk_ringbuffer can handle the overflows for these types, initial
+ * values are chosen that map to the correct initial array indexes, but will
+ * result in overflows soon.
+ *
+ * BLK0_LPOS: The initial @head_lpos and @tail_lpos for data rings. It is at
+ *            index 0 and the lpos value is such that it will overflow on the
+ *            first wrap.
+ *
+ * DESC0_ID: The initial @head_id and @tail_id for the desc ring. It is at the
+ *           last index of the descriptor array and the ID value is such that
+ *           it will overflow on the second wrap.
+ */
+#define BLK0_LPOS(sz_bits)	(-(_DATA_SIZE(sz_bits)))
+#define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
+#define DESC0_SV(ct_bits)	(DESC_COMMITTED_MASK | DESC_REUSE_MASK | \
+					DESC0_ID(ct_bits))
+
+/*
+ * Declare a ringbuffer with an external text data buffer. The same as
+ * DECLARE_PRINTKRB() but allows specifying an external buffer for the
+ * text data.
+ *
+ * Note: The specified external buffer must be of the size:
+ *       2 ^ (descbits + avgtextbits)
+ */
+#define _DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits,	\
+			  text_buf)					\
+char _##name##_dict[1U << ((avgdictbits) + (descbits))]			\
+	__aligned(__alignof__(unsigned long));				\
+struct prb_desc _##name##_descs[_DESCS_COUNT(descbits)] = {		\
+	/* this will be the first record reserved by a writer */	\
+	[0] = {								\
+			.info = {					\
+				/* incremented to 0 by the first reservation */ \
+				.seq = -(u64)_DESCS_COUNT(descbits),	\
+				},					\
+		},							\
+	/* the initial head and tail */					\
+	[_DESCS_COUNT(descbits) - 1] = {				\
+			.info = {					\
+				/* reports the minimal seq value during bootstrap */ \
+				.seq = 0,				\
+				},					\
+			/* reusable */					\
+			.state_var	= ATOMIC_INIT(DESC0_SV(descbits)), \
+			/* no associated data block */			\
+			.text_blk_lpos	= INVALID_BLK_LPOS,		\
+			.dict_blk_lpos	= INVALID_BLK_LPOS,		\
+		},							\
+	};								\
+struct printk_ringbuffer name = {					\
+	.desc_ring = {							\
+		.count_bits	= descbits,				\
+		.descs		= &_##name##_descs[0],			\
+		.head_id	= ATOMIC_INIT(DESC0_ID(descbits)),	\
+		.tail_id	= ATOMIC_INIT(DESC0_ID(descbits)),	\
+	},								\
+	.text_data_ring = {						\
+		.size_bits	= (avgtextbits) + (descbits),		\
+		.data		= text_buf,				\
+		.head_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS(		\
+					(avgtextbits) + (descbits))),	\
+		.tail_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS(		\
+					(avgtextbits) + (descbits))),	\
+	},								\
+	.dict_data_ring = {						\
+		.size_bits	= (avgdictbits) + (descbits),		\
+		.data		= &_##name##_dict[0],			\
+		.head_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS(		\
+					(avgdictbits) + (descbits))),	\
+		.tail_lpos	= ATOMIC_LONG_INIT(BLK0_LPOS(		\
+					(avgdictbits) + (descbits))),	\
+	},								\
+	.fail			= ATOMIC_LONG_INIT(0),			\
+}
+
+/**
+ * DECLARE_PRINTKRB() - Declare a ringbuffer.
+ *
+ * @name:     The name of the ringbuffer variable.
+ * @descbits: The number of descriptors as a power-of-2 value.
+ * @avgtextbits: The average text data size per record as a power-of-2 value.
+ * @avgdictbits: The average dictionary data size per record as a
+ *               power-of-2 value.
+ *
+ * This is a macro for declaring a ringbuffer and all internal structures
+ * such that it is ready for immediate use. See _DECLARE_PRINTKRB() for a
+ * variant where the text data buffer can be specified externally.
+ */
+#define DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits)	\
+char _##name##_text[1U << ((avgtextbits) + (descbits))]			\
+	__aligned(__alignof__(unsigned long));				\
+_DECLARE_PRINTKRB(name, descbits, avgtextbits, avgdictbits,		\
+		  &_##name##_text[0])
+
+/**
+ * DECLARE_PRINTKRB_RECORD() - Declare a buffer for reading records.
+ *
+ * @name:     The name of the record variable.
+ * @buf_size: The size for the text and dictionary buffers.
+ *
+ * This macro declares a record buffer for use with prb_read_valid().
+ */
+#define DECLARE_PRINTKRB_RECORD(name, buf_size)		\
+struct printk_info _##name##_info;			\
+char _##name##_text_buf[buf_size];			\
+char _##name##_dict_buf[buf_size];			\
+struct printk_record name = {				\
+	.info		= &_##name##_info,		\
+	.text_buf	= &_##name##_text_buf[0],	\
+	.dict_buf	= &_##name##_dict_buf[0],	\
+	.text_buf_size	= buf_size,			\
+	.dict_buf_size	= buf_size,			\
+}
+
+/* Writer Interface */
+
+bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
+		 struct printk_record *r);
+void prb_commit(struct prb_reserved_entry *e);
+
+void prb_init(struct printk_ringbuffer *rb,
+	      char *text_buf, unsigned int text_buf_size,
+	      char *dict_buf, unsigned int dict_buf_size,
+	      struct prb_desc *descs, unsigned int descs_count_bits);
+unsigned int prb_record_text_space(struct prb_reserved_entry *e);
+
+/* Reader Interface */
+
+bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
+		    struct printk_record *r);
+
+u64 prb_first_seq(struct printk_ringbuffer *rb);
+u64 prb_next_seq(struct printk_ringbuffer *rb);
+
+/**
+ * prb_for_each_record() - Iterate over a ringbuffer.
+ *
+ * @from: The sequence number to begin with.
+ * @rb:   The ringbuffer to iterate over.
+ * @seq:  A u64 to store the sequence number on each iteration.
+ * @r:    A printk_record to store the record on each iteration.
+ *
+ * This is a macro for conveniently iterating over a ringbuffer.
+ *
+ * Context: Any context.
+ */
+#define prb_for_each_record(from, rb, seq, r)		\
+	for ((seq) = (from);				\
+	     prb_read_valid(rb, seq, r);		\
+	     (seq) = (r)->info->seq + 1)
+
+#endif /* _KERNEL_PRINTK_RINGBUFFER_H */
-- 
2.20.1



* [PATCH 2/2] printk: use the lockless ringbuffer
  2020-01-28 16:19 [PATCH 0/2] printk: replace ringbuffer John Ogness
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
@ 2020-01-28 16:19 ` John Ogness
  2020-02-13  9:07   ` Sergey Senozhatsky
                     ` (2 more replies)
  2020-02-05  4:25 ` [PATCH 0/2] printk: replace ringbuffer lijiang
  2020-02-06  9:21 ` lijiang
  3 siblings, 3 replies; 58+ messages in thread
From: John Ogness @ 2020-01-28 16:19 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change is purely replacing the underlying ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
  human-readable messages. The dictionary data buffer is set to the
  same size as the message buffer. Therefore, the total reserved
  memory for messages is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the
  initial static buffer and twice the size specified by the log_buf_len
  kernel parameter.

- Record meta-data is now stored in a separate array of descriptors.
  This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 6)) bytes
  for the static array and 72 * (2 ^ (log_buf_len - 6)) bytes for
  the dynamic array.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
---
 include/linux/kmsg_dump.h |   2 -
 kernel/printk/Makefile    |   1 +
 kernel/printk/printk.c    | 836 +++++++++++++++++++-------------------
 3 files changed, 416 insertions(+), 423 deletions(-)

diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
index 2e7a1e032c71..ae6265033e31 100644
--- a/include/linux/kmsg_dump.h
+++ b/include/linux/kmsg_dump.h
@@ -46,8 +46,6 @@ struct kmsg_dumper {
 	bool registered;
 
 	/* private state of the kmsg iterator */
-	u32 cur_idx;
-	u32 next_idx;
 	u64 cur_seq;
 	u64 next_seq;
 };
diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4d052fc6bcde..eee3dc9b60a9 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -2,3 +2,4 @@
 obj-y	= printk.o
 obj-$(CONFIG_PRINTK)	+= printk_safe.o
 obj-$(CONFIG_A11Y_BRAILLE_CONSOLE)	+= braille.o
+obj-$(CONFIG_PRINTK)	+= printk_ringbuffer.o
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1ef6f75d92f1..d0d24ee1d1f4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -56,6 +56,7 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/printk.h>
 
+#include "printk_ringbuffer.h"
 #include "console_cmdline.h"
 #include "braille.h"
 #include "internal.h"
@@ -294,30 +295,22 @@ enum con_msg_format_flags {
 static int console_msg_format = MSG_FORMAT_DEFAULT;
 
 /*
- * The printk log buffer consists of a chain of concatenated variable
- * length records. Every record starts with a record header, containing
- * the overall length of the record.
+ * The printk log buffer consists of a sequenced collection of records, each
+ * containing a variable length message and dictionary text. Every record
+ * also contains its own meta-data (@info).
  *
- * The heads to the first and last entry in the buffer, as well as the
- * sequence numbers of these entries are maintained when messages are
- * stored.
- *
- * If the heads indicate available messages, the length in the header
- * tells the start next message. A length == 0 for the next message
- * indicates a wrap-around to the beginning of the buffer.
- *
- * Every record carries the monotonic timestamp in microseconds, as well as
- * the standard userspace syslog level and syslog facility. The usual
+ * The meta-data of every record carries the monotonic timestamp in
+ * microseconds, as well as the standard userspace syslog level and syslog facility. The usual
  * kernel messages use LOG_KERN; userspace-injected messages always carry
  * a matching syslog facility, by default LOG_USER. The origin of every
  * message can be reliably determined that way.
  *
- * The human readable log message directly follows the message header. The
- * length of the message text is stored in the header, the stored message
- * is not terminated.
+ * The human readable log message of a record is available in @text, the length
+ * of the message text in @text_len. The stored message is not terminated.
  *
- * Optionally, a message can carry a dictionary of properties (key/value pairs),
- * to provide userspace with a machine-readable message context.
+ * Optionally, a record can carry a dictionary of properties (key/value pairs),
+ * to provide userspace with a machine-readable message context. The length of
+ * the dictionary is available in @dict_len. The dictionary is not terminated.
  *
  * Examples for well-defined, commonly used property names are:
  *   DEVICE=b12:8               device identifier
@@ -331,21 +324,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
  * follows directly after a '=' character. Every property is terminated by
  * a '\0' character. The last property is not terminated.
  *
- * Example of a message structure:
- *   0000  ff 8f 00 00 00 00 00 00      monotonic time in nsec
- *   0008  34 00                        record is 52 bytes long
- *   000a        0b 00                  text is 11 bytes long
- *   000c              1f 00            dictionary is 23 bytes long
- *   000e                    03 00      LOG_KERN (facility) LOG_ERR (level)
- *   0010  69 74 27 73 20 61 20 6c      "it's a l"
- *         69 6e 65                     "ine"
- *   001b           44 45 56 49 43      "DEVIC"
- *         45 3d 62 38 3a 32 00 44      "E=b8:2\0D"
- *         52 49 56 45 52 3d 62 75      "RIVER=bu"
- *         67                           "g"
- *   0032     00 00 00                  padding to next message header
- *
- * The 'struct printk_log' buffer header must never be directly exported to
+ * Example of record values:
+ *   record.text_buf       = "it's a line" (unterminated)
+ *   record.dict_buf       = "DEVICE=b8:2\0DRIVER=bug" (unterminated)
+ *   record.info.seq       = 56
+ *   record.info.ts_sec    = 36863
+ *   record.info.text_len  = 11
+ *   record.info.dict_len  = 22
+ *   record.info.facility  = 0 (LOG_KERN)
+ *   record.info.flags     = 0
+ *   record.info.level     = 3 (LOG_ERR)
+ *   record.info.caller_id = 299 (task 299)
+ *
+ * The 'struct printk_info' buffer must never be directly exported to
  * userspace, it is a kernel-private implementation detail that might
  * need to be changed in the future, when the requirements change.
  *
@@ -365,23 +356,6 @@ enum log_flags {
 	LOG_CONT	= 8,	/* text is a fragment of a continuation line */
 };
 
-struct printk_log {
-	u64 ts_nsec;		/* timestamp in nanoseconds */
-	u16 len;		/* length of entire record */
-	u16 text_len;		/* length of text buffer */
-	u16 dict_len;		/* length of dictionary buffer */
-	u8 facility;		/* syslog facility */
-	u8 flags:5;		/* internal record flags */
-	u8 level:3;		/* syslog level */
-#ifdef CONFIG_PRINTK_CALLER
-	u32 caller_id;            /* thread id or processor id */
-#endif
-}
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-__packed __aligned(4)
-#endif
-;
-
 /*
  * The logbuf_lock protects kmsg buffer, indices, counters.  This can be taken
  * within the scheduler's rq lock. It must be released before calling
@@ -421,26 +395,17 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
 DECLARE_WAIT_QUEUE_HEAD(log_wait);
 /* the next printk record to read by syslog(READ) or /proc/kmsg */
 static u64 syslog_seq;
-static u32 syslog_idx;
 static size_t syslog_partial;
 static bool syslog_time;
-
-/* index and sequence number of the first record stored in the buffer */
-static u64 log_first_seq;
-static u32 log_first_idx;
-
-/* index and sequence number of the next record to store in the buffer */
-static u64 log_next_seq;
-static u32 log_next_idx;
+DECLARE_PRINTKRB_RECORD(syslog_record, CONSOLE_EXT_LOG_MAX);
 
 /* the next printk record to write to the console */
 static u64 console_seq;
-static u32 console_idx;
 static u64 exclusive_console_stop_seq;
+DECLARE_PRINTKRB_RECORD(console_record, CONSOLE_EXT_LOG_MAX);
 
 /* the next printk record to read after the last 'clear' command */
 static u64 clear_seq;
-static u32 clear_idx;
 
 #ifdef CONFIG_PRINTK_CALLER
 #define PREFIX_MAX		48
@@ -453,13 +418,28 @@ static u32 clear_idx;
 #define LOG_FACILITY(v)		((v) >> 3 & 0xff)
 
 /* record buffer */
-#define LOG_ALIGN __alignof__(struct printk_log)
+#define LOG_ALIGN __alignof__(unsigned long)
 #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
 #define LOG_BUF_LEN_MAX (u32)(1 << 31)
 static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
 static char *log_buf = __log_buf;
 static u32 log_buf_len = __LOG_BUF_LEN;
 
+/*
+ * Define the average message size. This only affects the number of
+ * descriptors that will be available. Underestimating is better than
+ * overestimating (too many available descriptors is better than not enough).
+ * The dictionary buffer will be the same size as the text buffer.
+ */
+#define PRB_AVGBITS 6
+
+_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
+		  PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
+
+static struct printk_ringbuffer printk_rb_dynamic;
+
+static struct printk_ringbuffer *prb = &printk_rb_static;
+
 /* Return log buffer address */
 char *log_buf_addr_get(void)
 {
@@ -472,108 +452,6 @@ u32 log_buf_len_get(void)
 	return log_buf_len;
 }
 
-/* human readable text of the record */
-static char *log_text(const struct printk_log *msg)
-{
-	return (char *)msg + sizeof(struct printk_log);
-}
-
-/* optional key/value pair dictionary attached to the record */
-static char *log_dict(const struct printk_log *msg)
-{
-	return (char *)msg + sizeof(struct printk_log) + msg->text_len;
-}
-
-/* get record by index; idx must point to valid msg */
-static struct printk_log *log_from_idx(u32 idx)
-{
-	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
-	/*
-	 * A length == 0 record is the end of buffer marker. Wrap around and
-	 * read the message at the start of the buffer.
-	 */
-	if (!msg->len)
-		return (struct printk_log *)log_buf;
-	return msg;
-}
-
-/* get next record; idx must point to valid msg */
-static u32 log_next(u32 idx)
-{
-	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
-
-	/* length == 0 indicates the end of the buffer; wrap */
-	/*
-	 * A length == 0 record is the end of buffer marker. Wrap around and
-	 * read the message at the start of the buffer as *this* one, and
-	 * return the one after that.
-	 */
-	if (!msg->len) {
-		msg = (struct printk_log *)log_buf;
-		return msg->len;
-	}
-	return idx + msg->len;
-}
-
-/*
- * Check whether there is enough free space for the given message.
- *
- * The same values of first_idx and next_idx mean that the buffer
- * is either empty or full.
- *
- * If the buffer is empty, we must respect the position of the indexes.
- * They cannot be reset to the beginning of the buffer.
- */
-static int logbuf_has_space(u32 msg_size, bool empty)
-{
-	u32 free;
-
-	if (log_next_idx > log_first_idx || empty)
-		free = max(log_buf_len - log_next_idx, log_first_idx);
-	else
-		free = log_first_idx - log_next_idx;
-
-	/*
-	 * We need space also for an empty header that signalizes wrapping
-	 * of the buffer.
-	 */
-	return free >= msg_size + sizeof(struct printk_log);
-}
-
-static int log_make_free_space(u32 msg_size)
-{
-	while (log_first_seq < log_next_seq &&
-	       !logbuf_has_space(msg_size, false)) {
-		/* drop old messages until we have enough contiguous space */
-		log_first_idx = log_next(log_first_idx);
-		log_first_seq++;
-	}
-
-	if (clear_seq < log_first_seq) {
-		clear_seq = log_first_seq;
-		clear_idx = log_first_idx;
-	}
-
-	/* sequence numbers are equal, so the log buffer is empty */
-	if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
-		return 0;
-
-	return -ENOMEM;
-}
-
-/* compute the message size including the padding bytes */
-static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
-{
-	u32 size;
-
-	size = sizeof(struct printk_log) + text_len + dict_len;
-	*pad_len = (-size) & (LOG_ALIGN - 1);
-	size += *pad_len;
-
-	return size;
-}
-
 /*
  * Define how much of the log buffer we could take at maximum. The value
  * must be greater than two. Note that only half of the buffer is available
@@ -582,22 +460,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
 #define MAX_LOG_TAKE_PART 4
 static const char trunc_msg[] = "<truncated>";
 
-static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
-			u16 *dict_len, u32 *pad_len)
+static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
 {
 	/*
 	 * The message should not take the whole buffer. Otherwise, it might
 	 * get removed too soon.
 	 */
 	u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
+
 	if (*text_len > max_text_len)
 		*text_len = max_text_len;
-	/* enable the warning message */
+
+	/* enable the warning message (if there is room) */
 	*trunc_msg_len = strlen(trunc_msg);
+	if (*text_len >= *trunc_msg_len)
+		*text_len -= *trunc_msg_len;
+	else
+		*trunc_msg_len = 0;
+
 	/* disable the "dict" completely */
 	*dict_len = 0;
-	/* compute the size again, count also the warning message */
-	return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
 }
 
 /* insert record into the buffer, discard old ones, update heads */
@@ -606,60 +488,42 @@ static int log_store(u32 caller_id, int facility, int level,
 		     const char *dict, u16 dict_len,
 		     const char *text, u16 text_len)
 {
-	struct printk_log *msg;
-	u32 size, pad_len;
+	struct prb_reserved_entry e;
+	struct printk_record r;
 	u16 trunc_msg_len = 0;
 
-	/* number of '\0' padding bytes to next message */
-	size = msg_used_size(text_len, dict_len, &pad_len);
+	r.text_buf_size = text_len;
+	r.dict_buf_size = dict_len;
 
-	if (log_make_free_space(size)) {
+	if (!prb_reserve(&e, prb, &r)) {
 		/* truncate the message if it is too long for empty buffer */
-		size = truncate_msg(&text_len, &trunc_msg_len,
-				    &dict_len, &pad_len);
+		truncate_msg(&text_len, &trunc_msg_len, &dict_len);
+		r.text_buf_size = text_len + trunc_msg_len;
+		r.dict_buf_size = dict_len;
 		/* survive when the log buffer is too small for trunc_msg */
-		if (log_make_free_space(size))
+		if (!prb_reserve(&e, prb, &r))
 			return 0;
 	}
 
-	if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
-		/*
-		 * This message + an additional empty header does not fit
-		 * at the end of the buffer. Add an empty header with len == 0
-		 * to signify a wrap around.
-		 */
-		memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
-		log_next_idx = 0;
-	}
-
 	/* fill message */
-	msg = (struct printk_log *)(log_buf + log_next_idx);
-	memcpy(log_text(msg), text, text_len);
-	msg->text_len = text_len;
-	if (trunc_msg_len) {
-		memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
-		msg->text_len += trunc_msg_len;
-	}
-	memcpy(log_dict(msg), dict, dict_len);
-	msg->dict_len = dict_len;
-	msg->facility = facility;
-	msg->level = level & 7;
-	msg->flags = flags & 0x1f;
+	memcpy(&r.text_buf[0], text, text_len);
+	if (trunc_msg_len)
+		memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
+	if (r.dict_buf)
+		memcpy(&r.dict_buf[0], dict, dict_len);
+	r.info->facility = facility;
+	r.info->level = level & 7;
+	r.info->flags = flags & 0x1f;
 	if (ts_nsec > 0)
-		msg->ts_nsec = ts_nsec;
+		r.info->ts_nsec = ts_nsec;
 	else
-		msg->ts_nsec = local_clock();
-#ifdef CONFIG_PRINTK_CALLER
-	msg->caller_id = caller_id;
-#endif
-	memset(log_dict(msg) + dict_len, 0, pad_len);
-	msg->len = size;
+		r.info->ts_nsec = local_clock();
+	r.info->caller_id = caller_id;
 
 	/* insert message */
-	log_next_idx += msg->len;
-	log_next_seq++;
+	prb_commit(&e);
 
-	return msg->text_len;
+	return text_len;
 }
 
 int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
@@ -711,13 +575,13 @@ static void append_char(char **pp, char *e, char c)
 		*(*pp)++ = c;
 }
 
-static ssize_t msg_print_ext_header(char *buf, size_t size,
-				    struct printk_log *msg, u64 seq)
+static ssize_t info_print_ext_header(char *buf, size_t size,
+				     struct printk_info *info)
 {
-	u64 ts_usec = msg->ts_nsec;
+	u64 ts_usec = info->ts_nsec;
 	char caller[20];
 #ifdef CONFIG_PRINTK_CALLER
-	u32 id = msg->caller_id;
+	u32 id = info->caller_id;
 
 	snprintf(caller, sizeof(caller), ",caller=%c%u",
 		 id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
@@ -728,8 +592,8 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
 	do_div(ts_usec, 1000);
 
 	return scnprintf(buf, size, "%u,%llu,%llu,%c%s;",
-			 (msg->facility << 3) | msg->level, seq, ts_usec,
-			 msg->flags & LOG_CONT ? 'c' : '-', caller);
+			 (info->facility << 3) | info->level, info->seq,
+			 ts_usec, info->flags & LOG_CONT ? 'c' : '-', caller);
 }
 
 static ssize_t msg_print_ext_body(char *buf, size_t size,
@@ -783,10 +647,14 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
 /* /dev/kmsg - userspace message inject/listen interface */
 struct devkmsg_user {
 	u64 seq;
-	u32 idx;
 	struct ratelimit_state rs;
 	struct mutex lock;
 	char buf[CONSOLE_EXT_LOG_MAX];
+
+	struct printk_info info;
+	char text_buf[CONSOLE_EXT_LOG_MAX];
+	char dict_buf[CONSOLE_EXT_LOG_MAX];
+	struct printk_record record;
 };
 
 static __printf(3, 4) __cold
@@ -869,7 +737,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 			    size_t count, loff_t *ppos)
 {
 	struct devkmsg_user *user = file->private_data;
-	struct printk_log *msg;
+	struct printk_record *r = &user->record;
 	size_t len;
 	ssize_t ret;
 
@@ -881,7 +749,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 		return ret;
 
 	logbuf_lock_irq();
-	while (user->seq == log_next_seq) {
+	if (!prb_read_valid(prb, user->seq, r)) {
 		if (file->f_flags & O_NONBLOCK) {
 			ret = -EAGAIN;
 			logbuf_unlock_irq();
@@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
 
 		logbuf_unlock_irq();
 		ret = wait_event_interruptible(log_wait,
-					       user->seq != log_next_seq);
+					prb_read_valid(prb, user->seq, r));
 		if (ret)
 			goto out;
 		logbuf_lock_irq();
 	}
 
-	if (user->seq < log_first_seq) {
-		/* our last seen message is gone, return error and reset */
-		user->idx = log_first_idx;
-		user->seq = log_first_seq;
+	if (user->seq < r->info->seq) {
+		/* the expected message is gone, return error and reset */
+		user->seq = r->info->seq;
 		ret = -EPIPE;
 		logbuf_unlock_irq();
 		goto out;
 	}
 
-	msg = log_from_idx(user->idx);
-	len = msg_print_ext_header(user->buf, sizeof(user->buf),
-				   msg, user->seq);
+	len = info_print_ext_header(user->buf, sizeof(user->buf), r->info);
 	len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
-				  log_dict(msg), msg->dict_len,
-				  log_text(msg), msg->text_len);
+				  &r->dict_buf[0], r->info->dict_len,
+				  &r->text_buf[0], r->info->text_len);
 
-	user->idx = log_next(user->idx);
-	user->seq++;
+	user->seq = r->info->seq + 1;
 	logbuf_unlock_irq();
 
 	if (len > count) {
@@ -945,8 +809,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 	switch (whence) {
 	case SEEK_SET:
 		/* the first record */
-		user->idx = log_first_idx;
-		user->seq = log_first_seq;
+		user->seq = prb_first_seq(prb);
 		break;
 	case SEEK_DATA:
 		/*
@@ -954,13 +817,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
 		 * like issued by 'dmesg -c'. Reading /dev/kmsg itself
 		 * changes no global state, and does not clear anything.
 		 */
-		user->idx = clear_idx;
 		user->seq = clear_seq;
 		break;
 	case SEEK_END:
 		/* after the last record */
-		user->idx = log_next_idx;
-		user->seq = log_next_seq;
+		user->seq = prb_next_seq(prb);
 		break;
 	default:
 		ret = -EINVAL;
@@ -980,9 +841,9 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
 	poll_wait(file, &log_wait, wait);
 
 	logbuf_lock_irq();
-	if (user->seq < log_next_seq) {
+	if (prb_read_valid(prb, user->seq, NULL)) {
 		/* return error when data has vanished underneath us */
-		if (user->seq < log_first_seq)
+		if (user->seq < prb_first_seq(prb))
 			ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
 		else
 			ret = EPOLLIN|EPOLLRDNORM;
@@ -1017,9 +878,14 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 
 	mutex_init(&user->lock);
 
+	user->record.info = &user->info;
+	user->record.text_buf = &user->text_buf[0];
+	user->record.text_buf_size = sizeof(user->text_buf);
+	user->record.dict_buf = &user->dict_buf[0];
+	user->record.dict_buf_size = sizeof(user->dict_buf);
+
 	logbuf_lock_irq();
-	user->idx = log_first_idx;
-	user->seq = log_first_seq;
+	user->seq = prb_first_seq(prb);
 	logbuf_unlock_irq();
 
 	file->private_data = user;
@@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
 {
 	VMCOREINFO_SYMBOL(log_buf);
 	VMCOREINFO_SYMBOL(log_buf_len);
-	VMCOREINFO_SYMBOL(log_first_idx);
-	VMCOREINFO_SYMBOL(clear_idx);
-	VMCOREINFO_SYMBOL(log_next_idx);
 	/*
-	 * Export struct printk_log size and field offsets. User space tools can
-	 * parse it and detect any changes to structure down the line.
+	 * Export struct printk_info size and field offsets. User space tools
+	 * can parse it and detect any changes to structure down the line.
 	 */
-	VMCOREINFO_STRUCT_SIZE(printk_log);
-	VMCOREINFO_OFFSET(printk_log, ts_nsec);
-	VMCOREINFO_OFFSET(printk_log, len);
-	VMCOREINFO_OFFSET(printk_log, text_len);
-	VMCOREINFO_OFFSET(printk_log, dict_len);
-#ifdef CONFIG_PRINTK_CALLER
-	VMCOREINFO_OFFSET(printk_log, caller_id);
-#endif
+	VMCOREINFO_STRUCT_SIZE(printk_info);
+	VMCOREINFO_OFFSET(printk_info, seq);
+	VMCOREINFO_OFFSET(printk_info, ts_nsec);
+	VMCOREINFO_OFFSET(printk_info, text_len);
+	VMCOREINFO_OFFSET(printk_info, dict_len);
+	VMCOREINFO_OFFSET(printk_info, caller_id);
 }
 #endif
 
@@ -1146,11 +1007,55 @@ static void __init log_buf_add_cpu(void)
 static inline void log_buf_add_cpu(void) {}
 #endif /* CONFIG_SMP */
 
+static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
+				     struct printk_record *r)
+{
+	struct printk_info info;
+	struct printk_record dest_r = {
+		.info = &info,
+		.text_buf_size = r->info->text_len,
+		.dict_buf_size = r->info->dict_len,
+	};
+	struct prb_reserved_entry e;
+
+	if (!prb_reserve(&e, rb, &dest_r))
+		return 0;
+
+	memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
+	if (dest_r.dict_buf) {
+		memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
+		       dest_r.dict_buf_size);
+	}
+	dest_r.info->facility = r->info->facility;
+	dest_r.info->level = r->info->level;
+	dest_r.info->flags = r->info->flags;
+	dest_r.info->ts_nsec = r->info->ts_nsec;
+	dest_r.info->caller_id = r->info->caller_id;
+
+	prb_commit(&e);
+
+	return prb_record_text_space(&e);
+}
+
+static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
+static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
+
 void __init setup_log_buf(int early)
 {
+	struct prb_desc *new_descs;
+	struct printk_info info;
+	struct printk_record r = {
+		.info = &info,
+		.text_buf = &setup_text_buf[0],
+		.dict_buf = &setup_dict_buf[0],
+		.text_buf_size = sizeof(setup_text_buf),
+		.dict_buf_size = sizeof(setup_dict_buf),
+	};
 	unsigned long flags;
+	char *new_dict_buf;
 	char *new_log_buf;
 	unsigned int free;
+	u64 seq;
 
 	if (log_buf != __log_buf)
 		return;
@@ -1163,17 +1068,46 @@ void __init setup_log_buf(int early)
 
 	new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
 	if (unlikely(!new_log_buf)) {
-		pr_err("log_buf_len: %lu bytes not available\n",
+		pr_err("log_buf_len: %lu text bytes not available\n",
 			new_log_buf_len);
 		return;
 	}
 
+	new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
+	if (unlikely(!new_dict_buf)) {
+		/* dictionary failure is allowed */
+		pr_err("log_buf_len: %lu dict bytes not available\n",
+			new_log_buf_len);
+	}
+
+	new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
+				   sizeof(struct prb_desc), LOG_ALIGN);
+	if (unlikely(!new_descs)) {
+		pr_err("log_buf_len: %lu desc bytes not available\n",
+			new_log_buf_len >> PRB_AVGBITS);
+		if (new_dict_buf)
+			memblock_free(__pa(new_dict_buf), new_log_buf_len);
+		memblock_free(__pa(new_log_buf), new_log_buf_len);
+		return;
+	}
+
 	logbuf_lock_irqsave(flags);
+
+	prb_init(&printk_rb_dynamic,
+		 new_log_buf, bits_per(new_log_buf_len) - 1,
+		 new_dict_buf, bits_per(new_log_buf_len) - 1,
+		 new_descs, (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS);
+
 	log_buf_len = new_log_buf_len;
 	log_buf = new_log_buf;
 	new_log_buf_len = 0;
-	free = __LOG_BUF_LEN - log_next_idx;
-	memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
+
+	free = __LOG_BUF_LEN;
+	prb_for_each_record(0, &printk_rb_static, seq, &r)
+		free -= add_to_rb(&printk_rb_dynamic, &r);
+
+	prb = &printk_rb_dynamic;
+
 	logbuf_unlock_irqrestore(flags);
 
 	pr_info("log_buf_len: %u bytes\n", log_buf_len);
@@ -1285,18 +1219,18 @@ static size_t print_caller(u32 id, char *buf)
 #define print_caller(id, buf) 0
 #endif
 
-static size_t print_prefix(const struct printk_log *msg, bool syslog,
-			   bool time, char *buf)
+static size_t info_print_prefix(const struct printk_info *info, bool syslog,
+				bool time, char *buf)
 {
 	size_t len = 0;
 
 	if (syslog)
-		len = print_syslog((msg->facility << 3) | msg->level, buf);
+		len = print_syslog((info->facility << 3) | info->level, buf);
 
 	if (time)
-		len += print_time(msg->ts_nsec, buf + len);
+		len += print_time(info->ts_nsec, buf + len);
 
-	len += print_caller(msg->caller_id, buf + len);
+	len += print_caller(info->caller_id, buf + len);
 
 	if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) {
 		buf[len++] = ' ';
@@ -1306,14 +1240,15 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
 	return len;
 }
 
-static size_t msg_print_text(const struct printk_log *msg, bool syslog,
-			     bool time, char *buf, size_t size)
+static size_t record_print_text(const struct printk_record *r, bool syslog,
+				bool time, char *buf, size_t size)
 {
-	const char *text = log_text(msg);
-	size_t text_size = msg->text_len;
+	const char *text = &r->text_buf[0];
+	size_t text_size = r->info->text_len;
 	size_t len = 0;
 	char prefix[PREFIX_MAX];
-	const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
+	const size_t prefix_len = info_print_prefix(r->info, syslog, time,
+						    prefix);
 
 	do {
 		const char *next = memchr(text, '\n', text_size);
@@ -1347,10 +1282,94 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog,
 	return len;
 }
 
+static size_t record_print_text_inline(struct printk_record *r, bool syslog,
+				       bool time)
+{
+	size_t text_len = r->info->text_len;
+	size_t buf_size = r->text_buf_size;
+	char *text = r->text_buf;
+	char prefix[PREFIX_MAX];
+	bool truncated = false;
+	size_t prefix_len;
+	size_t len = 0;
+
+	prefix_len = info_print_prefix(r->info, syslog, time, prefix);
+
+	if (!text) {
+		/* SYSLOG_ACTION_* buffer size only calculation */
+		unsigned int line_count = 1;
+
+		if (r->text_line_count)
+			line_count = *(r->text_line_count);
+		/*
+		 * Each line will be preceded with a prefix. The intermediate
+		 * newlines are already within the text, but a final trailing
+		 * newline will be added.
+		 */
+		return ((prefix_len * line_count) + r->info->text_len + 1);
+	}
+
+	/*
+	 * Add the prefix for each line by shifting the rest of the text to
+	 * make room for the prefix. If the buffer is not large enough for all
+	 * the prefixes, then drop the trailing text and report the largest
+	 * length that includes full lines with their prefixes.
+	 */
+	while (text_len) {
+		size_t line_len;
+		char *next;
+
+		next = memchr(text, '\n', text_len);
+		if (next) {
+			line_len = next - text;
+		} else {
+			/*
+			 * If the text has been truncated, assume this line
+			 * was truncated and do not include this text.
+			 */
+			if (truncated)
+				break;
+			line_len = text_len;
+		}
+
+		/*
+		 * Is there enough buffer available to shift this line
+		 * (and add a newline at the end)?
+		 */
+		if (len + prefix_len + line_len >= buf_size)
+			break;
+
+		/*
+		 * Is there enough buffer available to shift all remaining
+		 * text (and add a newline at the end)?
+		 */
+		if (len + prefix_len + text_len >= buf_size) {
+			text_len = (buf_size - len) - prefix_len;
+			truncated = true;
+		}
+
+		memmove(text + prefix_len, text, text_len);
+		memcpy(text, prefix, prefix_len);
+
+		text += prefix_len + line_len;
+		text_len -= line_len;
+
+		if (text_len) {
+			text_len--;
+			text++;
+		} else {
+			*text = '\n';
+		}
+
+		len += prefix_len + line_len + 1;
+	}
+
+	return len;
+}
+
 static int syslog_print(char __user *buf, int size)
 {
 	char *text;
-	struct printk_log *msg;
 	int len = 0;
 
 	text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
@@ -1362,16 +1381,15 @@ static int syslog_print(char __user *buf, int size)
 		size_t skip;
 
 		logbuf_lock_irq();
-		if (syslog_seq < log_first_seq) {
-			/* messages are gone, move to first one */
-			syslog_seq = log_first_seq;
-			syslog_idx = log_first_idx;
-			syslog_partial = 0;
-		}
-		if (syslog_seq == log_next_seq) {
+		if (!prb_read_valid(prb, syslog_seq, &syslog_record)) {
 			logbuf_unlock_irq();
 			break;
 		}
+		if (syslog_record.info->seq != syslog_seq) {
+			/* messages are gone, move to first one */
+			syslog_seq = syslog_record.info->seq;
+			syslog_partial = 0;
+		}
 
 		/*
 		 * To keep reading/counting partial line consistent,
@@ -1381,13 +1399,11 @@ static int syslog_print(char __user *buf, int size)
 			syslog_time = printk_time;
 
 		skip = syslog_partial;
-		msg = log_from_idx(syslog_idx);
-		n = msg_print_text(msg, true, syslog_time, text,
-				   LOG_LINE_MAX + PREFIX_MAX);
+		n = record_print_text(&syslog_record, true, syslog_time, text,
+				      LOG_LINE_MAX + PREFIX_MAX);
 		if (n - syslog_partial <= size) {
 			/* message fits into buffer, move forward */
-			syslog_idx = log_next(syslog_idx);
-			syslog_seq++;
+			syslog_seq = syslog_record.info->seq + 1;
 			n -= syslog_partial;
 			syslog_partial = 0;
 		} else if (!len){
@@ -1420,9 +1436,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 {
 	char *text;
 	int len = 0;
-	u64 next_seq;
 	u64 seq;
-	u32 idx;
 	bool time;
 
 	text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
@@ -1435,38 +1449,30 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 	 * Find first record that fits, including all following records,
 	 * into the user-provided buffer for this dump.
 	 */
-	seq = clear_seq;
-	idx = clear_idx;
-	while (seq < log_next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
-
-		len += msg_print_text(msg, true, time, NULL, 0);
-		idx = log_next(idx);
-		seq++;
-	}
+	prb_for_each_record(clear_seq, prb, seq, &syslog_record)
+		len += record_print_text(&syslog_record, true, time, NULL, 0);
 
 	/* move first record forward until length fits into the buffer */
-	seq = clear_seq;
-	idx = clear_idx;
-	while (len > size && seq < log_next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
-
-		len -= msg_print_text(msg, true, time, NULL, 0);
-		idx = log_next(idx);
-		seq++;
+	prb_for_each_record(clear_seq, prb, seq, &syslog_record) {
+		if (len <= size)
+			break;
+		len -= record_print_text(&syslog_record, true, time, NULL, 0);
 	}
 
-	/* last message fitting into this dump */
-	next_seq = log_next_seq;
-
 	len = 0;
-	while (len >= 0 && seq < next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
-		int textlen = msg_print_text(msg, true, time, text,
-					     LOG_LINE_MAX + PREFIX_MAX);
+	prb_for_each_record(seq, prb, seq, &syslog_record) {
+		int textlen;
 
-		idx = log_next(idx);
-		seq++;
+		if (len < 0)
+			break;
+
+		textlen = record_print_text(&syslog_record, true, time, text,
+					    LOG_LINE_MAX + PREFIX_MAX);
+
+		if (len + textlen > size) {
+			seq--;
+			break;
+		}
 
 		logbuf_unlock_irq();
 		if (copy_to_user(buf + len, text, textlen))
@@ -1474,18 +1480,10 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 		else
 			len += textlen;
 		logbuf_lock_irq();
-
-		if (seq < log_first_seq) {
-			/* messages are gone, move to next one */
-			seq = log_first_seq;
-			idx = log_first_idx;
-		}
 	}
 
-	if (clear) {
-		clear_seq = log_next_seq;
-		clear_idx = log_next_idx;
-	}
+	if (clear)
+		clear_seq = seq;
 	logbuf_unlock_irq();
 
 	kfree(text);
@@ -1495,8 +1493,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
 static void syslog_clear(void)
 {
 	logbuf_lock_irq();
-	clear_seq = log_next_seq;
-	clear_idx = log_next_idx;
+	clear_seq = prb_next_seq(prb);
 	logbuf_unlock_irq();
 }
 
@@ -1523,7 +1520,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
 		if (!access_ok(buf, len))
 			return -EFAULT;
 		error = wait_event_interruptible(log_wait,
-						 syslog_seq != log_next_seq);
+				prb_read_valid(prb, syslog_seq, NULL));
 		if (error)
 			return error;
 		error = syslog_print(buf, len);
@@ -1572,10 +1569,9 @@ int do_syslog(int type, char __user *buf, int len, int source)
 	/* Number of chars in the log buffer */
 	case SYSLOG_ACTION_SIZE_UNREAD:
 		logbuf_lock_irq();
-		if (syslog_seq < log_first_seq) {
+		if (syslog_seq < prb_first_seq(prb)) {
 			/* messages are gone, move to first one */
-			syslog_seq = log_first_seq;
-			syslog_idx = log_first_idx;
+			syslog_seq = prb_first_seq(prb);
 			syslog_partial = 0;
 		}
 		if (source == SYSLOG_FROM_PROC) {
@@ -1584,20 +1580,17 @@ int do_syslog(int type, char __user *buf, int len, int source)
 			 * for pending data, not the size; return the count of
 			 * records, not the length.
 			 */
-			error = log_next_seq - syslog_seq;
+			error = prb_next_seq(prb) - syslog_seq;
 		} else {
-			u64 seq = syslog_seq;
-			u32 idx = syslog_idx;
 			bool time = syslog_partial ? syslog_time : printk_time;
+			u64 seq;
 
-			while (seq < log_next_seq) {
-				struct printk_log *msg = log_from_idx(idx);
-
-				error += msg_print_text(msg, true, time, NULL,
-							0);
+			prb_for_each_record(syslog_seq, prb, seq,
+					    &syslog_record) {
+				error += record_print_text(&syslog_record,
+							   true, time,
+							   NULL, 0);
 				time = printk_time;
-				idx = log_next(idx);
-				seq++;
 			}
 			error -= syslog_partial;
 		}
@@ -1958,7 +1951,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 	int printed_len;
 	bool in_sched = false, pending_output;
 	unsigned long flags;
-	u64 curr_log_seq;
 
 	/* Suppress unimportant messages after panic happens */
 	if (unlikely(suppress_printk))
@@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	/* This stops the holder of console_sem just where we want him */
 	logbuf_lock_irqsave(flags);
-	curr_log_seq = log_next_seq;
+	pending_output = !prb_read_valid(prb, console_seq, NULL);
 	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
-	pending_output = (curr_log_seq != log_next_seq);
+	pending_output &= prb_read_valid(prb, console_seq, NULL);
 	logbuf_unlock_irqrestore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
@@ -2066,21 +2058,30 @@ EXPORT_SYMBOL(printk);
 #define PREFIX_MAX		0
 #define printk_time		false
 
+#define prb_read_valid(rb, seq, r)	false
+#define prb_first_seq(rb)		0
+
 static u64 syslog_seq;
-static u32 syslog_idx;
 static u64 console_seq;
-static u32 console_idx;
 static u64 exclusive_console_stop_seq;
-static u64 log_first_seq;
-static u32 log_first_idx;
-static u64 log_next_seq;
-static char *log_text(const struct printk_log *msg) { return NULL; }
-static char *log_dict(const struct printk_log *msg) { return NULL; }
-static struct printk_log *log_from_idx(u32 idx) { return NULL; }
-static u32 log_next(u32 idx) { return 0; }
-static ssize_t msg_print_ext_header(char *buf, size_t size,
-				    struct printk_log *msg,
-				    u64 seq) { return 0; }
+struct printk_record console_record;
+
+static size_t record_print_text(const struct printk_record *r, bool syslog,
+				bool time, char *buf,
+				size_t size)
+{
+	return 0;
+}
+static size_t record_print_text_inline(const struct printk_record *r,
+				       bool syslog, bool time)
+{
+	return 0;
+}
+static ssize_t info_print_ext_header(char *buf, size_t size,
+				     struct printk_info *info)
+{
+	return 0;
+}
 static ssize_t msg_print_ext_body(char *buf, size_t size,
 				  char *dict, size_t dict_len,
 				  char *text, size_t text_len) { return 0; }
@@ -2088,8 +2089,6 @@ static void console_lock_spinning_enable(void) { }
 static int console_lock_spinning_disable_and_check(void) { return 0; }
 static void call_console_drivers(const char *ext_text, size_t ext_len,
 				 const char *text, size_t len) {}
-static size_t msg_print_text(const struct printk_log *msg, bool syslog,
-			     bool time, char *buf, size_t size) { return 0; }
 static bool suppress_message_printing(int level) { return false; }
 
 #endif /* CONFIG_PRINTK */
@@ -2406,35 +2405,28 @@ void console_unlock(void)
 	}
 
 	for (;;) {
-		struct printk_log *msg;
 		size_t ext_len = 0;
-		size_t len;
+		size_t len = 0;
 
 		printk_safe_enter_irqsave(flags);
 		raw_spin_lock(&logbuf_lock);
-		if (console_seq < log_first_seq) {
+skip:
+		if (!prb_read_valid(prb, console_seq, &console_record))
+			break;
+
+		if (console_seq < console_record.info->seq) {
 			len = sprintf(text,
 				      "** %llu printk messages dropped **\n",
-				      log_first_seq - console_seq);
-
-			/* messages are gone, move to first one */
-			console_seq = log_first_seq;
-			console_idx = log_first_idx;
-		} else {
-			len = 0;
+				      console_record.info->seq - console_seq);
 		}
-skip:
-		if (console_seq == log_next_seq)
-			break;
+		console_seq = console_record.info->seq;
 
-		msg = log_from_idx(console_idx);
-		if (suppress_message_printing(msg->level)) {
+		if (suppress_message_printing(console_record.info->level)) {
 			/*
 			 * Skip record we have buffered and already printed
 			 * directly to the console when we received it, and
 			 * record that has level above the console loglevel.
 			 */
-			console_idx = log_next(console_idx);
 			console_seq++;
 			goto skip;
 		}
@@ -2445,19 +2437,20 @@ void console_unlock(void)
 			exclusive_console = NULL;
 		}
 
-		len += msg_print_text(msg,
+		len += record_print_text(&console_record,
 				console_msg_format & MSG_FORMAT_SYSLOG,
 				printk_time, text + len, sizeof(text) - len);
 		if (nr_ext_console_drivers) {
-			ext_len = msg_print_ext_header(ext_text,
+			ext_len = info_print_ext_header(ext_text,
 						sizeof(ext_text),
-						msg, console_seq);
+						console_record.info);
 			ext_len += msg_print_ext_body(ext_text + ext_len,
 						sizeof(ext_text) - ext_len,
-						log_dict(msg), msg->dict_len,
-						log_text(msg), msg->text_len);
+						&console_record.dict_buf[0],
+						console_record.info->dict_len,
+						&console_record.text_buf[0],
+						console_record.info->text_len);
 		}
-		console_idx = log_next(console_idx);
 		console_seq++;
 		raw_spin_unlock(&logbuf_lock);
 
@@ -2497,7 +2490,7 @@ void console_unlock(void)
 	 * flush, no worries.
 	 */
 	raw_spin_lock(&logbuf_lock);
-	retry = console_seq != log_next_seq;
+	retry = prb_read_valid(prb, console_seq, NULL);
 	raw_spin_unlock(&logbuf_lock);
 	printk_safe_exit_irqrestore(flags);
 
@@ -2566,8 +2559,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
 		unsigned long flags;
 
 		logbuf_lock_irqsave(flags);
-		console_seq = log_first_seq;
-		console_idx = log_first_idx;
+		console_seq = prb_first_seq(prb);
 		logbuf_unlock_irqrestore(flags);
 	}
 	console_unlock();
@@ -2770,8 +2762,6 @@ void register_console(struct console *newcon)
 		 * for us.
 		 */
 		logbuf_lock_irqsave(flags);
-		console_seq = syslog_seq;
-		console_idx = syslog_idx;
 		/*
 		 * We're about to replay the log buffer.  Only do this to the
 		 * just-registered console to avoid excessive message spam to
@@ -2783,6 +2773,7 @@ void register_console(struct console *newcon)
 		 */
 		exclusive_console = newcon;
 		exclusive_console_stop_seq = console_seq;
+		console_seq = syslog_seq;
 		logbuf_unlock_irqrestore(flags);
 	}
 	console_unlock();
@@ -3127,9 +3118,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
 
 		logbuf_lock_irqsave(flags);
 		dumper->cur_seq = clear_seq;
-		dumper->cur_idx = clear_idx;
-		dumper->next_seq = log_next_seq;
-		dumper->next_idx = log_next_idx;
+		dumper->next_seq = prb_next_seq(prb);
 		logbuf_unlock_irqrestore(flags);
 
 		/* invoke dumper which will iterate over records */
@@ -3163,28 +3152,29 @@ void kmsg_dump(enum kmsg_dump_reason reason)
 bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
 			       char *line, size_t size, size_t *len)
 {
-	struct printk_log *msg;
+	struct printk_info info;
+	struct printk_record r = {
+		.info = &info,
+		.text_buf = line,
+		.text_buf_size = size,
+	};
+	unsigned int line_count;
 	size_t l = 0;
 	bool ret = false;
 
 	if (!dumper->active)
 		goto out;
 
-	if (dumper->cur_seq < log_first_seq) {
-		/* messages are gone, move to first available one */
-		dumper->cur_seq = log_first_seq;
-		dumper->cur_idx = log_first_idx;
-	}
+	/* Count text lines instead of reading text? */
+	if (!line)
+		r.text_line_count = &line_count;
 
-	/* last entry */
-	if (dumper->cur_seq >= log_next_seq)
+	if (!prb_read_valid(prb, dumper->cur_seq, &r))
 		goto out;
 
-	msg = log_from_idx(dumper->cur_idx);
-	l = msg_print_text(msg, syslog, printk_time, line, size);
+	l = record_print_text_inline(&r, syslog, printk_time);
 
-	dumper->cur_idx = log_next(dumper->cur_idx);
-	dumper->cur_seq++;
+	dumper->cur_seq = r.info->seq + 1;
 	ret = true;
 out:
 	if (len)
@@ -3245,23 +3235,27 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
 bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 			  char *buf, size_t size, size_t *len)
 {
+	struct printk_info info;
+	unsigned int line_count;
+	/* initially, only count text lines */
+	struct printk_record r = {
+		.info = &info,
+		.text_line_count = &line_count,
+	};
 	unsigned long flags;
 	u64 seq;
-	u32 idx;
 	u64 next_seq;
-	u32 next_idx;
 	size_t l = 0;
 	bool ret = false;
 	bool time = printk_time;
 
-	if (!dumper->active)
+	if (!dumper->active || !buf || !size)
 		goto out;
 
 	logbuf_lock_irqsave(flags);
-	if (dumper->cur_seq < log_first_seq) {
+	if (dumper->cur_seq < prb_first_seq(prb)) {
 		/* messages are gone, move to first available one */
-		dumper->cur_seq = log_first_seq;
-		dumper->cur_idx = log_first_idx;
+		dumper->cur_seq = prb_first_seq(prb);
 	}
 
 	/* last entry */
@@ -3272,41 +3266,43 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 
 	/* calculate length of entire buffer */
 	seq = dumper->cur_seq;
-	idx = dumper->cur_idx;
-	while (seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
-
-		l += msg_print_text(msg, true, time, NULL, 0);
-		idx = log_next(idx);
-		seq++;
+	while (prb_read_valid(prb, seq, &r)) {
+		if (r.info->seq >= dumper->next_seq)
+			break;
+		l += record_print_text_inline(&r, true, time);
+		seq = r.info->seq + 1;
 	}
 
 	/* move first record forward until length fits into the buffer */
 	seq = dumper->cur_seq;
-	idx = dumper->cur_idx;
-	while (l >= size && seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
-
-		l -= msg_print_text(msg, true, time, NULL, 0);
-		idx = log_next(idx);
-		seq++;
+	while (l >= size && prb_read_valid(prb, seq, &r)) {
+		if (r.info->seq >= dumper->next_seq)
+			break;
+		l -= record_print_text_inline(&r, true, time);
+		seq = r.info->seq + 1;
 	}
 
 	/* last message in next interation */
 	next_seq = seq;
-	next_idx = idx;
+
+	/* actually read data into the buffer now */
+	r.text_buf = buf;
+	r.text_buf_size = size;
+	r.text_line_count = NULL;
 
 	l = 0;
-	while (seq < dumper->next_seq) {
-		struct printk_log *msg = log_from_idx(idx);
+	while (prb_read_valid(prb, seq, &r)) {
+		if (r.info->seq >= dumper->next_seq)
+			break;
+
+		l += record_print_text_inline(&r, syslog, time);
+		r.text_buf = buf + l;
+		r.text_buf_size = size - l;
 
-		l += msg_print_text(msg, syslog, time, buf + l, size - l);
-		idx = log_next(idx);
-		seq++;
+		seq = r.info->seq + 1;
 	}
 
 	dumper->next_seq = next_seq;
-	dumper->next_idx = next_idx;
 	ret = true;
 	logbuf_unlock_irqrestore(flags);
 out:
@@ -3329,9 +3325,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
 void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
 {
 	dumper->cur_seq = clear_seq;
-	dumper->cur_idx = clear_idx;
-	dumper->next_seq = log_next_seq;
-	dumper->next_idx = log_next_idx;
+	dumper->next_seq = prb_next_seq(prb);
 }
 
 /**
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/2] printk: add lockless buffer
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
@ 2020-01-29  3:53   ` Steven Rostedt
  2020-02-21 11:54   ` more barriers: " Petr Mladek
  2020-02-21 12:05   ` misc nits " Petr Mladek
  2 siblings, 0 replies; 58+ messages in thread
From: Steven Rostedt @ 2020-01-29  3:53 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Linus Torvalds, Greg Kroah-Hartman,
	Andrea Parri, Thomas Gleixner, kexec, linux-kernel

On Tue, 28 Jan 2020 17:25:47 +0106
John Ogness <john.ogness@linutronix.de> wrote:

> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index 000000000000..796257f226ee
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -0,0 +1,1370 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/kernel.h>
> +#include <linux/irqflags.h>
> +#include <linux/string.h>
> +#include <linux/errno.h>
> +#include <linux/bug.h>
> +#include "printk_ringbuffer.h"
> +
> +/**
> + * DOC: printk_ringbuffer overview
> + *
> + * Data Structure
> + * --------------
> + * The printk_ringbuffer is made up of 3 internal ringbuffers::
> + *
> + *   * desc_ring:      A ring of descriptors. A descriptor contains all record
> + *                     meta data (sequence number, timestamp, loglevel, etc.)
> + *                     as well as internal state information about the record
> + *                     and logical positions specifying where in the other
> + *                     ringbuffers the text and dictionary strings are
> + *                     located.
> + *
> + *   * text_data_ring: A ring of data blocks. A data block consists of an
> + *                     unsigned long integer (ID) that maps to a desc_ring
> + *                     index followed by the text string of the record.
> + *
> + *   * dict_data_ring: A ring of data blocks. A data block consists of an
> + *                     unsigned long integer (ID) that maps to a desc_ring
> + *                     index followed by the dictionary string of the record.
> + *
> + * Implementation
> + * --------------
> + *
> + * ABA Issues
> + * ~~~~~~~~~~
> + * To help avoid ABA issues, descriptors are referenced by IDs (index values
> + * with tagged states) and data blocks are referenced by logical positions
> + * (index values with tagged states). However, on 32-bit systems the number
> + * of tagged states is relatively small such that an ABA incident is (at
> + * least theoretically) possible. For example, if 4 million maximally sized

4 million? I'm guessing that maximally sized printk messages are 1k?

Perhaps say that, otherwise one might think this is a mistake. "4
million maximally sized (1k) printk messages"

> + * printk messages were to occur in NMI context on a 32-bit system, the
> + * interrupted task would not be able to recognize that the 32-bit integer
> + * wrapped and thus represents a different data block than the one the
> + * interrupted task expects.
> + *
> + * To help combat this possibility, additional state checking is performed
> + * (such as using cmpxchg() even though set() would suffice). These extra
> + * checks will hopefully catch any ABA issue that a 32-bit system might
> + * experience.
> + *
[..]

> + * Usage
> + * -----
> + * Here are some simple examples demonstrating writers and readers. For the
> + * examples a global ringbuffer (test_rb) is available (which is not the
> + * actual ringbuffer used by printk)::
> + *
> + *	DECLARE_PRINTKRB(test_rb, 15, 5, 3);
> + *
> + * This ringbuffer allows up to 32768 records (2 ^ 15) and has a size of
> + * 1 MiB (2 ^ 20) for text data and 256 KiB (2 ^ 18) for dictionary data.

 (2 ^ (15 + 5)) ... (2 ^ (15 + 3)) ?

I'll play around more with this this week. But so far it looks good.

-- Steve

> + *
> + * Sample writer code::
> + *
> + *	struct prb_reserved_entry e;
> + *	struct printk_record r;
> + *
> + *	// specify how much to allocate
> + *	r.text_buf_size = strlen(textstr) + 1;
> + *	r.dict_buf_size = strlen(dictstr) + 1;
> + *
> + *	if (prb_reserve(&e, &test_rb, &r)) {
> + *		snprintf(r.text_buf, r.text_buf_size, "%s", textstr);
> + *
> + *		// dictionary allocation may have failed
> + *		if (r.dict_buf)
> + *			snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr);
> + *
> + *		r.info->ts_nsec = local_clock();
> + *
> + *		prb_commit(&e);
> + *	}
> + *
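The quoted documentation shows only the writer side; the matching reader loop can be sketched from the API visible in patch 2 of this thread (prb_for_each_record() and struct printk_record). This is kernel code, not runnable standalone, and the buffer size is arbitrary:

```c
/*
 * Reader-side sketch against the same test_rb. Field and macro
 * names are taken from the patches in this thread; the text
 * buffer size is arbitrary.
 */
struct printk_info info;
char text[200];
struct printk_record r = {
	.info = &info,
	.text_buf = text,
	.text_buf_size = sizeof(text),
};
u64 seq;

/* Walk every currently valid record, starting from sequence 0. */
prb_for_each_record(0, &test_rb, seq, &r)
	pr_info("%llu: %s\n", r.info->seq, r.text_buf);
```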

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-01-28 16:19 [PATCH 0/2] printk: replace ringbuffer John Ogness
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
  2020-01-28 16:19 ` [PATCH 2/2] printk: use the lockless ringbuffer John Ogness
@ 2020-02-05  4:25 ` lijiang
  2020-02-05  4:42   ` Sergey Senozhatsky
  2020-02-05  4:48   ` Sergey Senozhatsky
  2020-02-06  9:21 ` lijiang
  3 siblings, 2 replies; 58+ messages in thread
From: lijiang @ 2020-02-05  4:25 UTC (permalink / raw)
  To: John Ogness, Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9231 bytes --]

Hi, John Ogness

Thank you for your great efforts improving this patch series.

I'm not sure whether I missed anything, or whether there are other related patches that need to be applied.

After applying this patch series, the NMI watchdog detected a hard lockup, which prevented the kernel from booting; please refer to
the following call trace. I have put the complete kernel log in the attachment.

Test machine: 
Intel Platform: Grantley-R Wildcat Pass CPU: Broadwell-EP, B0
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
65536 MB memory, 800 GB disk space

kernel: v5.5-rc7
commit: def9d2780727 ("Linux 5.5-rc7")

......
[  OK  ] Started udev Coldplug all Devices.
[   42.110978] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
[   42.110978] Modules linked in: ip_tables xfs libcrc32c sr_mod cdrom sd_mod sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci libahci ixgbe drm crc32c_intel libata mdio dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
[   42.110986] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.110986] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.110987] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1c0
[   42.110988] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[   42.110988] RSP: 0018:ffffbbe207a7bc48 EFLAGS: 00000002
[   42.110989] RAX: 0000000000f80101 RBX: ffffffffa1576e80 RCX: 0000000000000000
[   42.110990] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1e95660
[   42.110990] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000b
[   42.110991] R10: ffffa075df5dcf80 R11: ffffffffa0ebfda0 R12: ffffffffa1e95660
[   42.110991] R13: ffffffffa1e97680 R14: ffffffffa17197a0 R15: 0000000000000047
[   42.110991] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
[   42.110992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.110992] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
[   42.110993] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   42.110993] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   42.110993] Call Trace:
[   42.110993]  _raw_spin_lock+0x1a/0x20
[   42.110994]  console_unlock+0x9e/0x450
[   42.110994]  bust_spinlocks+0x16/0x30
[   42.110994]  oops_end+0x33/0xc0
[   42.110995]  general_protection+0x32/0x40
[   42.110995] RIP: 0010:copy_data+0xf2/0x1e0
[   42.110995] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
[   42.110996] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
[   42.110996] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
[   42.110997] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
[   42.110997] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
[   42.110997] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
[   42.110998] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
[   42.110998]  _prb_read_valid+0xd8/0x190
[   42.110998]  prb_read_valid+0x15/0x20
[   42.110999]  devkmsg_read+0x9d/0x2a0
[   42.110999]  vfs_read+0x91/0x140
[   42.110999]  ksys_read+0x59/0xd0
[   42.111000]  do_syscall_64+0x55/0x1b0
[   42.111000]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   42.111000] RIP: 0033:0x7f7c55740b62
[   42.111001] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[   42.111001] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   42.111002] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
[   42.111002] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
[   42.111002] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
[   42.111003] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
[   42.111003] R13: 00007ffe95f4e910 R14: 0000000000000000 R15: 0000000000000000
[   42.111003] Kernel panic - not syncing: Hard LOCKUP
[   42.111004] Shutting down cpus with NMI
[   42.111004] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   42.111005] general protection fault: 0000 [#1] SMP PTI
[   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
[   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
[   42.111007] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
[   42.111007] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
[   42.111008] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
[   42.111008] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
[   42.111008] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
[   42.111009] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
[   42.111009] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
[   42.111010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.111010] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
[   42.111011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   42.111011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   42.111012] Call Trace:
[   42.111012]  _prb_read_valid+0xd8/0x190
[   42.111012]  prb_read_valid+0x15/0x20
[   42.111013]  devkmsg_read+0x9d/0x2a0
[   42.111013]  vfs_read+0x91/0x140
[   42.111013]  ksys_read+0x59/0xd0
[   42.111014]  do_syscall_64+0x55/0x1b0
[   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   42.111014] RIP: 0033:0x7f7c55740b62
[   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[   42.111015] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   42.111016] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
[   42.111016] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
[   42.111017] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
[   42.111017] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
[   42.111017] R13: 00007ffe95f4e910 R14: 0000000000000000 R15: 0000000000000000
[   42.111017] Modules linked in: ip_tables xfs libcrc32c sr_mod cdrom sd_mod sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci libahci ixgbe drm crc32c_intel libata mdio dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
---hang---


Thanks.
Lianbo

> Hello,
> 
> After several RFC series [0][1][2][3][4], here is the first set of
> patches to rework the printk subsystem. This first set of patches
> only replace the existing ringbuffer implementation. No locking is
> removed. No semantics/behavior of printk are changed.
> 
> The VMCOREINFO is updated, which will require changes to the
> external crash [5] tool. I will be preparing a patch to add support
> for the new VMCOREINFO.
> 
> This series is in line with the agreements [6] made at the meeting
> during LPC2019 in Lisbon, with 1 exception: support for dictionaries
> will _not_ be discontinued [7]. Dictionaries are stored in a separate
> buffer so that they cannot interfere with the human-readable buffer.
> 
> John Ogness
> 
> [0] https://lkml.kernel.org/r/20190212143003.48446-1-john.ogness@linutronix.de
> [1] https://lkml.kernel.org/r/20190607162349.18199-1-john.ogness@linutronix.de
> [2] https://lkml.kernel.org/r/20190727013333.11260-1-john.ogness@linutronix.de
> [3] https://lkml.kernel.org/r/20190807222634.1723-1-john.ogness@linutronix.de
> [4] https://lkml.kernel.org/r/20191128015235.12940-1-john.ogness@linutronix.de
> [5] https://github.com/crash-utility/crash
> [6] https://lkml.kernel.org/r/87k1acz5rx.fsf@linutronix.de
> [7] https://lkml.kernel.org/r/20191007120134.ciywr3wale4gxa6v@pathway.suse.cz
> 
> John Ogness (2):
>   printk: add lockless buffer
>   printk: use the lockless ringbuffer
> 
>  include/linux/kmsg_dump.h         |    2 -
>  kernel/printk/Makefile            |    1 +
>  kernel/printk/printk.c            |  836 +++++++++---------
>  kernel/printk/printk_ringbuffer.c | 1370 +++++++++++++++++++++++++++++
>  kernel/printk/printk_ringbuffer.h |  328 +++++++
>  5 files changed, 2114 insertions(+), 423 deletions(-)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
> 

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: kernel-5.5.0-rc7.log --]
[-- Type: text/x-log; name="kernel-5.5.0-rc7.log", Size: 101696 bytes --]

[    0.000000] Linux version 5.5.0-rc7+ (root@intel-wildcatpass-07) (gcc version 8.3.1 20191121 (GCC)) #4 SMP Tue Feb 4 05:14:30 EST 2020
[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.0-rc7+ root=/dev/mapper/intel--wildcatpass--07-root ro crashkernel=512M resume=/dev/mapper/intel--wildcatpass--07-swap rd.lvm.lv=intel-wildcatpass-07/root rd.lvm.lv=intel-wildcatpass-07/swap console=ttyS0,115200n81
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009b3ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009b400-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000079f9ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000079fa0000-0x000000007ac4ffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007ac50000-0x000000007b67ffff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000007b680000-0x000000007b7ccfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000007b7cd000-0x000000007b7fffff] usable
[    0.000000] BIOS-e820: [mem 0x000000007b800000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff400000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000107fffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 2194.914 MHz processor
[    0.001575] last_pfn = 0x1080000 max_arch_pfn = 0x400000000
[    0.002262] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.003249] last_pfn = 0x7b800 max_arch_pfn = 0x400000000
[    0.011951] found SMP MP-table at [mem 0x000fd850-0x000fd85f]
[    0.012008] kexec: Reserving the low 1M of memory for crashkernel
[    0.012012] Using GB pages for direct mapping
[    0.012120] RAMDISK: [mem 0x2e1ce000-0x330defff]
[    0.012125] ACPI: Early table checksum verification disabled
[    0.012127] ACPI: RSDP 0x00000000000F0460 000024 (v02 INTEL )
[    0.012132] ACPI: XSDT 0x000000007B7CB0E8 0000C4 (v01 INTEL  S2600WT  00000000 INTL 01000013)
[    0.012137] ACPI: FACP 0x000000007B7CA000 0000F4 (v04 INTEL  S2600WT  00000000 INTL 20091013)
[    0.012143] ACPI: DSDT 0x000000007B78F000 032DA6 (v02 INTEL  S2600WT  00000003 INTL 20091013)
[    0.012146] ACPI: FACS 0x000000007B619000 000040
[    0.012150] ACPI: HPET 0x000000007B7C9000 000038 (v01 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012153] ACPI: APIC 0x000000007B7C8000 000AFC (v03 INTEL  S2600WT  00000000 INTL 20091013)
[    0.012157] ACPI: MCFG 0x000000007B7C7000 00003C (v01 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012160] ACPI: MSCT 0x000000007B7C6000 000090 (v01 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012163] ACPI: SLIT 0x000000007B7C5000 00006C (v01 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012167] ACPI: SRAT 0x000000007B7C4000 000AB0 (v03 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012170] ACPI: SPMI 0x000000007B7C3000 000041 (v05 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012174] ACPI: WDDT 0x000000007B7C2000 000040 (v01 INTEL  S2600WT  00000000 INTL 20091013)
[    0.012177] ACPI: NITR 0x000000007B6A4000 000071 (v02 INTEL  S2600WT  00000001 INTL 20091013)
[    0.012181] ACPI: PRAD 0x000000007B69F000 000102 (v02 INTEL  SpsPrAgg 00000002 INTL 20130328)
[    0.012184] ACPI: UEFI 0x000000007B674000 000042 (v01 INTEL  S2600WT  00000002 INTL 01000013)
[    0.012188] ACPI: SSDT 0x000000007B6A5000 0E96B0 (v02 INTEL  S2600WT  00004000 INTL 20130328)
[    0.012191] ACPI: SSDT 0x000000007B6A1000 002679 (v02 INTEL  S2600WT  00000002 INTL 20130328)
[    0.012195] ACPI: SSDT 0x000000007B6A0000 000064 (v02 INTEL  S2600WT  00000002 INTL 20130328)
[    0.012198] ACPI: HEST 0x000000007B69E000 0000A8 (v01 INTEL  S2600WT  00000001 INTL 00000001)
[    0.012202] ACPI: BERT 0x000000007B69D000 000030 (v01 INTEL  S2600WT  00000001 INTL 00000001)
[    0.012205] ACPI: ERST 0x000000007B69C000 000230 (v01 INTEL  S2600WT  00000001 INTL 00000001)
[    0.012209] ACPI: EINJ 0x000000007B69B000 000130 (v01 INTEL  S2600WT  00000001 INTL 00000001)
[    0.012212] ACPI: SPCR 0x000000007B69A000 000050 (v01 INTEL  S2600WT  00000000 INTL 00000000)
[    0.012248] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[    0.012250] SRAT: PXM 0 -> APIC 0x02 -> Node 0
[    0.012251] SRAT: PXM 0 -> APIC 0x04 -> Node 0
[    0.012252] SRAT: PXM 0 -> APIC 0x06 -> Node 0
[    0.012253] SRAT: PXM 0 -> APIC 0x08 -> Node 0
[    0.012254] SRAT: PXM 0 -> APIC 0x0a -> Node 0
[    0.012255] SRAT: PXM 0 -> APIC 0x10 -> Node 0
[    0.012256] SRAT: PXM 0 -> APIC 0x12 -> Node 0
[    0.012257] SRAT: PXM 0 -> APIC 0x14 -> Node 0
[    0.012258] SRAT: PXM 0 -> APIC 0x16 -> Node 0
[    0.012259] SRAT: PXM 0 -> APIC 0x18 -> Node 0
[    0.012261] SRAT: PXM 0 -> APIC 0x20 -> Node 0
[    0.012262] SRAT: PXM 0 -> APIC 0x22 -> Node 0
[    0.012263] SRAT: PXM 0 -> APIC 0x24 -> Node 0
[    0.012264] SRAT: PXM 0 -> APIC 0x26 -> Node 0
[    0.012265] SRAT: PXM 0 -> APIC 0x28 -> Node 0
[    0.012266] SRAT: PXM 0 -> APIC 0x2a -> Node 0
[    0.012267] SRAT: PXM 0 -> APIC 0x30 -> Node 0
[    0.012268] SRAT: PXM 0 -> APIC 0x32 -> Node 0
[    0.012269] SRAT: PXM 0 -> APIC 0x34 -> Node 0
[    0.012270] SRAT: PXM 0 -> APIC 0x36 -> Node 0
[    0.012271] SRAT: PXM 0 -> APIC 0x38 -> Node 0
[    0.012273] SRAT: PXM 1 -> APIC 0x40 -> Node 1
[    0.012274] SRAT: PXM 1 -> APIC 0x42 -> Node 1
[    0.012275] SRAT: PXM 1 -> APIC 0x44 -> Node 1
[    0.012276] SRAT: PXM 1 -> APIC 0x46 -> Node 1
[    0.012277] SRAT: PXM 1 -> APIC 0x48 -> Node 1
[    0.012278] SRAT: PXM 1 -> APIC 0x4a -> Node 1
[    0.012279] SRAT: PXM 1 -> APIC 0x50 -> Node 1
[    0.012280] SRAT: PXM 1 -> APIC 0x52 -> Node 1
[    0.012282] SRAT: PXM 1 -> APIC 0x54 -> Node 1
[    0.012283] SRAT: PXM 1 -> APIC 0x56 -> Node 1
[    0.012284] SRAT: PXM 1 -> APIC 0x58 -> Node 1
[    0.012285] SRAT: PXM 1 -> APIC 0x60 -> Node 1
[    0.012286] SRAT: PXM 1 -> APIC 0x62 -> Node 1
[    0.012287] SRAT: PXM 1 -> APIC 0x64 -> Node 1
[    0.012288] SRAT: PXM 1 -> APIC 0x66 -> Node 1
[    0.012289] SRAT: PXM 1 -> APIC 0x68 -> Node 1
[    0.012290] SRAT: PXM 1 -> APIC 0x6a -> Node 1
[    0.012291] SRAT: PXM 1 -> APIC 0x70 -> Node 1
[    0.012293] SRAT: PXM 1 -> APIC 0x72 -> Node 1
[    0.012294] SRAT: PXM 1 -> APIC 0x74 -> Node 1
[    0.012295] SRAT: PXM 1 -> APIC 0x76 -> Node 1
[    0.012296] SRAT: PXM 1 -> APIC 0x78 -> Node 1
[    0.012297] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[    0.012298] SRAT: PXM 0 -> APIC 0x03 -> Node 0
[    0.012299] SRAT: PXM 0 -> APIC 0x05 -> Node 0
[    0.012301] SRAT: PXM 0 -> APIC 0x07 -> Node 0
[    0.012302] SRAT: PXM 0 -> APIC 0x09 -> Node 0
[    0.012303] SRAT: PXM 0 -> APIC 0x0b -> Node 0
[    0.012304] SRAT: PXM 0 -> APIC 0x11 -> Node 0
[    0.012305] SRAT: PXM 0 -> APIC 0x13 -> Node 0
[    0.012306] SRAT: PXM 0 -> APIC 0x15 -> Node 0
[    0.012307] SRAT: PXM 0 -> APIC 0x17 -> Node 0
[    0.012308] SRAT: PXM 0 -> APIC 0x19 -> Node 0
[    0.012309] SRAT: PXM 0 -> APIC 0x21 -> Node 0
[    0.012310] SRAT: PXM 0 -> APIC 0x23 -> Node 0
[    0.012311] SRAT: PXM 0 -> APIC 0x25 -> Node 0
[    0.012312] SRAT: PXM 0 -> APIC 0x27 -> Node 0
[    0.012314] SRAT: PXM 0 -> APIC 0x29 -> Node 0
[    0.012315] SRAT: PXM 0 -> APIC 0x2b -> Node 0
[    0.012316] SRAT: PXM 0 -> APIC 0x31 -> Node 0
[    0.012317] SRAT: PXM 0 -> APIC 0x33 -> Node 0
[    0.012318] SRAT: PXM 0 -> APIC 0x35 -> Node 0
[    0.012319] SRAT: PXM 0 -> APIC 0x37 -> Node 0
[    0.012320] SRAT: PXM 0 -> APIC 0x39 -> Node 0
[    0.012321] SRAT: PXM 1 -> APIC 0x41 -> Node 1
[    0.012322] SRAT: PXM 1 -> APIC 0x43 -> Node 1
[    0.012323] SRAT: PXM 1 -> APIC 0x45 -> Node 1
[    0.012324] SRAT: PXM 1 -> APIC 0x47 -> Node 1
[    0.012325] SRAT: PXM 1 -> APIC 0x49 -> Node 1
[    0.012326] SRAT: PXM 1 -> APIC 0x4b -> Node 1
[    0.012328] SRAT: PXM 1 -> APIC 0x51 -> Node 1
[    0.012329] SRAT: PXM 1 -> APIC 0x53 -> Node 1
[    0.012330] SRAT: PXM 1 -> APIC 0x55 -> Node 1
[    0.012331] SRAT: PXM 1 -> APIC 0x57 -> Node 1
[    0.012332] SRAT: PXM 1 -> APIC 0x59 -> Node 1
[    0.012333] SRAT: PXM 1 -> APIC 0x61 -> Node 1
[    0.012334] SRAT: PXM 1 -> APIC 0x63 -> Node 1
[    0.012335] SRAT: PXM 1 -> APIC 0x65 -> Node 1
[    0.012336] SRAT: PXM 1 -> APIC 0x67 -> Node 1
[    0.012337] SRAT: PXM 1 -> APIC 0x69 -> Node 1
[    0.012338] SRAT: PXM 1 -> APIC 0x6b -> Node 1
[    0.012339] SRAT: PXM 1 -> APIC 0x71 -> Node 1
[    0.012340] SRAT: PXM 1 -> APIC 0x73 -> Node 1
[    0.012342] SRAT: PXM 1 -> APIC 0x75 -> Node 1
[    0.012343] SRAT: PXM 1 -> APIC 0x77 -> Node 1
[    0.012344] SRAT: PXM 1 -> APIC 0x79 -> Node 1
[    0.012347] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.012349] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x87fffffff]
[    0.012350] ACPI: SRAT: Node 1 PXM 1 [mem 0x880000000-0x107fffffff]
[    0.012360] NUMA: Node 0 [mem 0x00000000-0x7fffffff] + [mem 0x100000000-0x87fffffff] -> [mem 0x00000000-0x87fffffff]
[    0.012371] NODE_DATA(0) allocated [mem 0x87ffd6000-0x87fffffff]
[    0.012408] NODE_DATA(1) allocated [mem 0x107ffd5000-0x107fffefff]
[    0.012598] Reserving 512MB of memory at 1424MB for crashkernel (System RAM: 65439MB)
[    0.012674] Zone ranges:
[    0.012676]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.012677]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.012679]   Normal   [mem 0x0000000100000000-0x000000107fffffff]
[    0.012681]   Device   empty
[    0.012682] Movable zone start for each node
[    0.012686] Early memory node ranges
[    0.012688]   node   0: [mem 0x0000000000001000-0x000000000009afff]
[    0.012689]   node   0: [mem 0x0000000000100000-0x0000000079f9ffff]
[    0.012691]   node   0: [mem 0x000000007b7cd000-0x000000007b7fffff]
[    0.012692]   node   0: [mem 0x0000000100000000-0x000000087fffffff]
[    0.012695]   node   1: [mem 0x0000000880000000-0x000000107fffffff]
[    0.013116] Zeroed struct page in unavailable ranges: 24723 pages
[    0.013117] Initmem setup node 0 [mem 0x0000000000001000-0x000000087fffffff]
[    0.027686] Initmem setup node 1 [mem 0x0000000880000000-0x000000107fffffff]
[    0.028566] ACPI: PM-Timer IO Port: 0x408
[    0.028590] ACPI: LAPIC_NMI (acpi_id[0x00] high level lint[0x1])
[    0.028592] ACPI: LAPIC_NMI (acpi_id[0x01] high level lint[0x1])
[    0.028593] ACPI: LAPIC_NMI (acpi_id[0x02] high level lint[0x1])
[    0.028594] ACPI: LAPIC_NMI (acpi_id[0x03] high level lint[0x1])
[    0.028595] ACPI: LAPIC_NMI (acpi_id[0x04] high level lint[0x1])
[    0.028596] ACPI: LAPIC_NMI (acpi_id[0x05] high level lint[0x1])
[    0.028598] ACPI: LAPIC_NMI (acpi_id[0x06] high level lint[0x1])
[    0.028599] ACPI: LAPIC_NMI (acpi_id[0x07] high level lint[0x1])
[    0.028600] ACPI: LAPIC_NMI (acpi_id[0x08] high level lint[0x1])
[    0.028602] ACPI: LAPIC_NMI (acpi_id[0x09] high level lint[0x1])
[    0.028603] ACPI: LAPIC_NMI (acpi_id[0x0a] high level lint[0x1])
[    0.028604] ACPI: LAPIC_NMI (acpi_id[0x0b] high level lint[0x1])
[    0.028605] ACPI: LAPIC_NMI (acpi_id[0x0c] high level lint[0x1])
[    0.028606] ACPI: LAPIC_NMI (acpi_id[0x0d] high level lint[0x1])
[    0.028608] ACPI: LAPIC_NMI (acpi_id[0x0e] high level lint[0x1])
[    0.028609] ACPI: LAPIC_NMI (acpi_id[0x0f] high level lint[0x1])
[    0.028610] ACPI: LAPIC_NMI (acpi_id[0x10] high level lint[0x1])
[    0.028611] ACPI: LAPIC_NMI (acpi_id[0x11] high level lint[0x1])
[    0.028612] ACPI: LAPIC_NMI (acpi_id[0x12] high level lint[0x1])
[    0.028613] ACPI: LAPIC_NMI (acpi_id[0x13] high level lint[0x1])
[    0.028614] ACPI: LAPIC_NMI (acpi_id[0x14] high level lint[0x1])
[    0.028616] ACPI: LAPIC_NMI (acpi_id[0x15] high level lint[0x1])
[    0.028617] ACPI: LAPIC_NMI (acpi_id[0x16] high level lint[0x1])
[    0.028618] ACPI: LAPIC_NMI (acpi_id[0x17] high level lint[0x1])
[    0.028619] ACPI: LAPIC_NMI (acpi_id[0x18] high level lint[0x1])
[    0.028620] ACPI: LAPIC_NMI (acpi_id[0x19] high level lint[0x1])
[    0.028621] ACPI: LAPIC_NMI (acpi_id[0x1a] high level lint[0x1])
[    0.028623] ACPI: LAPIC_NMI (acpi_id[0x1b] high level lint[0x1])
[    0.028624] ACPI: LAPIC_NMI (acpi_id[0x1c] high level lint[0x1])
[    0.028625] ACPI: LAPIC_NMI (acpi_id[0x1d] high level lint[0x1])
[    0.028626] ACPI: LAPIC_NMI (acpi_id[0x1e] high level lint[0x1])
[    0.028627] ACPI: LAPIC_NMI (acpi_id[0x1f] high level lint[0x1])
[    0.028629] ACPI: LAPIC_NMI (acpi_id[0x20] high level lint[0x1])
[    0.028630] ACPI: LAPIC_NMI (acpi_id[0x21] high level lint[0x1])
[    0.028631] ACPI: LAPIC_NMI (acpi_id[0x22] high level lint[0x1])
[    0.028632] ACPI: LAPIC_NMI (acpi_id[0x23] high level lint[0x1])
[    0.028633] ACPI: LAPIC_NMI (acpi_id[0x24] high level lint[0x1])
[    0.028634] ACPI: LAPIC_NMI (acpi_id[0x25] high level lint[0x1])
[    0.028636] ACPI: LAPIC_NMI (acpi_id[0x26] high level lint[0x1])
[    0.028637] ACPI: LAPIC_NMI (acpi_id[0x27] high level lint[0x1])
[    0.028638] ACPI: LAPIC_NMI (acpi_id[0x28] high level lint[0x1])
[    0.028639] ACPI: LAPIC_NMI (acpi_id[0x29] high level lint[0x1])
[    0.028640] ACPI: LAPIC_NMI (acpi_id[0x2a] high level lint[0x1])
[    0.028641] ACPI: LAPIC_NMI (acpi_id[0x2b] high level lint[0x1])
[    0.028642] ACPI: LAPIC_NMI (acpi_id[0x2c] high level lint[0x1])
[    0.028644] ACPI: LAPIC_NMI (acpi_id[0x2d] high level lint[0x1])
[    0.028645] ACPI: LAPIC_NMI (acpi_id[0x2e] high level lint[0x1])
[    0.028646] ACPI: LAPIC_NMI (acpi_id[0x2f] high level lint[0x1])
[    0.028647] ACPI: LAPIC_NMI (acpi_id[0x30] high level lint[0x1])
[    0.028648] ACPI: LAPIC_NMI (acpi_id[0x31] high level lint[0x1])
[    0.028649] ACPI: LAPIC_NMI (acpi_id[0x32] high level lint[0x1])
[    0.028650] ACPI: LAPIC_NMI (acpi_id[0x33] high level lint[0x1])
[    0.028651] ACPI: LAPIC_NMI (acpi_id[0x34] high level lint[0x1])
[    0.028653] ACPI: LAPIC_NMI (acpi_id[0x35] high level lint[0x1])
[    0.028654] ACPI: LAPIC_NMI (acpi_id[0x36] high level lint[0x1])
[    0.028655] ACPI: LAPIC_NMI (acpi_id[0x37] high level lint[0x1])
[    0.028656] ACPI: LAPIC_NMI (acpi_id[0x38] high level lint[0x1])
[    0.028657] ACPI: LAPIC_NMI (acpi_id[0x39] high level lint[0x1])
[    0.028658] ACPI: LAPIC_NMI (acpi_id[0x3a] high level lint[0x1])
[    0.028660] ACPI: LAPIC_NMI (acpi_id[0x3b] high level lint[0x1])
[    0.028661] ACPI: LAPIC_NMI (acpi_id[0x3c] high level lint[0x1])
[    0.028662] ACPI: LAPIC_NMI (acpi_id[0x3d] high level lint[0x1])
[    0.028663] ACPI: LAPIC_NMI (acpi_id[0x3e] high level lint[0x1])
[    0.028665] ACPI: LAPIC_NMI (acpi_id[0x3f] high level lint[0x1])
[    0.028666] ACPI: LAPIC_NMI (acpi_id[0x40] high level lint[0x1])
[    0.028667] ACPI: LAPIC_NMI (acpi_id[0x41] high level lint[0x1])
[    0.028668] ACPI: LAPIC_NMI (acpi_id[0x42] high level lint[0x1])
[    0.028670] ACPI: LAPIC_NMI (acpi_id[0x43] high level lint[0x1])
[    0.028671] ACPI: LAPIC_NMI (acpi_id[0x44] high level lint[0x1])
[    0.028672] ACPI: LAPIC_NMI (acpi_id[0x45] high level lint[0x1])
[    0.028673] ACPI: LAPIC_NMI (acpi_id[0x46] high level lint[0x1])
[    0.028674] ACPI: LAPIC_NMI (acpi_id[0x47] high level lint[0x1])
[    0.028675] ACPI: LAPIC_NMI (acpi_id[0x48] high level lint[0x1])
[    0.028676] ACPI: LAPIC_NMI (acpi_id[0x49] high level lint[0x1])
[    0.028677] ACPI: LAPIC_NMI (acpi_id[0x4a] high level lint[0x1])
[    0.028679] ACPI: LAPIC_NMI (acpi_id[0x4b] high level lint[0x1])
[    0.028680] ACPI: LAPIC_NMI (acpi_id[0x4c] high level lint[0x1])
[    0.028681] ACPI: LAPIC_NMI (acpi_id[0x4d] high level lint[0x1])
[    0.028682] ACPI: LAPIC_NMI (acpi_id[0x4e] high level lint[0x1])
[    0.028683] ACPI: LAPIC_NMI (acpi_id[0x4f] high level lint[0x1])
[    0.028684] ACPI: LAPIC_NMI (acpi_id[0x50] high level lint[0x1])
[    0.028685] ACPI: LAPIC_NMI (acpi_id[0x51] high level lint[0x1])
[    0.028687] ACPI: LAPIC_NMI (acpi_id[0x52] high level lint[0x1])
[    0.028688] ACPI: LAPIC_NMI (acpi_id[0x53] high level lint[0x1])
[    0.028689] ACPI: LAPIC_NMI (acpi_id[0x54] high level lint[0x1])
[    0.028690] ACPI: LAPIC_NMI (acpi_id[0x55] high level lint[0x1])
[    0.028691] ACPI: LAPIC_NMI (acpi_id[0x56] high level lint[0x1])
[    0.028692] ACPI: LAPIC_NMI (acpi_id[0x57] high level lint[0x1])
[    0.028694] ACPI: LAPIC_NMI (acpi_id[0x58] high level lint[0x1])
[    0.028695] ACPI: LAPIC_NMI (acpi_id[0x59] high level lint[0x1])
[    0.028696] ACPI: LAPIC_NMI (acpi_id[0x5a] high level lint[0x1])
[    0.028697] ACPI: LAPIC_NMI (acpi_id[0x5b] high level lint[0x1])
[    0.028698] ACPI: LAPIC_NMI (acpi_id[0x5c] high level lint[0x1])
[    0.028700] ACPI: LAPIC_NMI (acpi_id[0x5d] high level lint[0x1])
[    0.028701] ACPI: LAPIC_NMI (acpi_id[0x5e] high level lint[0x1])
[    0.028702] ACPI: LAPIC_NMI (acpi_id[0x5f] high level lint[0x1])
[    0.028703] ACPI: LAPIC_NMI (acpi_id[0x60] high level lint[0x1])
[    0.028704] ACPI: LAPIC_NMI (acpi_id[0x61] high level lint[0x1])
[    0.028705] ACPI: LAPIC_NMI (acpi_id[0x62] high level lint[0x1])
[    0.028706] ACPI: LAPIC_NMI (acpi_id[0x63] high level lint[0x1])
[    0.028707] ACPI: LAPIC_NMI (acpi_id[0x64] high level lint[0x1])
[    0.028709] ACPI: LAPIC_NMI (acpi_id[0x65] high level lint[0x1])
[    0.028710] ACPI: LAPIC_NMI (acpi_id[0x66] high level lint[0x1])
[    0.028711] ACPI: LAPIC_NMI (acpi_id[0x67] high level lint[0x1])
[    0.028712] ACPI: LAPIC_NMI (acpi_id[0x68] high level lint[0x1])
[    0.028713] ACPI: LAPIC_NMI (acpi_id[0x69] high level lint[0x1])
[    0.028714] ACPI: LAPIC_NMI (acpi_id[0x6a] high level lint[0x1])
[    0.028715] ACPI: LAPIC_NMI (acpi_id[0x6b] high level lint[0x1])
[    0.028717] ACPI: LAPIC_NMI (acpi_id[0x6c] high level lint[0x1])
[    0.028718] ACPI: LAPIC_NMI (acpi_id[0x6d] high level lint[0x1])
[    0.028719] ACPI: LAPIC_NMI (acpi_id[0x6e] high level lint[0x1])
[    0.028720] ACPI: LAPIC_NMI (acpi_id[0x6f] high level lint[0x1])
[    0.028721] ACPI: LAPIC_NMI (acpi_id[0x70] high level lint[0x1])
[    0.028722] ACPI: LAPIC_NMI (acpi_id[0x71] high level lint[0x1])
[    0.028723] ACPI: LAPIC_NMI (acpi_id[0x72] high level lint[0x1])
[    0.028724] ACPI: LAPIC_NMI (acpi_id[0x73] high level lint[0x1])
[    0.028726] ACPI: LAPIC_NMI (acpi_id[0x74] high level lint[0x1])
[    0.028727] ACPI: LAPIC_NMI (acpi_id[0x75] high level lint[0x1])
[    0.028728] ACPI: LAPIC_NMI (acpi_id[0x76] high level lint[0x1])
[    0.028729] ACPI: LAPIC_NMI (acpi_id[0x77] high level lint[0x1])
[    0.028730] ACPI: LAPIC_NMI (acpi_id[0x78] high level lint[0x1])
[    0.028731] ACPI: LAPIC_NMI (acpi_id[0x79] high level lint[0x1])
[    0.028733] ACPI: LAPIC_NMI (acpi_id[0x7a] high level lint[0x1])
[    0.028734] ACPI: LAPIC_NMI (acpi_id[0x7b] high level lint[0x1])
[    0.028735] ACPI: LAPIC_NMI (acpi_id[0x7c] high level lint[0x1])
[    0.028736] ACPI: LAPIC_NMI (acpi_id[0x7c] high level lint[0x1])
[    0.028737] ACPI: LAPIC_NMI (acpi_id[0x7d] high level lint[0x1])
[    0.028739] ACPI: LAPIC_NMI (acpi_id[0x7e] high level lint[0x1])
[    0.028740] ACPI: LAPIC_NMI (acpi_id[0x7f] high level lint[0x1])
[    0.028741] ACPI: LAPIC_NMI (acpi_id[0x80] high level lint[0x1])
[    0.028742] ACPI: LAPIC_NMI (acpi_id[0x81] high level lint[0x1])
[    0.028743] ACPI: LAPIC_NMI (acpi_id[0x82] high level lint[0x1])
[    0.028744] ACPI: LAPIC_NMI (acpi_id[0x83] high level lint[0x1])
[    0.028745] ACPI: LAPIC_NMI (acpi_id[0x84] high level lint[0x1])
[    0.028747] ACPI: LAPIC_NMI (acpi_id[0x85] high level lint[0x1])
[    0.028748] ACPI: LAPIC_NMI (acpi_id[0x86] high level lint[0x1])
[    0.028749] ACPI: LAPIC_NMI (acpi_id[0x87] high level lint[0x1])
[    0.028750] ACPI: LAPIC_NMI (acpi_id[0x88] high level lint[0x1])
[    0.028751] ACPI: LAPIC_NMI (acpi_id[0x89] high level lint[0x1])
[    0.028752] ACPI: LAPIC_NMI (acpi_id[0x8a] high level lint[0x1])
[    0.028753] ACPI: LAPIC_NMI (acpi_id[0x8b] high level lint[0x1])
[    0.028754] ACPI: LAPIC_NMI (acpi_id[0x8c] high level lint[0x1])
[    0.028756] ACPI: LAPIC_NMI (acpi_id[0x8d] high level lint[0x1])
[    0.028757] ACPI: LAPIC_NMI (acpi_id[0x8f] high level lint[0x1])
[    0.028758] ACPI: LAPIC_NMI (acpi_id[0x90] high level lint[0x1])
[    0.028759] ACPI: LAPIC_NMI (acpi_id[0x91] high level lint[0x1])
[    0.028761] ACPI: LAPIC_NMI (acpi_id[0x92] high level lint[0x1])
[    0.028762] ACPI: LAPIC_NMI (acpi_id[0x93] high level lint[0x1])
[    0.028763] ACPI: LAPIC_NMI (acpi_id[0x94] high level lint[0x1])
[    0.028764] ACPI: LAPIC_NMI (acpi_id[0x95] high level lint[0x1])
[    0.028765] ACPI: LAPIC_NMI (acpi_id[0x96] high level lint[0x1])
[    0.028767] ACPI: LAPIC_NMI (acpi_id[0x97] high level lint[0x1])
[    0.028768] ACPI: LAPIC_NMI (acpi_id[0x98] high level lint[0x1])
[    0.028769] ACPI: LAPIC_NMI (acpi_id[0x99] high level lint[0x1])
[    0.028770] ACPI: LAPIC_NMI (acpi_id[0x9a] high level lint[0x1])
[    0.028771] ACPI: LAPIC_NMI (acpi_id[0x9b] high level lint[0x1])
[    0.028772] ACPI: LAPIC_NMI (acpi_id[0x9c] high level lint[0x1])
[    0.028773] ACPI: LAPIC_NMI (acpi_id[0x9d] high level lint[0x1])
[    0.028775] ACPI: LAPIC_NMI (acpi_id[0x9e] high level lint[0x1])
[    0.028776] ACPI: LAPIC_NMI (acpi_id[0x9f] high level lint[0x1])
[    0.028777] ACPI: LAPIC_NMI (acpi_id[0xa0] high level lint[0x1])
[    0.028778] ACPI: LAPIC_NMI (acpi_id[0xa1] high level lint[0x1])
[    0.028779] ACPI: LAPIC_NMI (acpi_id[0xa2] high level lint[0x1])
[    0.028780] ACPI: LAPIC_NMI (acpi_id[0xa3] high level lint[0x1])
[    0.028781] ACPI: LAPIC_NMI (acpi_id[0xa4] high level lint[0x1])
[    0.028782] ACPI: LAPIC_NMI (acpi_id[0xa5] high level lint[0x1])
[    0.028784] ACPI: LAPIC_NMI (acpi_id[0xa6] high level lint[0x1])
[    0.028785] ACPI: LAPIC_NMI (acpi_id[0xa7] high level lint[0x1])
[    0.028786] ACPI: LAPIC_NMI (acpi_id[0xa8] high level lint[0x1])
[    0.028787] ACPI: LAPIC_NMI (acpi_id[0xa9] high level lint[0x1])
[    0.028788] ACPI: LAPIC_NMI (acpi_id[0xaa] high level lint[0x1])
[    0.028789] ACPI: LAPIC_NMI (acpi_id[0xab] high level lint[0x1])
[    0.028790] ACPI: LAPIC_NMI (acpi_id[0xac] high level lint[0x1])
[    0.028792] ACPI: LAPIC_NMI (acpi_id[0xad] high level lint[0x1])
[    0.028793] ACPI: LAPIC_NMI (acpi_id[0xae] high level lint[0x1])
[    0.028794] ACPI: LAPIC_NMI (acpi_id[0xaf] high level lint[0x1])
[    0.028795] ACPI: LAPIC_NMI (acpi_id[0xb0] high level lint[0x1])
[    0.028796] ACPI: LAPIC_NMI (acpi_id[0xb1] high level lint[0x1])
[    0.028797] ACPI: LAPIC_NMI (acpi_id[0xb2] high level lint[0x1])
[    0.028798] ACPI: LAPIC_NMI (acpi_id[0xb3] high level lint[0x1])
[    0.028799] ACPI: LAPIC_NMI (acpi_id[0xb4] high level lint[0x1])
[    0.028801] ACPI: LAPIC_NMI (acpi_id[0xb5] high level lint[0x1])
[    0.028802] ACPI: LAPIC_NMI (acpi_id[0xb6] high level lint[0x1])
[    0.028803] ACPI: LAPIC_NMI (acpi_id[0xb7] high level lint[0x1])
[    0.028804] ACPI: LAPIC_NMI (acpi_id[0xb8] high level lint[0x1])
[    0.028805] ACPI: LAPIC_NMI (acpi_id[0xb9] high level lint[0x1])
[    0.028806] ACPI: LAPIC_NMI (acpi_id[0xba] high level lint[0x1])
[    0.028808] ACPI: LAPIC_NMI (acpi_id[0xbb] high level lint[0x1])
[    0.028809] ACPI: LAPIC_NMI (acpi_id[0xbc] high level lint[0x1])
[    0.028810] ACPI: LAPIC_NMI (acpi_id[0xbd] high level lint[0x1])
[    0.028811] ACPI: LAPIC_NMI (acpi_id[0xbe] high level lint[0x1])
[    0.028812] ACPI: LAPIC_NMI (acpi_id[0xbf] high level lint[0x1])
[    0.028823] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[    0.028828] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-47
[    0.028833] IOAPIC[2]: apic_id 10, version 32, address 0xfec40000, GSI 48-71
[    0.028838] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.028840] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.028848] Using ACPI (MADT) for SMP configuration information
[    0.028850] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[    0.028854] ACPI: SPCR: SPCR table version 1
[    0.028856] ACPI: SPCR: console: uart,io,0x3f8,115200
[    0.028859] smpboot: Allowing 88 CPUs, 0 hotplug CPUs
[    0.028876] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.028878] PM: Registered nosave memory: [mem 0x0009b000-0x0009bfff]
[    0.028879] PM: Registered nosave memory: [mem 0x0009c000-0x0009ffff]
[    0.028880] PM: Registered nosave memory: [mem 0x000a0000-0x000dffff]
[    0.028881] PM: Registered nosave memory: [mem 0x000e0000-0x000fffff]
[    0.028883] PM: Registered nosave memory: [mem 0x79fa0000-0x7ac4ffff]
[    0.028885] PM: Registered nosave memory: [mem 0x7ac50000-0x7b67ffff]
[    0.028886] PM: Registered nosave memory: [mem 0x7b680000-0x7b7ccfff]
[    0.028888] PM: Registered nosave memory: [mem 0x7b800000-0x8fffffff]
[    0.028889] PM: Registered nosave memory: [mem 0x90000000-0xfed1bfff]
[    0.028890] PM: Registered nosave memory: [mem 0xfed1c000-0xfed1ffff]
[    0.028891] PM: Registered nosave memory: [mem 0xfed20000-0xff3fffff]
[    0.028893] PM: Registered nosave memory: [mem 0xff400000-0xffffffff]
[    0.028895] [mem 0x90000000-0xfed1bfff] available for PCI devices
[    0.028896] Booting paravirtualized kernel on bare hardware
[    0.028899] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.142569] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:88 nr_cpu_ids:88 nr_node_ids:2
[    0.149960] percpu: Embedded 56 pages/cpu s192512 r8192 d28672 u262144
[    0.150054] Built 2 zonelists, mobility grouping on.  Total pages: 16490579
[    0.150056] Policy zone: Normal
[    0.150058] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.0-rc7+ root=/dev/mapper/intel--wildcatpass--07-root ro crashkernel=512M resume=/dev/mapper/intel--wildcatpass--07-swap rd.lvm.lv=intel-wildcatpass-07/root rd.lvm.lv=intel-wildcatpass-07/swap console=ttyS0,115200n81
[    0.151328] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.195160] Memory: 1589000K/67009972K available (12292K kernel code, 3223K rwdata, 4424K rodata, 2404K init, 7028K bss, 1777460K reserved, 0K cma-reserved)
[    0.195916] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=88, Nodes=2
[    0.195957] Kernel/User page tables isolation: enabled
[    0.196026] ftrace: allocating 37861 entries in 148 pages
[    0.212037] ftrace: allocated 148 pages with 3 groups
[    0.212776] rcu: Hierarchical RCU implementation.
[    0.212778] rcu: 	RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=88.
[    0.212780] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    0.212782] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=88
[    0.217070] NR_IRQS: 524544, nr_irqs: 1944, preallocated irqs: 16
[    0.217472] random: get_random_bytes called from start_kernel+0x365/0x53f with crng_init=0
[    0.223825] Console: colour VGA+ 80x25
[    2.948034] printk: console [ttyS0] enabled
[    2.952820] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    2.965342] ACPI: Core revision 20191018
[    2.970907] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[    2.981793] APIC: Switch to symmetric I/O mode setup
[    2.987465] x2apic: IRQ remapping doesn't support X2APIC mode
[    2.993957] Switched APIC routing to physical flat.
[    2.999997] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    3.010809] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa36e579bf, max_idle_ns: 440795269840 ns
[    3.022555] Calibrating delay loop (skipped), value calculated using timer frequency.. 4389.82 BogoMIPS (lpj=2194914)
[    3.023555] pid_max: default: 90112 minimum: 704
[    3.024728] LSM: Security Framework initializing
[    3.025582] Yama: becoming mindful.
[    3.026577] SELinux:  Initializing.
[    3.044018] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc)
[    3.053158] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, vmalloc)
[    3.053889] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[    3.054831] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[    3.057105] mce: CPU0: Thermal monitoring enabled (TM1)
[    3.057613] process: using mwait in idle threads
[    3.058558] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
[    3.059554] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
[    3.060557] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    3.061556] Spectre V2 : Mitigation: Full generic retpoline
[    3.062554] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    3.063554] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    3.064556] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    3.065554] Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl
[    3.066554] Speculative Store Bypass: Vulnerable
[    3.067557] TAA: Vulnerable: Clear CPU buffers attempted, no microcode
[    3.068554] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[    3.070647] Freeing SMP alternatives memory: 32K
[    3.073741] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (family: 0x6, model: 0x4f, stepping: 0x1)
[    3.074823] Performance Events: PEBS fmt2+, Broadwell events, 16-deep LBR, full-width counters, Intel PMU driver.
[    3.075556] ... version:                3
[    3.076555] ... bit width:              48
[    3.077555] ... generic registers:      4
[    3.078555] ... value mask:             0000ffffffffffff
[    3.079555] ... max period:             00007fffffffffff
[    3.080555] ... fixed-purpose events:   3
[    3.081555] ... event mask:             000000070000000f
[    3.082667] rcu: Hierarchical SRCU implementation.
[    3.093552] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    3.094788] smp: Bringing up secondary CPUs ...
[    3.095642] x86: Booting SMP configuration:
[    3.096556] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21
[    3.150556] .... node  #1, CPUs:   #22
[    2.797134] smpboot: CPU 22 Converting physical 0 to logical die 1
[    3.226695]  #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43
[    3.287556] .... node  #0, CPUs:   #44
[    3.288760] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    3.290556] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[    3.291718]  #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65
[    3.347557] .... node  #1, CPUs:   #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87
[    3.408760] smp: Brought up 2 nodes, 88 CPUs
[    3.410556] smpboot: Max logical packages: 2
[    3.411559] smpboot: Total of 88 processors activated (386613.74 BogoMIPS)
[    3.564568] node 0 initialised, 7688738 pages in 149ms
[    3.575570] node 1 initialised, 8222140 pages in 160ms
[    3.583666] devtmpfs: initialized
[    3.586641] x86/mm: Memory block size: 2048MB
[    3.592601] PM: Registering ACPI NVS region [mem 0x7ac50000-0x7b67ffff] (10682368 bytes)
[    3.602577] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    3.612647] futex hash table entries: 32768 (order: 9, 2097152 bytes, vmalloc)
[    3.622514] pinctrl core: initialized pinctrl subsystem
[    3.627598] thermal_sys: Registered thermal governor 'fair_share'
[    3.627599] thermal_sys: Registered thermal governor 'bang_bang'
[    3.634556] thermal_sys: Registered thermal governor 'step_wise'
[    3.641556] thermal_sys: Registered thermal governor 'user_space'
[    3.648864] NET: Registered protocol family 16
[    3.660691] audit: initializing netlink subsys (disabled)
[    3.666589] audit: type=2000 audit(1580801596.653:1): state=initialized audit_enabled=0 res=1
[    3.675560] cpuidle: using governor menu
[    3.680664] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    3.689557] ACPI: bus type PCI registered
[    3.693556] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    3.700762] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
[    3.711598] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
[    3.719573] PCI: Using configuration type 1 for base access
[    3.732804] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    3.739560] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    3.787708] cryptd: max_cpu_qlen set to 1000
[    3.804763] ACPI: Added _OSI(Module Device)
[    3.809557] ACPI: Added _OSI(Processor Device)
[    3.814556] ACPI: Added _OSI(3.0 _SCP Extensions)
[    3.819555] ACPI: Added _OSI(Processor Aggregator Device)
[    3.825556] ACPI: Added _OSI(Linux-Dell-Video)
[    3.830555] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    3.836555] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    3.985794] ACPI: 4 ACPI AML tables successfully acquired and loaded
[    4.008810] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    4.020553] ACPI: Dynamic OEM Table Load:
[    4.065460] ACPI: Interpreter enabled
[    4.069569] ACPI: (supports S0 S5)
[    4.073556] ACPI: Using IOAPIC for interrupt routing
[    4.078626] HEST: Table parsing has been initialized.
[    4.084559] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    4.095551] ACPI: Enabled 6 GPEs in block 00 to 3F
[    4.139162] ACPI: PCI Root Bridge [UNC1] (domain 0000 [bus ff])
[    4.146561] acpi PNP0A03:02: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    4.158562] acpi PNP0A03:02: _OSC: platform does not support [SHPCHotplug AER LTR]
[    4.167414] acpi PNP0A03:02: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    4.176556] acpi PNP0A03:02: FADT indicates ASPM is unsupported, using BIOS configuration
[    4.185601] PCI host bridge to bus 0000:ff
[    4.189556] pci_bus 0000:ff: Unknown NUMA node; performance will be reduced
[    4.197557] pci_bus 0000:ff: root bus resource [bus ff]
[    4.203567] pci 0000:ff:08.0: [8086:6f80] type 00 class 0x088000
[    4.209635] pci 0000:ff:08.2: [8086:6f32] type 00 class 0x110100
[    4.216606] pci 0000:ff:08.3: [8086:6f83] type 00 class 0x088000
[    4.223622] pci 0000:ff:09.0: [8086:6f90] type 00 class 0x088000
[    4.230604] pci 0000:ff:09.2: [8086:6f33] type 00 class 0x110100
[    4.236603] pci 0000:ff:09.3: [8086:6f93] type 00 class 0x088000
[    4.243616] pci 0000:ff:0b.0: [8086:6f81] type 00 class 0x088000
[    4.250600] pci 0000:ff:0b.1: [8086:6f36] type 00 class 0x110100
[    4.257605] pci 0000:ff:0b.2: [8086:6f37] type 00 class 0x110100
[    4.263604] pci 0000:ff:0b.3: [8086:6f76] type 00 class 0x088000
[    4.270610] pci 0000:ff:0c.0: [8086:6fe0] type 00 class 0x088000
[    4.277601] pci 0000:ff:0c.1: [8086:6fe1] type 00 class 0x088000
[    4.284601] pci 0000:ff:0c.2: [8086:6fe2] type 00 class 0x088000
[    4.290600] pci 0000:ff:0c.3: [8086:6fe3] type 00 class 0x088000
[    4.297600] pci 0000:ff:0c.4: [8086:6fe4] type 00 class 0x088000
[    4.304600] pci 0000:ff:0c.5: [8086:6fe5] type 00 class 0x088000
[    4.310602] pci 0000:ff:0c.6: [8086:6fe6] type 00 class 0x088000
[    4.317605] pci 0000:ff:0c.7: [8086:6fe7] type 00 class 0x088000
[    4.324601] pci 0000:ff:0d.0: [8086:6fe8] type 00 class 0x088000
[    4.331607] pci 0000:ff:0d.1: [8086:6fe9] type 00 class 0x088000
[    4.337606] pci 0000:ff:0d.2: [8086:6fea] type 00 class 0x088000
[    4.344602] pci 0000:ff:0d.3: [8086:6feb] type 00 class 0x088000
[    4.351601] pci 0000:ff:0d.4: [8086:6fec] type 00 class 0x088000
[    4.357602] pci 0000:ff:0d.5: [8086:6fed] type 00 class 0x088000
[    4.364605] pci 0000:ff:0d.6: [8086:6fee] type 00 class 0x088000
[    4.371602] pci 0000:ff:0d.7: [8086:6fef] type 00 class 0x088000
[    4.378603] pci 0000:ff:0e.0: [8086:6ff0] type 00 class 0x088000
[    4.384602] pci 0000:ff:0e.1: [8086:6ff1] type 00 class 0x088000
[    4.391604] pci 0000:ff:0e.2: [8086:6ff2] type 00 class 0xffffff
[    4.398602] pci 0000:ff:0e.3: [8086:6ff3] type 00 class 0xffffff
[    4.405602] pci 0000:ff:0e.4: [8086:6ff4] type 00 class 0xffffff
[    4.411603] pci 0000:ff:0e.5: [8086:6ff5] type 00 class 0xffffff
[    4.418608] pci 0000:ff:0f.0: [8086:6ff8] type 00 class 0x088000
[    4.425606] pci 0000:ff:0f.1: [8086:6ff9] type 00 class 0x088000
[    4.431603] pci 0000:ff:0f.2: [8086:6ffa] type 00 class 0x088000
[    4.438602] pci 0000:ff:0f.3: [8086:6ffb] type 00 class 0x088000
[    4.445603] pci 0000:ff:0f.4: [8086:6ffc] type 00 class 0x088000
[    4.452603] pci 0000:ff:0f.5: [8086:6ffd] type 00 class 0x088000
[    4.458603] pci 0000:ff:0f.6: [8086:6ffe] type 00 class 0x088000
[    4.465604] pci 0000:ff:10.0: [8086:6f1d] type 00 class 0x088000
[    4.472605] pci 0000:ff:10.1: [8086:6f34] type 00 class 0x110100
[    4.479609] pci 0000:ff:10.5: [8086:6f1e] type 00 class 0x088000
[    4.485601] pci 0000:ff:10.6: [8086:6f7d] type 00 class 0x110100
[    4.492601] pci 0000:ff:10.7: [8086:6f1f] type 00 class 0x088000
[    4.499601] pci 0000:ff:12.0: [8086:6fa0] type 00 class 0x088000
[    4.506590] pci 0000:ff:12.1: [8086:6f30] type 00 class 0x110100
[    4.512604] pci 0000:ff:12.4: [8086:6f60] type 00 class 0x088000
[    4.519593] pci 0000:ff:12.5: [8086:6f38] type 00 class 0x110100
[    4.526609] pci 0000:ff:13.0: [8086:6fa8] type 00 class 0x088000
[    4.532641] pci 0000:ff:13.1: [8086:6f71] type 00 class 0x088000
[    4.539616] pci 0000:ff:13.2: [8086:6faa] type 00 class 0x088000
[    4.546617] pci 0000:ff:13.3: [8086:6fab] type 00 class 0x088000
[    4.553621] pci 0000:ff:13.6: [8086:6fae] type 00 class 0x088000
[    4.559604] pci 0000:ff:13.7: [8086:6faf] type 00 class 0x088000
[    4.566606] pci 0000:ff:14.0: [8086:6fb0] type 00 class 0x088000
[    4.573622] pci 0000:ff:14.1: [8086:6fb1] type 00 class 0x088000
[    4.580617] pci 0000:ff:14.2: [8086:6fb2] type 00 class 0x088000
[    4.586620] pci 0000:ff:14.3: [8086:6fb3] type 00 class 0x088000
[    4.593615] pci 0000:ff:14.4: [8086:6fbc] type 00 class 0x088000
[    4.600604] pci 0000:ff:14.5: [8086:6fbd] type 00 class 0x088000
[    4.607605] pci 0000:ff:14.6: [8086:6fbe] type 00 class 0x088000
[    4.613603] pci 0000:ff:14.7: [8086:6fbf] type 00 class 0x088000
[    4.620607] pci 0000:ff:16.0: [8086:6f68] type 00 class 0x088000
[    4.627642] pci 0000:ff:16.1: [8086:6f79] type 00 class 0x088000
[    4.634618] pci 0000:ff:16.2: [8086:6f6a] type 00 class 0x088000
[    4.640621] pci 0000:ff:16.3: [8086:6f6b] type 00 class 0x088000
[    4.647618] pci 0000:ff:16.6: [8086:6f6e] type 00 class 0x088000
[    4.654604] pci 0000:ff:16.7: [8086:6f6f] type 00 class 0x088000
[    4.660606] pci 0000:ff:17.0: [8086:6fd0] type 00 class 0x088000
[    4.667643] pci 0000:ff:17.1: [8086:6fd1] type 00 class 0x088000
[    4.674617] pci 0000:ff:17.2: [8086:6fd2] type 00 class 0x088000
[    4.681618] pci 0000:ff:17.3: [8086:6fd3] type 00 class 0x088000
[    4.687617] pci 0000:ff:17.4: [8086:6fb8] type 00 class 0x088000
[    4.694609] pci 0000:ff:17.5: [8086:6fb9] type 00 class 0x088000
[    4.701603] pci 0000:ff:17.6: [8086:6fba] type 00 class 0x088000
[    4.708604] pci 0000:ff:17.7: [8086:6fbb] type 00 class 0x088000
[    4.714616] pci 0000:ff:1e.0: [8086:6f98] type 00 class 0x088000
[    4.721602] pci 0000:ff:1e.1: [8086:6f99] type 00 class 0x088000
[    4.728602] pci 0000:ff:1e.2: [8086:6f9a] type 00 class 0x088000
[    4.735608] pci 0000:ff:1e.3: [8086:6fc0] type 00 class 0x088000
[    4.741592] pci 0000:ff:1e.4: [8086:6f9c] type 00 class 0x088000
[    4.748609] pci 0000:ff:1f.0: [8086:6f88] type 00 class 0x088000
[    4.755604] pci 0000:ff:1f.2: [8086:6f8a] type 00 class 0x088000
[    4.761700] ACPI: PCI Root Bridge [UNC0] (domain 0000 [bus 7f])
[    4.768558] acpi PNP0A03:03: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    4.779073] acpi PNP0A03:03: _OSC: platform does not support [SHPCHotplug AER LTR]
[    4.788417] acpi PNP0A03:03: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    4.797557] acpi PNP0A03:03: FADT indicates ASPM is unsupported, using BIOS configuration
[    4.806598] PCI host bridge to bus 0000:7f
[    4.810556] pci_bus 0000:7f: Unknown NUMA node; performance will be reduced
[    4.818556] pci_bus 0000:7f: root bus resource [bus 7f]
[    4.824565] pci 0000:7f:08.0: [8086:6f80] type 00 class 0x088000
[    4.831614] pci 0000:7f:08.2: [8086:6f32] type 00 class 0x110100
[    4.837612] pci 0000:7f:08.3: [8086:6f83] type 00 class 0x088000
[    4.844625] pci 0000:7f:09.0: [8086:6f90] type 00 class 0x088000
[    4.851609] pci 0000:7f:09.2: [8086:6f33] type 00 class 0x110100
[    4.858612] pci 0000:7f:09.3: [8086:6f93] type 00 class 0x088000
[    4.864629] pci 0000:7f:0b.0: [8086:6f81] type 00 class 0x088000
[    4.871607] pci 0000:7f:0b.1: [8086:6f36] type 00 class 0x110100
[    4.878605] pci 0000:7f:0b.2: [8086:6f37] type 00 class 0x110100
[    4.884605] pci 0000:7f:0b.3: [8086:6f76] type 00 class 0x088000
[    4.891609] pci 0000:7f:0c.0: [8086:6fe0] type 00 class 0x088000
[    4.898606] pci 0000:7f:0c.1: [8086:6fe1] type 00 class 0x088000
[    4.905607] pci 0000:7f:0c.2: [8086:6fe2] type 00 class 0x088000
[    4.911610] pci 0000:7f:0c.3: [8086:6fe3] type 00 class 0x088000
[    4.918608] pci 0000:7f:0c.4: [8086:6fe4] type 00 class 0x088000
[    4.925606] pci 0000:7f:0c.5: [8086:6fe5] type 00 class 0x088000
[    4.932609] pci 0000:7f:0c.6: [8086:6fe6] type 00 class 0x088000
[    4.938607] pci 0000:7f:0c.7: [8086:6fe7] type 00 class 0x088000
[    4.945608] pci 0000:7f:0d.0: [8086:6fe8] type 00 class 0x088000
[    4.952606] pci 0000:7f:0d.1: [8086:6fe9] type 00 class 0x088000
[    4.959615] pci 0000:7f:0d.2: [8086:6fea] type 00 class 0x088000
[    4.965626] pci 0000:7f:0d.3: [8086:6feb] type 00 class 0x088000
[    4.972608] pci 0000:7f:0d.4: [8086:6fec] type 00 class 0x088000
[    4.979611] pci 0000:7f:0d.5: [8086:6fed] type 00 class 0x088000
[    4.985608] pci 0000:7f:0d.6: [8086:6fee] type 00 class 0x088000
[    4.992607] pci 0000:7f:0d.7: [8086:6fef] type 00 class 0x088000
[    4.999608] pci 0000:7f:0e.0: [8086:6ff0] type 00 class 0x088000
[    5.006607] pci 0000:7f:0e.1: [8086:6ff1] type 00 class 0x088000
[    5.012607] pci 0000:7f:0e.2: [8086:6ff2] type 00 class 0xffffff
[    5.019610] pci 0000:7f:0e.3: [8086:6ff3] type 00 class 0xffffff
[    5.026615] pci 0000:7f:0e.4: [8086:6ff4] type 00 class 0xffffff
[    5.033609] pci 0000:7f:0e.5: [8086:6ff5] type 00 class 0xffffff
[    5.039611] pci 0000:7f:0f.0: [8086:6ff8] type 00 class 0x088000
[    5.046607] pci 0000:7f:0f.1: [8086:6ff9] type 00 class 0x088000
[    5.053608] pci 0000:7f:0f.2: [8086:6ffa] type 00 class 0x088000
[    5.060608] pci 0000:7f:0f.3: [8086:6ffb] type 00 class 0x088000
[    5.066609] pci 0000:7f:0f.4: [8086:6ffc] type 00 class 0x088000
[    5.073617] pci 0000:7f:0f.5: [8086:6ffd] type 00 class 0x088000
[    5.080609] pci 0000:7f:0f.6: [8086:6ffe] type 00 class 0x088000
[    5.087612] pci 0000:7f:10.0: [8086:6f1d] type 00 class 0x088000
[    5.093608] pci 0000:7f:10.1: [8086:6f34] type 00 class 0x110100
[    5.100611] pci 0000:7f:10.5: [8086:6f1e] type 00 class 0x088000
[    5.107607] pci 0000:7f:10.6: [8086:6f7d] type 00 class 0x110100
[    5.113606] pci 0000:7f:10.7: [8086:6f1f] type 00 class 0x088000
[    5.120607] pci 0000:7f:12.0: [8086:6fa0] type 00 class 0x088000
[    5.127596] pci 0000:7f:12.1: [8086:6f30] type 00 class 0x110100
[    5.134614] pci 0000:7f:12.4: [8086:6f60] type 00 class 0x088000
[    5.140594] pci 0000:7f:12.5: [8086:6f38] type 00 class 0x110100
[    5.147618] pci 0000:7f:13.0: [8086:6fa8] type 00 class 0x088000
[    5.154654] pci 0000:7f:13.1: [8086:6f71] type 00 class 0x088000
[    5.161630] pci 0000:7f:13.2: [8086:6faa] type 00 class 0x088000
[    5.167628] pci 0000:7f:13.3: [8086:6fab] type 00 class 0x088000
[    5.174628] pci 0000:7f:13.6: [8086:6fae] type 00 class 0x088000
[    5.181618] pci 0000:7f:13.7: [8086:6faf] type 00 class 0x088000
[    5.188613] pci 0000:7f:14.0: [8086:6fb0] type 00 class 0x088000
[    5.194628] pci 0000:7f:14.1: [8086:6fb1] type 00 class 0x088000
[    5.201629] pci 0000:7f:14.2: [8086:6fb2] type 00 class 0x088000
[    5.208627] pci 0000:7f:14.3: [8086:6fb3] type 00 class 0x088000
[    5.215626] pci 0000:7f:14.4: [8086:6fbc] type 00 class 0x088000
[    5.221611] pci 0000:7f:14.5: [8086:6fbd] type 00 class 0x088000
[    5.228611] pci 0000:7f:14.6: [8086:6fbe] type 00 class 0x088000
[    5.235613] pci 0000:7f:14.7: [8086:6fbf] type 00 class 0x088000
[    5.242621] pci 0000:7f:16.0: [8086:6f68] type 00 class 0x088000
[    5.248655] pci 0000:7f:16.1: [8086:6f79] type 00 class 0x088000
[    5.255635] pci 0000:7f:16.2: [8086:6f6a] type 00 class 0x088000
[    5.262629] pci 0000:7f:16.3: [8086:6f6b] type 00 class 0x088000
[    5.269630] pci 0000:7f:16.6: [8086:6f6e] type 00 class 0x088000
[    5.275613] pci 0000:7f:16.7: [8086:6f6f] type 00 class 0x088000
[    5.282614] pci 0000:7f:17.0: [8086:6fd0] type 00 class 0x088000
[    5.289657] pci 0000:7f:17.1: [8086:6fd1] type 00 class 0x088000
[    5.296631] pci 0000:7f:17.2: [8086:6fd2] type 00 class 0x088000
[    5.302630] pci 0000:7f:17.3: [8086:6fd3] type 00 class 0x088000
[    5.309633] pci 0000:7f:17.4: [8086:6fb8] type 00 class 0x088000
[    5.316612] pci 0000:7f:17.5: [8086:6fb9] type 00 class 0x088000
[    5.323612] pci 0000:7f:17.6: [8086:6fba] type 00 class 0x088000
[    5.329613] pci 0000:7f:17.7: [8086:6fbb] type 00 class 0x088000
[    5.336625] pci 0000:7f:1e.0: [8086:6f98] type 00 class 0x088000
[    5.343611] pci 0000:7f:1e.1: [8086:6f99] type 00 class 0x088000
[    5.350609] pci 0000:7f:1e.2: [8086:6f9a] type 00 class 0x088000
[    5.356611] pci 0000:7f:1e.3: [8086:6fc0] type 00 class 0x088000
[    5.363596] pci 0000:7f:1e.4: [8086:6f9c] type 00 class 0x088000
[    5.370615] pci 0000:7f:1f.0: [8086:6f88] type 00 class 0x088000
[    5.377610] pci 0000:7f:1f.2: [8086:6f8a] type 00 class 0x088000
[    5.401517] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-7e])
[    5.408560] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    5.419056] acpi PNP0A08:00: _OSC: platform does not support [SHPCHotplug AER LTR]
[    5.428417] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    5.436556] acpi PNP0A08:00: FADT indicates ASPM is unsupported, using BIOS configuration
[    5.446866] PCI host bridge to bus 0000:00
[    5.450558] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    5.458556] pci_bus 0000:00: root bus resource [io  0x1000-0x7fff window]
[    5.465556] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    5.474556] pci_bus 0000:00: root bus resource [mem 0x90000000-0xc7ffbfff window]
[    5.482556] pci_bus 0000:00: root bus resource [mem 0x380000000000-0x383fffffffff window]
[    5.491556] pci_bus 0000:00: root bus resource [bus 00-7e]
[    5.497564] pci 0000:00:00.0: [8086:6f00] type 00 class 0x060000
[    5.504737] pci 0000:00:01.0: [8086:6f02] type 01 class 0x060400
[    5.511625] pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
[    5.518710] pci 0000:00:02.0: [8086:6f04] type 01 class 0x060400
[    5.525623] pci 0000:00:02.0: PME# supported from D0 D3hot D3cold
[    5.531698] pci 0000:00:02.2: [8086:6f06] type 01 class 0x060400
[    5.538623] pci 0000:00:02.2: PME# supported from D0 D3hot D3cold
[    5.545696] pci 0000:00:03.0: [8086:6f08] type 01 class 0x060400
[    5.552622] pci 0000:00:03.0: PME# supported from D0 D3hot D3cold
[    5.559700] pci 0000:00:04.0: [8086:6f20] type 00 class 0x088000
[    5.566570] pci 0000:00:04.0: reg 0x10: [mem 0x383ffff2c000-0x383ffff2ffff 64bit]
[    5.574708] pci 0000:00:04.1: [8086:6f21] type 00 class 0x088000
[    5.581568] pci 0000:00:04.1: reg 0x10: [mem 0x383ffff28000-0x383ffff2bfff 64bit]
[    5.589704] pci 0000:00:04.2: [8086:6f22] type 00 class 0x088000
[    5.596568] pci 0000:00:04.2: reg 0x10: [mem 0x383ffff24000-0x383ffff27fff 64bit]
[    5.604700] pci 0000:00:04.3: [8086:6f23] type 00 class 0x088000
[    5.611568] pci 0000:00:04.3: reg 0x10: [mem 0x383ffff20000-0x383ffff23fff 64bit]
[    5.619705] pci 0000:00:04.4: [8086:6f24] type 00 class 0x088000
[    5.626568] pci 0000:00:04.4: reg 0x10: [mem 0x383ffff1c000-0x383ffff1ffff 64bit]
[    5.635706] pci 0000:00:04.5: [8086:6f25] type 00 class 0x088000
[    5.641568] pci 0000:00:04.5: reg 0x10: [mem 0x383ffff18000-0x383ffff1bfff 64bit]
[    5.650701] pci 0000:00:04.6: [8086:6f26] type 00 class 0x088000
[    5.657568] pci 0000:00:04.6: reg 0x10: [mem 0x383ffff14000-0x383ffff17fff 64bit]
[    5.665707] pci 0000:00:04.7: [8086:6f27] type 00 class 0x088000
[    5.672570] pci 0000:00:04.7: reg 0x10: [mem 0x383ffff10000-0x383ffff13fff 64bit]
[    5.680690] pci 0000:00:05.0: [8086:6f28] type 00 class 0x088000
[    5.687696] pci 0000:00:05.1: [8086:6f29] type 00 class 0x088000
[    5.694710] pci 0000:00:05.2: [8086:6f2a] type 00 class 0x088000
[    5.701682] pci 0000:00:05.4: [8086:6f2c] type 00 class 0x080020
[    5.707565] pci 0000:00:05.4: reg 0x10: [mem 0x91d06000-0x91d06fff]
[    5.714697] pci 0000:00:11.0: [8086:8d7c] type 00 class 0xff0000
[    5.721763] pci 0000:00:11.1: [8086:8d7d] type 00 class 0x0c0500
[    5.728576] pci 0000:00:11.1: reg 0x10: [mem 0x91d05000-0x91d05fff]
[    5.735588] pci 0000:00:11.1: reg 0x20: [io  0x3060-0x307f]
[    5.741737] pci 0000:00:11.4: [8086:8d62] type 00 class 0x010601
[    5.748576] pci 0000:00:11.4: reg 0x10: [io  0x3098-0x309f]
[    5.755563] pci 0000:00:11.4: reg 0x14: [io  0x30cc-0x30cf]
[    5.761563] pci 0000:00:11.4: reg 0x18: [io  0x3090-0x3097]
[    5.767563] pci 0000:00:11.4: reg 0x1c: [io  0x30c8-0x30cb]
[    5.773563] pci 0000:00:11.4: reg 0x20: [io  0x3020-0x303f]
[    5.779563] pci 0000:00:11.4: reg 0x24: [mem 0x91d00000-0x91d007ff]
[    5.786600] pci 0000:00:11.4: PME# supported from D3hot
[    5.792670] pci 0000:00:14.0: [8086:8d31] type 00 class 0x0c0330
[    5.799576] pci 0000:00:14.0: reg 0x10: [mem 0x383ffff00000-0x383ffff0ffff 64bit]
[    5.807618] pci 0000:00:14.0: PME# supported from D3hot D3cold
[    5.814665] pci 0000:00:16.0: [8086:8d3a] type 00 class 0x078000
[    5.821577] pci 0000:00:16.0: reg 0x10: [mem 0x383ffff33000-0x383ffff3300f 64bit]
[    5.829620] pci 0000:00:16.0: PME# supported from D0 D3hot D3cold
[    5.836659] pci 0000:00:16.1: [8086:8d3b] type 00 class 0x078000
[    5.843576] pci 0000:00:16.1: reg 0x10: [mem 0x383ffff32000-0x383ffff3200f 64bit]
[    5.851619] pci 0000:00:16.1: PME# supported from D0 D3hot D3cold
[    5.858672] pci 0000:00:1a.0: [8086:8d2d] type 00 class 0x0c0320
[    5.865577] pci 0000:00:1a.0: reg 0x10: [mem 0x91d02000-0x91d023ff]
[    5.872641] pci 0000:00:1a.0: PME# supported from D0 D3hot D3cold
[    5.878681] pci 0000:00:1c.0: [8086:8d10] type 01 class 0x060400
[    5.885640] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
[    5.892707] pci 0000:00:1c.3: [8086:8d16] type 01 class 0x060400
[    5.899641] pci 0000:00:1c.3: PME# supported from D0 D3hot D3cold
[    5.906701] pci 0000:00:1d.0: [8086:8d26] type 00 class 0x0c0320
[    5.913577] pci 0000:00:1d.0: reg 0x10: [mem 0x91d01000-0x91d013ff]
[    5.920642] pci 0000:00:1d.0: PME# supported from D0 D3hot D3cold
[    5.926672] pci 0000:00:1f.0: [8086:8d44] type 00 class 0x060100
[    5.933773] pci 0000:00:1f.2: [8086:8d02] type 00 class 0x010601
[    5.940572] pci 0000:00:1f.2: reg 0x10: [io  0x30c0-0x30c7]
[    5.946562] pci 0000:00:1f.2: reg 0x14: [io  0x30dc-0x30df]
[    5.953562] pci 0000:00:1f.2: reg 0x18: [io  0x30b8-0x30bf]
[    5.959561] pci 0000:00:1f.2: reg 0x1c: [io  0x30d8-0x30db]
[    5.965562] pci 0000:00:1f.2: reg 0x20: [io  0x3040-0x305f]
[    5.971562] pci 0000:00:1f.2: reg 0x24: [mem 0x91d04000-0x91d047ff]
[    5.978593] pci 0000:00:1f.2: PME# supported from D3hot
[    5.984678] pci 0000:00:1f.3: [8086:8d22] type 00 class 0x0c0500
[    5.991573] pci 0000:00:1f.3: reg 0x10: [mem 0x383ffff31000-0x383ffff310ff 64bit]
[    5.999575] pci 0000:00:1f.3: reg 0x20: [io  0x3000-0x301f]
[    6.005851] pci 0000:00:01.0: PCI bridge to [bus 01]
[    6.011715] pci 0000:00:02.0: PCI bridge to [bus 02]
[    6.017729] pci 0000:03:00.0: [8086:1528] type 00 class 0x020000
[    6.024576] pci 0000:03:00.0: reg 0x10: [mem 0x383fffc00000-0x383fffdfffff 64bit pref]
[    6.033576] pci 0000:03:00.0: reg 0x18: [io  0x2020-0x203f]
[    6.039571] pci 0000:03:00.0: reg 0x20: [mem 0x383fffe04000-0x383fffe07fff 64bit pref]
[    6.048618] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[    6.054579] pci 0000:03:00.0: reg 0x184: [mem 0x91900000-0x91903fff 64bit]
[    6.062557] pci 0000:03:00.0: VF(n) BAR0 space: [mem 0x91900000-0x919fffff 64bit] (contains BAR0 for 64 VFs)
[    6.073568] pci 0000:03:00.0: reg 0x190: [mem 0x91a00000-0x91a03fff 64bit]
[    6.081556] pci 0000:03:00.0: VF(n) BAR3 space: [mem 0x91a00000-0x91afffff 64bit] (contains BAR3 for 64 VFs)
[    6.092775] pci 0000:03:00.1: [8086:1528] type 00 class 0x020000
[    6.098576] pci 0000:03:00.1: reg 0x10: [mem 0x383fffa00000-0x383fffbfffff 64bit pref]
[    6.107562] pci 0000:03:00.1: reg 0x18: [io  0x2000-0x201f]
[    6.114570] pci 0000:03:00.1: reg 0x20: [mem 0x383fffe00000-0x383fffe03fff 64bit pref]
[    6.122617] pci 0000:03:00.1: PME# supported from D0 D3hot D3cold
[    6.129575] pci 0000:03:00.1: reg 0x184: [mem 0x91b00000-0x91b03fff 64bit]
[    6.137556] pci 0000:03:00.1: VF(n) BAR0 space: [mem 0x91b00000-0x91bfffff 64bit] (contains BAR0 for 64 VFs)
[    6.148569] pci 0000:03:00.1: reg 0x190: [mem 0x91c00000-0x91c03fff 64bit]
[    6.155556] pci 0000:03:00.1: VF(n) BAR3 space: [mem 0x91c00000-0x91cfffff 64bit] (contains BAR3 for 64 VFs)
[    6.166775] pci 0000:00:02.2: PCI bridge to [bus 03-04]
[    6.172557] pci 0000:00:02.2:   bridge window [io  0x2000-0x2fff]
[    6.179557] pci 0000:00:02.2:   bridge window [mem 0x91900000-0x91cfffff]
[    6.187558] pci 0000:00:02.2:   bridge window [mem 0x383fffa00000-0x383fffefffff 64bit pref]
[    6.196706] pci 0000:00:03.0: PCI bridge to [bus 05]
[    6.202603] pci 0000:00:1c.0: PCI bridge to [bus 06]
[    6.207621] pci 0000:07:00.0: [102b:0522] type 00 class 0x030000
[    6.214589] pci 0000:07:00.0: reg 0x10: [mem 0x90000000-0x90ffffff pref]
[    6.222569] pci 0000:07:00.0: reg 0x14: [mem 0x91800000-0x91803fff]
[    6.229569] pci 0000:07:00.0: reg 0x18: [mem 0x91000000-0x917fffff]
[    6.236610] pci 0000:07:00.0: reg 0x30: [mem 0xffff0000-0xffffffff pref]
[    6.243754] pci 0000:00:1c.3: PCI bridge to [bus 07]
[    6.249560] pci 0000:00:1c.3:   bridge window [mem 0x91000000-0x918fffff]
[    6.256560] pci 0000:00:1c.3:   bridge window [mem 0x90000000-0x90ffffff 64bit pref]
[    6.265969] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 *10 11 12 14 15)
[    6.273600] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *9 10 11 12 14 15)
[    6.281601] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 7 9 10 11 12 14 15)
[    6.290598] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
[    6.298596] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
[    6.306597] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
[    6.315596] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
[    6.324596] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 11 12 14 15) *0, disabled.
[    6.334830] ACPI: PCI Root Bridge [PCI1] (domain 0000 [bus 80-fe])
[    6.341559] acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    6.352054] acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug AER LTR]
[    6.361403] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
[    6.369556] acpi PNP0A08:01: FADT indicates ASPM is unsupported, using BIOS configuration
[    6.378741] PCI host bridge to bus 0000:80
[    6.383557] pci_bus 0000:80: root bus resource [io  0x8000-0xffff window]
[    6.391556] pci_bus 0000:80: root bus resource [mem 0xc8000000-0xfbffbfff window]
[    6.399556] pci_bus 0000:80: root bus resource [mem 0x384000000000-0x387fffffffff window]
[    6.408556] pci_bus 0000:80: root bus resource [bus 80-fe]
[    6.414567] pci 0000:80:04.0: [8086:6f20] type 00 class 0x088000
[    6.421567] pci 0000:80:04.0: reg 0x10: [mem 0x387ffff1c000-0x387ffff1ffff 64bit]
[    6.429678] pci 0000:80:04.1: [8086:6f21] type 00 class 0x088000
[    6.436577] pci 0000:80:04.1: reg 0x10: [mem 0x387ffff18000-0x387ffff1bfff 64bit]
[    6.444684] pci 0000:80:04.2: [8086:6f22] type 00 class 0x088000
[    6.451566] pci 0000:80:04.2: reg 0x10: [mem 0x387ffff14000-0x387ffff17fff 64bit]
[    6.459679] pci 0000:80:04.3: [8086:6f23] type 00 class 0x088000
[    6.466566] pci 0000:80:04.3: reg 0x10: [mem 0x387ffff10000-0x387ffff13fff 64bit]
[    6.475666] pci 0000:80:04.4: [8086:6f24] type 00 class 0x088000
[    6.481566] pci 0000:80:04.4: reg 0x10: [mem 0x387ffff0c000-0x387ffff0ffff 64bit]
[    6.490669] pci 0000:80:04.5: [8086:6f25] type 00 class 0x088000
[    6.496566] pci 0000:80:04.5: reg 0x10: [mem 0x387ffff08000-0x387ffff0bfff 64bit]
[    6.505665] pci 0000:80:04.6: [8086:6f26] type 00 class 0x088000
[    6.512566] pci 0000:80:04.6: reg 0x10: [mem 0x387ffff04000-0x387ffff07fff 64bit]
[    6.520663] pci 0000:80:04.7: [8086:6f27] type 00 class 0x088000
[    6.527566] pci 0000:80:04.7: reg 0x10: [mem 0x387ffff00000-0x387ffff03fff 64bit]
[    6.535662] pci 0000:80:05.0: [8086:6f28] type 00 class 0x088000
[    6.542674] pci 0000:80:05.1: [8086:6f29] type 00 class 0x088000
[    6.549680] pci 0000:80:05.2: [8086:6f2a] type 00 class 0x088000
[    6.555653] pci 0000:80:05.4: [8086:6f2c] type 00 class 0x080020
[    6.562564] pci 0000:80:05.4: reg 0x10: [mem 0xc8000000-0xc8000fff]
[    6.570975] iommu: Default domain type: Passthrough
[    6.576617] pci 0000:07:00.0: vgaarb: setting as boot VGA device
[    6.577553] pci 0000:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    6.592561] pci 0000:07:00.0: vgaarb: bridge control possible
[    6.598555] vgaarb: loaded
[    6.601754] SCSI subsystem initialized
[    6.606576] ACPI: bus type USB registered
[    6.610576] usbcore: registered new interface driver usbfs
[    6.616562] usbcore: registered new interface driver hub
[    6.623595] usbcore: registered new device driver usb
[    6.628593] pps_core: LinuxPPS API ver. 1 registered
[    6.634555] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    6.644558] PTP clock support registered
[    6.649698] EDAC MC: Ver: 3.0.0
[    6.652833] PCI: Using ACPI for IRQ routing
[    6.662434] NetLabel: Initializing
[    6.666556] NetLabel:  domain hash size = 128
[    6.671555] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    6.677578] NetLabel:  unlabeled traffic allowed by default
[    6.683610] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0, 0, 0, 0, 0
[    6.690557] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[    6.699706] clocksource: Switched to clocksource tsc-early
[    6.725538] VFS: Disk quotas dquot_6.6.0
[    6.729976] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    6.737765] pnp: PnP ACPI init
[    6.742088] system 00:01: [io  0x0500-0x053f] has been reserved
[    6.748699] system 00:01: [io  0x0400-0x047f] has been reserved
[    6.755307] system 00:01: [io  0x0540-0x057f] has been reserved
[    6.761915] system 00:01: [io  0x0600-0x061f] has been reserved
[    6.768524] system 00:01: [io  0x0ca0-0x0ca5] could not be reserved
[    6.775511] system 00:01: [io  0x0880-0x0883] has been reserved
[    6.782118] system 00:01: [io  0x0800-0x081f] has been reserved
[    6.788728] system 00:01: [mem 0xfed1c000-0xfed3ffff] could not be reserved
[    6.796499] system 00:01: [mem 0xfed45000-0xfed8bfff] has been reserved
[    6.803885] system 00:01: [mem 0xff000000-0xffffffff] could not be reserved
[    6.811656] system 00:01: [mem 0xfee00000-0xfeefffff] has been reserved
[    6.819041] system 00:01: [mem 0xfed12000-0xfed1200f] has been reserved
[    6.826429] system 00:01: [mem 0xfed12010-0xfed1201f] has been reserved
[    6.833813] system 00:01: [mem 0xfed1b000-0xfed1bfff] has been reserved
[    6.841758] pnp: PnP ACPI: found 4 devices
[    6.852973] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    6.862914] pci 0000:07:00.0: can't claim BAR 6 [mem 0xffff0000-0xffffffff pref]: no compatible bridge window
[    6.874024] pci 0000:00:01.0: PCI bridge to [bus 01]
[    6.879572] pci 0000:00:02.0: PCI bridge to [bus 02]
[    6.885120] pci 0000:00:02.2: PCI bridge to [bus 03-04]
[    6.890956] pci 0000:00:02.2:   bridge window [io  0x2000-0x2fff]
[    6.897761] pci 0000:00:02.2:   bridge window [mem 0x91900000-0x91cfffff]
[    6.905342] pci 0000:00:02.2:   bridge window [mem 0x383fffa00000-0x383fffefffff 64bit pref]
[    6.914767] pci 0000:00:03.0: PCI bridge to [bus 05]
[    6.920318] pci 0000:00:1c.0: PCI bridge to [bus 06]
[    6.925871] pci 0000:07:00.0: BAR 6: assigned [mem 0x91810000-0x9181ffff pref]
[    6.933937] pci 0000:00:1c.3: PCI bridge to [bus 07]
[    6.939482] pci 0000:00:1c.3:   bridge window [mem 0x91000000-0x918fffff]
[    6.947063] pci 0000:00:1c.3:   bridge window [mem 0x90000000-0x90ffffff 64bit pref]
[    6.955715] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
[    6.962614] pci_bus 0000:00: resource 5 [io  0x1000-0x7fff window]
[    6.969507] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[    6.977186] pci_bus 0000:00: resource 7 [mem 0x90000000-0xc7ffbfff window]
[    6.984864] pci_bus 0000:00: resource 8 [mem 0x380000000000-0x383fffffffff window]
[    6.993317] pci_bus 0000:03: resource 0 [io  0x2000-0x2fff]
[    6.999540] pci_bus 0000:03: resource 1 [mem 0x91900000-0x91cfffff]
[    7.006539] pci_bus 0000:03: resource 2 [mem 0x383fffa00000-0x383fffefffff 64bit pref]
[    7.015380] pci_bus 0000:07: resource 1 [mem 0x91000000-0x918fffff]
[    7.022379] pci_bus 0000:07: resource 2 [mem 0x90000000-0x90ffffff 64bit pref]
[    7.030560] pci_bus 0000:80: resource 4 [io  0x8000-0xffff window]
[    7.037457] pci_bus 0000:80: resource 5 [mem 0xc8000000-0xfbffbfff window]
[    7.045135] pci_bus 0000:80: resource 6 [mem 0x384000000000-0x387fffffffff window]
[    7.053767] NET: Registered protocol family 2
[    7.059309] tcp_listen_portaddr_hash hash table entries: 32768 (order: 7, 524288 bytes, vmalloc)
[    7.069368] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc)
[    7.079382] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc)
[    7.087926] TCP: Hash tables configured (established 524288 bind 65536)
[    7.095590] UDP hash table entries: 32768 (order: 8, 1048576 bytes, vmalloc)
[    7.103689] UDP-Lite hash table entries: 32768 (order: 8, 1048576 bytes, vmalloc)
[    7.112625] NET: Registered protocol family 1
[    7.117492] NET: Registered protocol family 44
[    7.147669] pci 0000:00:1a.0: quirk_usb_early_handoff+0x0/0x643 took 22276 usecs
[    7.178682] pci 0000:00:1d.0: quirk_usb_early_handoff+0x0/0x643 took 22202 usecs
[    7.186969] pci 0000:07:00.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    7.196318] PCI: CLS 32 bytes, default 64
[    7.200876] Trying to unpack rootfs image as initramfs...
[    8.737963] Freeing initrd memory: 80964K
[    8.742474] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    8.749672] software IO TLB: mapped [mem 0x55000000-0x59000000] (64MB)
[    8.784367] Initialise system trusted keyrings
[    8.789343] Key type blacklist registered
[    8.793892] workingset: timestamp_bits=36 max_order=24 bucket_order=0
[    8.803008] zbud: loaded
[    8.806668] Platform Keyring initialized
[    8.816772] NET: Registered protocol family 38
[    8.821736] Key type asymmetric registered
[    8.826308] Asymmetric key parser 'x509' registered
[    8.831760] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 246)
[    8.840237] io scheduler mq-deadline registered
[    8.845294] io scheduler kyber registered
[    8.849809] io scheduler bfq registered
[    8.854926] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[    8.863140] pcieport 0000:00:01.0: PME: Signaling with IRQ 25
[    8.869934] pcieport 0000:00:02.0: PME: Signaling with IRQ 26
[    8.876611] pcieport 0000:00:02.2: PME: Signaling with IRQ 27
[    8.883265] pcieport 0000:00:03.0: PME: Signaling with IRQ 28
[    8.889964] pcieport 0000:00:1c.0: PME: Signaling with IRQ 29
[    8.896616] pcieport 0000:00:1c.3: PME: Signaling with IRQ 30
[    8.903428] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    8.916907] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[    8.925257] ACPI: Power Button [PWRF]
[    9.037946] ERST: Error Record Serialization Table (ERST) support is initialized.
[    9.046303] pstore: Registered erst as persistent store backend
[    9.054016] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
[    9.062481] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    9.090168] 00:02: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    9.119058] 00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[    9.127963] Non-volatile memory driver v1.3
[    9.142241] rdac: device handler registered
[    9.147030] hp_sw: device handler registered
[    9.151800] emc: device handler registered
[    9.156664] alua: device handler registered
[    9.161435] libphy: Fixed MDIO Bus: probed
[    9.166236] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    9.173531] ehci-pci: EHCI PCI platform driver
[    9.178819] ehci-pci 0000:00:1a.0: EHCI Host Controller
[    9.184761] ehci-pci 0000:00:1a.0: new USB bus registered, assigned bus number 1
[    9.193034] ehci-pci 0000:00:1a.0: debug port 2
[    9.202016] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[    9.209605] ehci-pci 0000:00:1a.0: irq 18, io mem 0x91d02000
[    9.222618] ehci-pci 0000:00:1a.0: USB 2.0 started, EHCI 1.00
[    9.229089] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.05
[    9.238307] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    9.246370] usb usb1: Product: EHCI Host Controller
[    9.251813] usb usb1: Manufacturer: Linux 5.5.0-rc7+ ehci_hcd
[    9.258227] usb usb1: SerialNumber: 0000:00:1a.0
[    9.263551] hub 1-0:1.0: USB hub found
[    9.267743] hub 1-0:1.0: 2 ports detected
[    9.272522] ehci-pci 0000:00:1d.0: EHCI Host Controller
[    9.278456] ehci-pci 0000:00:1d.0: new USB bus registered, assigned bus number 2
[    9.286723] ehci-pci 0000:00:1d.0: debug port 2
[    9.295676] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[    9.303260] ehci-pci 0000:00:1d.0: irq 18, io mem 0x91d01000
[    9.316627] ehci-pci 0000:00:1d.0: USB 2.0 started, EHCI 1.00
[    9.323096] usb usb2: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.05
[    9.332323] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    9.340376] usb usb2: Product: EHCI Host Controller
[    9.345819] usb usb2: Manufacturer: Linux 5.5.0-rc7+ ehci_hcd
[    9.352233] usb usb2: SerialNumber: 0000:00:1d.0
[    9.357531] hub 2-0:1.0: USB hub found
[    9.361721] hub 2-0:1.0: 2 ports detected
[    9.366346] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    9.373253] ohci-pci: OHCI PCI platform driver
[    9.378250] uhci_hcd: USB Universal Host Controller Interface driver
[    9.385523] xhci_hcd 0000:00:14.0: xHCI Host Controller
[    9.391456] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 3
[    9.400791] xhci_hcd 0000:00:14.0: hcc params 0x200077c1 hci version 0x100 quirks 0x0000000000009810
[    9.410993] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[    9.418801] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.05
[    9.428028] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    9.436083] usb usb3: Product: xHCI Host Controller
[    9.441519] usb usb3: Manufacturer: Linux 5.5.0-rc7+ xhci-hcd
[    9.447932] usb usb3: SerialNumber: 0000:00:14.0
[    9.453256] hub 3-0:1.0: USB hub found
[    9.457461] hub 3-0:1.0: 15 ports detected
[    9.463961] xhci_hcd 0000:00:14.0: xHCI Host Controller
[    9.469871] xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 4
[    9.478132] xhci_hcd 0000:00:14.0: Host supports USB 3.0 SuperSpeed
[    9.485160] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.05
[    9.494387] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    9.502442] usb usb4: Product: xHCI Host Controller
[    9.507886] usb usb4: Manufacturer: Linux 5.5.0-rc7+ xhci-hcd
[    9.514300] usb usb4: SerialNumber: 0000:00:14.0
[    9.519626] hub 4-0:1.0: USB hub found
[    9.523824] hub 4-0:1.0: 6 ports detected
[    9.529250] usbcore: registered new interface driver usbserial_generic
[    9.536544] usbserial: USB Serial support registered for generic
[    9.543313] i8042: PNP: No PS/2 controller found.
[    9.548667] mousedev: PS/2 mouse device common for all mice
[    9.555054] rtc_cmos 00:00: RTC can wake from S4
[    9.560421] rtc_cmos 00:00: registered as rtc0
[    9.565396] rtc_cmos 00:00: alarms up to one month, y3k, 114 bytes nvram, hpet irqs
[    9.573958] intel_pstate: Intel P-state driver initializing
[    9.592479] hid: raw HID events driver (C) Jiri Kosina
[    9.598278] usbcore: registered new interface driver usbhid
[    9.604500] usbhid: USB HID core driver
[    9.608972] drop_monitor: Initializing network drop monitor service
[    9.616054] Initializing XFRM netlink socket
[    9.621027] NET: Registered protocol family 10
[    9.626700] Segment Routing with IPv6
[    9.630800] NET: Registered protocol family 17
[    9.632589] usb 1-1: new high-speed USB device number 2 using ehci-pci
[    9.635932] mpls_gso: MPLS GSO support
[    9.657141] usb 1-1: New USB device found, idVendor=8087, idProduct=800a, bcdDevice= 0.05
[    9.658134] microcode: sig=0x406f1, pf=0x1, revision=0xb00002a
[    9.670472] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    9.673006] hub 1-1:1.0: USB hub found
[    9.673084] hub 1-1:1.0: 6 ports detected
[    9.694597] usb 2-1: new high-speed USB device number 2 using ehci-pci
[    9.707110] AVX2 version of gcm_enc/dec engaged.
[    9.709261] usb 2-1: New USB device found, idVendor=8087, idProduct=8002, bcdDevice= 0.05
[    9.712264] AES CTR mode by8 optimization enabled
[    9.726638] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    9.734892] hub 2-1:1.0: USB hub found
[    9.739254] hub 2-1:1.0: 8 ports detected
[    9.744506] sched_clock: Marking stable (6948364463, 2796134314)->(10206753075, -462254298)
[    9.754019] registered taskstats version 1
[    9.758605] Loading compiled-in X.509 certificates
[    9.764599] tsc: Refined TSC clocksource calibration: 2194.917 MHz
[    9.771520] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fa37107ca2, max_idle_ns: 440795258165 ns
[    9.782892] clocksource: Switched to clocksource tsc
[    9.786614] usb 3-9: new full-speed USB device number 2 using xhci_hcd
[    9.791003] Loaded X.503057087611c75a5378a4281'
[    9.806921] zswap: loaded using pool lzo/zbud
[    9.811993] pstore: Using crash dump compression: deflate
[    9.824249] Key type big_key registered
[    9.831466] Key type encrypted registered
[    9.835950] ima: No TPM chip found, activating TPM-bypass!
[    9.842077] ima: Allocated hash algorithm: sha1
[    9.847139] ima: No architecture policies found
[    9.852208] evm: Initialising EVM extended attributes:
[    9.857941] evm: security.selinux
[    9.861639] evm: security.ima
[    9.864949] evm: security.capability
[    9.868937] evm: HMAC attrs: 0x1
[    9.873491] rtc_cmos 00:00: setting system clock to 2020-02-04T07:33:26 UTC (1580801606)
[    9.889849] Freeing unused decrypted memory: 2040K
[    9.896311] Freeing unused kernel image (initmem) memory: 2404K
[    9.906661] Write protecting the kernel read-only dat04] Freeing unused kernel image (text/rodata gap) memory: 2040K
[    9.922559] F1720K
[    9.93[    9.948544] usb 3-9: New USB device found, idVendor=046b, idProduct=ff10, bcdDevice= 1.00
[   10.000427] usb 3-9: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[   10.000428] usb 3-9: Product: Virtual Keyboard and Mouse
[   10.000429] usb 3-9: Manufacturer: American Megatrends Inc.
[   10.000430] usb 3-9: SerialNumber: serial
[   10.001987] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0000:00/0000:00:14.0/usb3/3-9/3-9:1.0/0003:046B:FF10.0001/input/input1
[   10.002153] hid-generic 0003:046B:FF10.0001: input,hidraw0: USB HID v1.10 Keyboard [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0000:00:14.0-9/input0
[   10.002864] input: American Megatrends Inc. Virtual Keyboard and Mouse as /devices/pci0000:00/0000:00:14.0/usb3/3-9/3-9:1.1/0003:046B:FF10.0002/input/input2
[   10.002964] hid-generic 0003:046B:FF10.0002: input,hidraw1: USB HID v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on usb-0000:00:14.0-9/input1
[   10.089083] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy)
[   10.127672] systemd[1]: Detected architecture x86-64.
[   10.133317] systemd[1]: Running in initial RAM disk.

Welcome to Red Hat Enterprise Linux 8.2 Beta (Ootpa) dracut-049-60.git20191129.el8 (Initramfs)!

[   10.159684] systemd[1]: Set hostname to <intel-wildcatpass-07>.
[   10.210372] random: systemd: uninitialized urandom read (16 bytes read)
[   10.217830] systemd[1]: Listening on udev Control Socket.
[  OK  ] Listening on udev Control Socket.
[   10.231678] random: systemd: uninitialized urandom read (16 bytes read)
[   10.239074] systemd[1]: Reached target Slices.
[  OK  ] Reached target Slices.
[   10.249667] random: systemd: uninitialized urandom read (16 bytes read)
[   10.257059] systemd[1]: Reached target Timers.
[  OK  ] Reached target Timers.
[   10.267628] systemd[1]: Reached target Swap.
[  OK  ] Reached target Swap.
[   10.278710] systemd[1]: Listening on Journal Socket.
[  OK  ] Listening on Journal Socket.
[   10.293304] systemd[1]: Starting Setup Virtual Console...
         Starting Setup Virtual Console...
         Starting Create list of required st…ce nodes for the current kernel...
         Starting Apply Kernel Variables...
[  OK  ] Listening on udev Kernel Socket.
[  OK  ] Created slice system-systemd\x2dhibernate\x2dresume.slice.
[  OK  ] Started Hardware RNG Entropy Gatherer Daemon.
[  OK  ] Listening on Journal Socket (/dev/log).
         Starting Journal Service...
[  OK  ] Reached target Sockets.
[  OK  ] Started Setup Virtual Console.
[  OK  ] Started Create list of required sta…vice nodes for the current kernel.
[  OK  ] Started Apply Kernel Variables.
         Starting Create Static Device Nodes in /dev...
         Starting dracut cmdline hook...
[  OK  ] Started Create Static Device Nodes in /dev.
[  OK  ] Started dracut cmdline hook.
         Starting dracut pre-udev hook...
[   10.560243] device-mapper: uevent: version 1.0.3
[   10.565489] device-mapper: ioctl: 4.41.0-ioctl (2019-09-16) initialised
[  OK  ] Started dracut pre-udev hook.
         Starting udev Kernel Device Manager...
[  OK  ] Started Journal Service.
[  OK  ] Started udev Kernel Device Manager.
         Starting udev Coldplug all Devices...
         Mounting Kernel Configuration File System...
[  OK  ] Mounted Kernel Configuration File System.
[  OK  ] Started udev Coldplug all Devices.
         Starting Show Plymouth Boot Screen...
[   10.942415] dca service started, version 1.12.1
         Starting dracut initqueue hook...
[  OK  ] Started Show Plymouth Boot Screen.
[  OK  ] Started Forward Password Requests to Plymouth Directory Watch.
[  OK  ] Reached target Paths.
[   11.050149] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 5.1.0-k
[   11.058700] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
[   11.090777] ahci 0000:00:11.4: AHCI 0001.0300 32 slots 4 ports 6 Gbps 0xf impl SATA mode
[   11.099816] ahci 0000:00:11.4: flags: 64bit ncq led clo pio slum part ems apst 
[   11.120979] scsi host0: ahci
[   11.121138] scsi host1: ahci
[   11.121255] scsi host2: ahci
[   11.121367] scsi host3: ahci
[   11.121432] ata1: SATA max UDMA/133 abar m2048@0x91d00000 port 0x91d00100 irq 33
[   11.121434] ata2: SATA max UDMA/133 abar m2048@0x91d00000 port 0x91d00180 irq 33
[   11.121436] ata3: SATA max UDMA/133 abar m2048@0x91d00000 port 0x91d00200 irq 33
[   11.121438] ata4: SATA max UDMA/133 abar m2048@0x91d00000 port 0x91d00280 irq 33
[   11.121811] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[   11.121813] ahci 0000:00:1f.2: flags: 64bit ncq led clo pio slum part ems apst 
[   11.140089] scsi host4: ahci
[   11.140199] scsi host5: ahci
[   11.140306] scsi host6: ahci
[   11.140416] scsi host7: ahci
[   11.140554] scsi host8: ahci
[   11.140679] scsi host9: ahci
[   11.140730] ata5: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04100 irq 34
[   11.140733] ata6: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04180 irq 34
[   11.140735] ata7: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04200 irq 34
[   11.140738] ata8: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04280 irq 34
[   11.140739] ata9: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04300 irq 34
[   11.140741] ata10: SATA max UDMA/133 abar m2048@0x91d04000 port 0x91d04380 irq 34
[   11.178283] mgag200 0000:07:00.0: vgaarb: deactivate vga console
[   11.287181] Console: switching to colour dummy device 80x25
[   11.308567] [TTM] Zone  kernel: Available graphics memory: 32660860 KiB
[   11.315968] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
[   11.323253] [TTM] Initializing pool allocator
[   11.328177] [TTM] Initializing DMA pool allocator
[   11.363241] fbcon: mgag200drmfb (fb0) is primary device
[   11.386469] ixgbe 0000:03:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[   11.426151] ata4: SATA link down (SStatus 0 SControl 300)
[   11.426479] ata2: SATA link down (SStatus 0 SControl 300)
[   11.426713] ata1: SATA link down (SStatus 0 SControl 300)
[   11.426729] ata3: SATA link down (SStatus 0 SControl 300)
[   11.452783] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   11.453883] ata8: SATA link down (SStatus 0 SControl 300)
[   11.454549] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   11.455126] ata6: SATA link down (SStatus 0 SControl 300)
[   11.456200] ata7: SATA link down (SStatus 0 SControl 300)
[   11.456866] ata10: SATA link down (SStatus 0 SControl 300)
[   11.457490] ata9.00: ATAPI: DV-W28S-A, 9.2A, max UDMA/100
[   11.457888] ata5.00: ATA-9: INTEL SSDSC2BA800G3, 5DV10270, max UDMA/133
[   11.457917] ata5.00: 1562824368 sectors, multi 1: LBA48 NCQ (depth 32)
[   11.458454] ata5.00: configured for UDMA/133
[   11.458746] scsi 4:0:0:0: Direct-Access     ATA      INTEL SSDSC2BA80 0270 PQ: 0 ANSI: 5
[   11.459796] ata9.00: configured for UDMA/100
[   11.460708] scsi 8:0:0:0: CD-ROM            TEAC     DV-W28S-A        9.2A PQ: 0 ANSI: 5
[   11.464025] scsi 4:0:0:0: Attached scsi generic sg0 type 0
[   11.468842] ata5.00: Enabling discard_zeroes_data
[   11.468945] sd 4:0:0:0: [sda] 1562824368 512-byte logical blocks: (800 GB/745 GiB)
[   11.468947] sd 4:0:0:0: [sda] 4096-byte physical blocks
[   11.468972] sd 4:0:0:0: [sda] Write Protect is off
[   11.469021] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   11.483838] ata5.00: Enabling discard_zeroes_data
[   11.484459]  sda: sda1 sda2
[   11.485545] ata5.00: Enabling discard_zeroes_data
[   11.485621] scsi 8:0:0:0: Attached scsi generic sg1 type 5
[   11.485887] sd 4:0:0:0: [sda] Attached SCSI disk
[   11.487566] ixgbe 0000:03:00.0: 32.000 Gb/s available PCIe bandwidth (5 GT/s x8 link)
[   11.511560] ixgbe 0000:03:00.0: MAC: 3, PHY: 0, PBA No: 000000-000
[   11.511561] ixgbe 0000:03:00.0: 00:1e:67:d3:27:b0
[   11.516591] sr 8:0:0:0: [sr0] scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[   11.516592] cdrom: Uniform CD-ROM driver Revision: 3.20
[   11.566563] Console: switching to colour frame buffer device 128x48
[   11.705611] ixgbe 0000:03:00.0: Intel(R) 10 Gigabit Network Connection
[   11.705633] libphy: ixgbe-mdio: probed
[   11.717780] random: fast init done
[   11.867816] mgag200 0000:07:00.0: fb0: mgag200drmfb frame buffer device
[   11.922915] [drm] Initialized mgag200 1.0.0 20110418 for 0000:07:00.0 on minor 0
[   12.186351] ixgbe 0000:03:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[   12.258351] random: crng init done
[   12.262151] random: 7 urandom warning(s) missed due to ratelimiting
[   12.281564] ixgbe 0000:03:00.1: 32.000 Gb/s available PCIe bandwidth (5 GT/s x8 link)
[   12.313558] ixgbe 0000:03:00.1: MAC: 3, PHY: 0, PBA No: 000000-000
[   12.320453] ixgbe 0000:03:00.1: 00:1e:67:d3:27:b1
[   12.471572] ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
[   12.478899] libphy: ixgbe-mdio: probed
[   12.484687] ixgbe 0000:03:00.0 eno1: renamed from eth0
[   12.495966] ixgbe 0000:03:00.1 eno2: renamed from eth1
[  OK  ] Found device /dev/mapper/intel--wildcatpass--07-root.
[  OK  ] Reached target Initrd Root Device.
[  OK  ] Found device /dev/mapper/intel--wildcatpass--07-swap.
         Starting Resume from hibernation us…hel_intel--wildcatpass--07-swap...
[  OK  ] Started Resume from hibernation usi…/intel--wildcatpass--07-swap.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Create Volatile Files and Directories...
[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Basic System.
[  OK  ] Started dracut initqueue hook.
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
         Starting File System Check on /dev/…intel--wildcatpass--07-root...
[  OK  ] Started File System Check on /dev/m…/intel--wildcatpass--07-root.
         Mounting /sysroot...
[   12.975718] SGI XFS with ACLs, security attributes, quota, no debug enabled
[   12.987138] XFS (dm-0): Mounting V5 Filesystem
[   13.008174] XFS (dm-0): Ending clean mount
[  OK  ] Mounted /sysroot.
[  OK  ] Reached target Initrd Root File System.
         Starting Reload Configuration from the Real Root...
[  OK  ] Started Reload Configuration from the Real Root.
[  OK  ] Reached target Initrd File Systems.
[  OK  ] Reached target Initrd Default Target.
         Starting dracut pre-pivot and cleanup hook...
[  OK  ] Started dracut pre-pivot and cleanup hook.
         Starting Cleaning Up and Shutting Down Daemons...
[  OK  ] Stopped target Timers.
[  OK  ] Stopped dracut pre-pivot and cleanup hook.
[  OK  ] Stopped target Remote File Systems.
         Starting Setup Virtual Console...
         Starting Plymouth switch root service...
[  OK  ] Stopped target Initrd Default Target.
[  OK  ] Stopped target Initrd Root Device.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped target Swap.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
[  OK  ] Stopped target Local File Systems (Pre).
[  OK  ] Stopped target Sockets.
[   13.321682] systemd-journald[816]: Received SIGTERM from PID 1 (systemd).
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Slices.
         Stopping udev Kernel Device Manager...
[  OK  ] Stopped target Remote File Systems (Pre).
[   13.347698] printk: systemd: 19 output lines suppressed due to ratelimiting
[  OK  ] Stopped dracut initqueue hook.
[  OK  ] Stopped udev Coldplug all Devices.
[  OK  ] Started Cleaning Up and Shutting Down Daemons.
[  OK  ] Stopped udev Kernel Device Manager.
         Stopping Hardware RNG Entropy Gatherer Daemon...
[   13.386747] audit: type=1404 audit(1580801610.013:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=selinux res=1
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create list of required sta…vice nodes for the current kernel.
[  OK  ] Stopped dracut pre-udev hook.
[  OK  ] Stopped dracut cmdline hook.
[  OK  ] Closed udev Control Socket.
[  OK  ] Closed udev Kernel Socket.
         Starting Cleanup udevd DB...
[  OK  ] Stopped Hardware RNG Entropy Gatherer Daemon.
[  OK  ] Started Plymouth switch root service.
[  OK  ] Started Cleanup udevd DB.
[  OK  ] Started Setup Virtual Console.
[  OK  ] Reached target Switch Root.
         Starting Switch Root...
[   13.762409] SELinux:  Permission watch in class filesystem not defined in policy.
[   13.770767] SELinux:  Permission watch in class file not defined in policy.
[   13.778535] SELinux:  Permission watch_mount in class file not defined in policy.
[   13.786886] SELinux:  Permission watch_sb in class file not defined in policy.
[   13.794947] SELinux:  Permission watch_with_perm in class file not defined in policy.
[   13.803686] SELinux:  Permission watch_reads in class file not defined in policy.
[   13.812043] SELinux:  Permission watch in class dir not defined in policy.
[   13.819713] SELinux:  Permission watch_mount in class dir not defined in policy.
[   13.827967] SELinux:  Permission watch_sb in class dir not defined in policy.
[   13.835928] SELinux:  Permission watch_with_perm in class dir not defined in policy.
[   13.844571] SELinux:  Permission watch_reads in class dir not defined in policy.
[   13.852831] SELinux:  Permission watch in class lnk_file not defined in policy.
[   13.860994] SELinux:  Permission watch_mount in class lnk_file not defined in policy.
[   13.869733] SELinux:  Permission watch_sb in class lnk_file not defined in policy.
[   13.878181] SELinux:  Permission watch_with_perm in class lnk_file not defined in policy.
[   13.887308] SELinux:  Permission watch_reads in class lnk_file not defined in policy.
[   13.896051] SELinux:  Permission watch in class chr_file not defined in policy.
[   13.904210] SELinux:  Permission watch_mount in class chr_file not defined in policy.
[   13.912950] SELinux:  Permission watch_sb in class chr_file not defined in policy.
[   13.921397] SELinux:  Permission watch_with_perm in class chr_file not defined in policy.
[   13.930524] SELinux:  Permission watch_reads in class chr_file not defined in policy.
[   13.939266] SELinux:  Permission watch in class blk_file not defined in policy.
[   13.947413] SELinux:  Permission watch_mount in class blk_file not defined in policy.
[   13.956154] SELinux:  Permission watch_sb in class blk_file not defined in policy.
[   13.964604] SELinux:  Permission watch_with_perm in class blk_file not defined in policy.
[   13.973731] SELinux:  Permission watch_reads in class blk_file not defined in policy.
[   13.982474] SELinux:  Permission watch in class sock_file not defined in policy.
[   13.990728] SELinux:  Permission watch_mount in class sock_file not defined in policy.
[   13.999564] SELinux:  Permission watch_sb in class sock_file not defined in policy.
[   14.008109] SELinux:  Permission watch_with_perm in class sock_file not defined in policy.
[   14.017333] SELinux:  Permission watch_reads in class sock_file not defined in policy.
[   14.026170] SELinux:  Permission watch in class fifo_file not defined in policy.
[   14.034426] SELinux:  Permission watch_mount in class fifo_file not defined in policy.
[   14.043262] SELinux:  Permission watch_sb in class fifo_file not defined in policy.
[   14.051807] SELinux:  Permission watch_with_perm in class fifo_file not defined in policy.
[   14.061031] SELinux:  Permission watch_reads in class fifo_file not defined in policy.
[   14.069980] SELinux:  Class perf_event not defined in policy.
[   14.076394] SELinux: the above unknown classes and permissions will be allowed
[   14.084457] SELinux:  policy capability network_peer_controls=1
[   14.091060] SELinux:  policy capability open_perms=1
[   14.096598] SELinux:  policy capability extended_socket_class=1
[   14.103201] SELinux:  policy capability always_check_network=0
[   14.109708] SELinux:  policy capability cgroup_seclabel=1
[   14.115730] SELinux:  policy capability nnp_nosuid_transition=1
[   14.148654] audit: type=1403 audit(1580801610.775:3): auid=4294967295 ses=4294967295 lsm=selinux res=1
[   14.150117] systemd[1]: Successfully loaded SELinux policy in 764.203ms.
[   14.171887] systemd[1]: RTC configured in localtime, applying delta of -300 minutes to system time.
[   14.238225] systemd[1]: Relabelled /dev, /run and /sys/fs/cgroup in 39.372ms.
[   14.247754] systemd[1]: systemd 239 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=legacy)
[   14.286717] systemd[1]: Detected architecture x86-64.

Welcome to Red Hat Enterprise Linux 8.2 Beta (Ootpa)!

[   14.293510] systemd[1]: Set hostname to <intel-wildcatpass-07>.

[   14.405714] systemd[1]: Stopped Switch Root.
[  OK  ] Stopped Switch Root.
[   14.410848] systemd[1]: systemd-journald.service: Service has no hold-off time (RestartSec=0), scheduling restart.
[   14.423860] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
[   14.435337] systemd[1]: Stopped Journal Service.
[  OK  ] Stopped Journal Service.
[   14.442387] systemd[1]: Starting Journal Service...
[   14.449215] systemd[1]: Created slice system-getty.slice.
[   14.457412] systemd[1]: Starting Create list of required static device nodes for the current kernel...
         Starting Journal Service...
[  OK  ] Created slice system-getty.slice.
         Starting Create list of required st…ce nodes for the current kernel...
[  OK  ] Set up automount Arbitrary Executab…rmats File System Automount Point.
[   14.491624] xfs filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
[  OK  ] Listening on initctl Compatibility Named Pipe.
[   14.505614] Adding 32993276k swap on /dev/mapper/intel--wildcatpass--07-swap.  Priority:-2 extents:1 across:32993276k SSFS
[  OK  ] Stopped target Switch Root.
[  OK  ] Listening on Process Core Dump Socket.
         Starting Setup Virtual Console...
         Mounting Huge Pages File System...
[  OK  ] Created slice system-serial\x2dgetty.slice.
[  OK  ] Listening on Device-mapper event daemon FIFOs.
[  OK  ] Listening on LVM2 poll daemon socket.
[  OK  ] Listening on udev Kernel Socket.
[  OK  ] Stopped File System Check on Root Device.
         Starting Remount Root and Kernel File Systems...
[  OK  ] Stopped target Initrd File Systems.
[  OK  ] Reached target Remote File Systems.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
[  OK  ] Reached target Paths.
         Mounting POSIX Message Queue File System...
[  OK  ] Created slice system-sshd\x2dkeygen.slice.
[  OK  ] Reached target Local Encrypted Volumes.
         Mounting Kernel Debug File System...
[  OK  ] Stopped target Initrd Root File System.
         Activating swap /dev/mapper/intel--wildcatpass--07-swap...
[  OK  ] Listening on udev Control Socket.
         Starting udev Coldplug all Devices...
[  OK  ] Created slice User and Session Slice.
[  OK  ] Reached target Slices.
         Starting Apply Kernel Variables...
         Starting Read and set NIS domainname from /etc/sysconfig/network...
         Starting Monitoring of LVM2 mirrors…ng dmeventd or progress polling...
[  OK  ] Started Create list of required sta…vice nodes for the current kernel.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Remount Root and Kernel File Systems.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Kernel Debug File System.
[  OK  ] Activated swap /dev/mapper/intel--wildcatpass--07-swap.
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Started Read and set NIS domainname from /etc/sysconfig/network.
[  OK  ] Reached target Swap.
         Starting Load/Save Random Seed...
         Starting Create Static Device Nodes in /dev...
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create Static Device Nodes in /dev.
         Starting udev Kernel Device Manager...
[  OK  ] Started Setup Virtual Console.
[  OK  ] Started udev Coldplug all Devices.
[   42.110978] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
[   42.110978] Modules linked in: ip_tables xfs libcrc32c sr_mod cdrom sd_mod sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci libahci ixgbe drm crc32c_intel libata mdio dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
[   42.110986] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.110986] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.110987] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1c0
[   42.110988] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[   42.110988] RSP: 0018:ffffbbe207a7bc48 EFLAGS: 00000002
[   42.110989] RAX: 0000000000f80101 RBX: ffffffffa1576e80 RCX: 0000000000000000
[   42.110990] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa1e95660
[   42.110990] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000b
[   42.110991] R10: ffffa075df5dcf80 R11: ffffffffa0ebfda0 R12: ffffffffa1e95660
[   42.110991] R13: ffffffffa1e97680 R14: ffffffffa17197a0 R15: 0000000000000047
[   42.110991] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
[   42.110992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.110992] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
[   42.110993] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   42.110993] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   42.110993] Call Trace:
[   42.110993]  _raw_spin_lock+0x1a/0x20
[   42.110994]  console_unlock+0x9e/0x450
[   42.110994]  bust_spinlocks+0x16/0x30
[   42.110994]  oops_end+0x33/0xc0
[   42.110995]  general_protection+0x32/0x40
[   42.110995] RIP: 0010:copy_data+0xf2/0x1e0
[   42.110995] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
[   42.110996] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
[   42.110996] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
[   42.110997] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
[   42.110997] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
[   42.110997] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
[   42.110998] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
[   42.110998]  _prb_read_valid+0xd8/0x190
[   42.110998]  prb_read_valid+0x15/0x20
[   42.110999]  devkmsg_read+0x9d/0x2a0
[   42.110999]  vfs_read+0x91/0x140
[   42.110999]  ksys_read+0x59/0xd0
[   42.111000]  do_syscall_64+0x55/0x1b0
[   42.111000]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   42.111000] RIP: 0033:0x7f7c55740b62
[   42.111001] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[   42.111001] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   42.111002] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
[   42.111002] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
[   42.111002] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
[   42.111003] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
[   42.111003] R13: 00007ffe95f4e910 R14: 0000000000000000 R15: 0000000000000000
[   42.111003] Kernel panic - not syncing: Hard LOCKUP
[   42.111004] Shutting down cpus with NMI
[   42.111004] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   42.111005] general protection fault: 0000 [#1] SMP PTI
[   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
[   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
[   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
[   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
[   42.111007] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
[   42.111007] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
[   42.111008] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
[   42.111008] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
[   42.111008] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
[   42.111009] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
[   42.111009] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
[   42.111010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.111010] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
[   42.111011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   42.111011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   42.111012] Call Trace:
[   42.111012]  _prb_read_valid+0xd8/0x190
[   42.111012]  prb_read_valid+0x15/0x20
[   42.111013]  devkmsg_read+0x9d/0x2a0
[   42.111013]  vfs_read+0x91/0x140
[   42.111013]  ksys_read+0x59/0xd0
[   42.111014]  do_syscall_64+0x55/0x1b0
[   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   42.111014] RIP: 0033:0x7f7c55740b62
[   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
[   42.111015] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   42.111016] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
[   42.111016] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
[   42.111017] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
[   42.111017] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
[   42.111017] R13: 00007ffe95f4e910 R14: 0000000000000000 R15: 0000000000000000
[   42.111017] Modules linked in: ip_tables xfs libcrc32c sr_mod cdrom sd_mod sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper drm_ttm_helper ttm ahci libahci ixgbe drm crc32c_intel libata mdio dca i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  4:25 ` [PATCH 0/2] printk: replace ringbuffer lijiang
@ 2020-02-05  4:42   ` Sergey Senozhatsky
  2020-02-05  4:48   ` Sergey Senozhatsky
  1 sibling, 0 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05  4:42 UTC (permalink / raw)
  To: lijiang
  Cc: John Ogness, Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 12:25), lijiang wrote:
> Hi, John Ogness
> 
> Thank you for your great efforts improving this patch series.
> 
> I'm not sure whether I missed anything, or whether there are other related patches that need to be applied.
> 
> After applying this patch series, the NMI watchdog detected a hard lockup, which prevented the kernel from booting; please refer
> to the following call trace. I have attached the complete kernel log.

I'm also having some problems running the code on my laptop, but maybe
I did something wrong while applying patch 0002 (which did not apply
cleanly). I will look into it more.

	-ss


* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  4:25 ` [PATCH 0/2] printk: replace ringbuffer lijiang
  2020-02-05  4:42   ` Sergey Senozhatsky
@ 2020-02-05  4:48   ` Sergey Senozhatsky
  2020-02-05  5:02     ` Sergey Senozhatsky
  1 sibling, 1 reply; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05  4:48 UTC (permalink / raw)
  To: lijiang
  Cc: John Ogness, Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 12:25), lijiang wrote:
[..]
> [   42.111004] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [   42.111005] general protection fault: 0000 [#1] SMP PTI
> [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
> [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
> [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
> [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
> [   42.111007] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
> [   42.111007] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
> [   42.111008] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
> [   42.111008] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
> [   42.111008] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
> [   42.111009] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
> [   42.111009] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
> [   42.111010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   42.111010] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
> [   42.111011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   42.111011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   42.111012] Call Trace:
> [   42.111012]  _prb_read_valid+0xd8/0x190
> [   42.111012]  prb_read_valid+0x15/0x20
> [   42.111013]  devkmsg_read+0x9d/0x2a0
> [   42.111013]  vfs_read+0x91/0x140
> [   42.111013]  ksys_read+0x59/0xd0
> [   42.111014]  do_syscall_64+0x55/0x1b0
> [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   42.111014] RIP: 0033:0x7f7c55740b62
> [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
> [   42.111015] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [   42.111016] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
> [   42.111016] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
> [   42.111017] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
> [   42.111017] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0

So there is a general protection fault. That's the type of problem that
kills the boot for me as well (different backtrace, though).

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  4:48   ` Sergey Senozhatsky
@ 2020-02-05  5:02     ` Sergey Senozhatsky
  2020-02-05  5:38       ` lijiang
  0 siblings, 1 reply; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05  5:02 UTC (permalink / raw)
  To: lijiang
  Cc: John Ogness, Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel, Sergey Senozhatsky

On (20/02/05 13:48), Sergey Senozhatsky wrote:
> On (20/02/05 12:25), lijiang wrote:
> [..]
> > [   42.111004] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > [   42.111005] general protection fault: 0000 [#1] SMP PTI
> > [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
> > [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
> > [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
> > [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
> > [   42.111007] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
> > [   42.111007] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
> > [   42.111008] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
> > [   42.111008] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
> > [   42.111008] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
> > [   42.111009] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
> > [   42.111009] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
> > [   42.111010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   42.111010] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
> > [   42.111011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   42.111011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [   42.111012] Call Trace:
> > [   42.111012]  _prb_read_valid+0xd8/0x190
> > [   42.111012]  prb_read_valid+0x15/0x20
> > [   42.111013]  devkmsg_read+0x9d/0x2a0
> > [   42.111013]  vfs_read+0x91/0x140
> > [   42.111013]  ksys_read+0x59/0xd0
> > [   42.111014]  do_syscall_64+0x55/0x1b0
> > [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [   42.111014] RIP: 0033:0x7f7c55740b62
> > [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
> > [   42.111015] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > [   42.111016] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
> > [   42.111016] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
> > [   42.111017] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
> > [   42.111017] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
> 
> So there is a General protection fault. That's the type of a problem that
> kills the boot for me as well (different backtrace, tho).

Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  5:02     ` Sergey Senozhatsky
@ 2020-02-05  5:38       ` lijiang
  2020-02-05  6:36         ` Sergey Senozhatsky
  0 siblings, 1 reply; 58+ messages in thread
From: lijiang @ 2020-02-05  5:38 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: John Ogness, Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel


> On (20/02/05 13:48), Sergey Senozhatsky wrote:
>> On (20/02/05 12:25), lijiang wrote:
>> [..]
>>> [   42.111004] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>> [   42.111005] general protection fault: 0000 [#1] SMP PTI
>>> [   42.111005] CPU: 15 PID: 1395 Comm: systemd-journal Not tainted 5.5.0-rc7+ #4
>>> [   42.111005] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.6024.071720181717 07/17/2018
>>> [   42.111006] RIP: 0010:copy_data+0xf2/0x1e0
>>> [   42.111006] Code: eb 08 49 83 c4 08 0f 84 8e 00 00 00 4c 89 74 24 08 4c 89 cd 41 89 d6 44 89 44 24 04 49 39 db 0f 87 c6 00 00 00 4d 85 c9 74 43 <41> c7 01 00 00 00 00 48 85 db 74 37 4c 89 e7 48 89 da 41 bf 01 00
>>> [   42.111007] RSP: 0018:ffffbbe207a7bd80 EFLAGS: 00010002
>>> [   42.111007] RAX: ffffa075d44ca000 RBX: 00000000000000a8 RCX: fffffffffff000b0
>>> [   42.111008] RDX: 00000000000000a8 RSI: 00000fffffffff01 RDI: ffffffffa1456e00
>>> [   42.111008] RBP: 0801364600307073 R08: 0000000000002000 R09: 0801364600307073
>>> [   42.111008] R10: fffffffffff00000 R11: 00000000000000a8 R12: ffffffffa1e98330
>>> [   42.111009] R13: 00000000d7efbe00 R14: 00000000000000a8 R15: 00000000ffffc000
>>> [   42.111009] FS:  00007f7c5642a980(0000) GS:ffffa075df5c0000(0000) knlGS:0000000000000000
>>> [   42.111010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   42.111010] CR2: 00007ffe95f4c4c0 CR3: 000000084fbfc004 CR4: 00000000003606e0
>>> [   42.111011] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [   42.111011] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [   42.111012] Call Trace:
>>> [   42.111012]  _prb_read_valid+0xd8/0x190
>>> [   42.111012]  prb_read_valid+0x15/0x20
>>> [   42.111013]  devkmsg_read+0x9d/0x2a0
>>> [   42.111013]  vfs_read+0x91/0x140
>>> [   42.111013]  ksys_read+0x59/0xd0
>>> [   42.111014]  do_syscall_64+0x55/0x1b0
>>> [   42.111014]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [   42.111014] RIP: 0033:0x7f7c55740b62
>>> [   42.111015] Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3 0f 1e fa 8b 05 e6 d8 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 41 54 49 89 d4 55 48 89
>>> [   42.111015] RSP: 002b:00007ffe95f4c4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>> [   42.111016] RAX: ffffffffffffffda RBX: 00007ffe95f4e500 RCX: 00007f7c55740b62
>>> [   42.111016] RDX: 0000000000002000 RSI: 00007ffe95f4c4b0 RDI: 0000000000000008
>>> [   42.111017] RBP: 0000000000000000 R08: 0000000000000100 R09: 0000000000000003
>>> [   42.111017] R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffe95f4c4b0
>>
>> So there is a General protection fault. That's the type of a problem that
>> kills the boot for me as well (different backtrace, tho).
> 
> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
> 

Yes. These two options are enabled.

CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y

Thanks.

> 	-ss
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  5:38       ` lijiang
@ 2020-02-05  6:36         ` Sergey Senozhatsky
  2020-02-05  9:00           ` John Ogness
  2020-02-05  9:36           ` lijiang
  0 siblings, 2 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05  6:36 UTC (permalink / raw)
  To: lijiang, John Ogness
  Cc: Sergey Senozhatsky, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 13:38), lijiang wrote:
> > On (20/02/05 13:48), Sergey Senozhatsky wrote:
> >> On (20/02/05 12:25), lijiang wrote:

[..]

> >>
> >> So there is a General protection fault. That's the type of a problem that
> >> kills the boot for me as well (different backtrace, tho).
> > 
> > Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
> > 
> 
> Yes. These two options are enabled.
> 
> CONFIG_RELOCATABLE=y
> CONFIG_RANDOMIZE_BASE=y

So KASLR kills the boot for me. So does KASAN.

John, do you see any of these problems on your test machine?

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  6:36         ` Sergey Senozhatsky
@ 2020-02-05  9:00           ` John Ogness
  2020-02-05  9:28             ` Sergey Senozhatsky
                               ` (2 more replies)
  2020-02-05  9:36           ` lijiang
  1 sibling, 3 replies; 58+ messages in thread
From: John Ogness @ 2020-02-05  9:00 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: lijiang, Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>>>> So there is a General protection fault. That's the type of a
>>>> problem that kills the boot for me as well (different backtrace,
>>>> tho).
>>> 
>>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR)
>>> enabled?
>> 
>> Yes. These two options are enabled.
>> 
>> CONFIG_RELOCATABLE=y
>> CONFIG_RANDOMIZE_BASE=y
>
> So KASLR kills the boot for me. So does KASAN.

Sergey, thanks for looking into this already!

> John, do you see any of these problems on your test machine?

For x86 I have only been using qemu. (For hardware tests I use arm64-smp
in order to verify memory barriers.) With qemu-x86_64 I am unable to
reproduce the problem.

Lianbo, thanks for the report. Can you share your boot args? Anything
special in there (like log_buf_len=, earlyprintk, etc)?

Also, could you share your CONFIG_LOG_* and CONFIG_PRINTK_* options?

I will move to bare metal x86_64 and hopefully see it as well.

John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  9:00           ` John Ogness
@ 2020-02-05  9:28             ` Sergey Senozhatsky
  2020-02-05 10:19             ` lijiang
  2020-02-05 11:07             ` Sergey Senozhatsky
  2 siblings, 0 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05  9:28 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, lijiang, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 10:00), John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
> >>>> So there is a General protection fault. That's the type of a
> >>>> problem that kills the boot for me as well (different backtrace,
> >>>> tho).
> >>> 
> >>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR)
> >>> enabled?
> >> 
> >> Yes. These two options are enabled.
> >> 
> >> CONFIG_RELOCATABLE=y
> >> CONFIG_RANDOMIZE_BASE=y
> >
> > So KASLR kills the boot for me. So does KASAN.
> 
> Sergey, thanks for looking into this already!

Hey, no prob! I can't see how or why that would be KASLR-related,
and most likely it's not. Probably we just hit some fault sooner
with it enabled.

So far it seems that reads from /dev/kmsg are causing problems
on my laptop, but it's a bit hard to debug.

Nothing printk-related in my boot params.

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  6:36         ` Sergey Senozhatsky
  2020-02-05  9:00           ` John Ogness
@ 2020-02-05  9:36           ` lijiang
  1 sibling, 0 replies; 58+ messages in thread
From: lijiang @ 2020-02-05  9:36 UTC (permalink / raw)
  To: Sergey Senozhatsky, John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

> On (20/02/05 13:38), lijiang wrote:
>>> On (20/02/05 13:48), Sergey Senozhatsky wrote:
>>>> On (20/02/05 12:25), lijiang wrote:
> 
> [..]
> 
>>>>
>>>> So there is a General protection fault. That's the type of a problem that
>>>> kills the boot for me as well (different backtrace, tho).
>>>
>>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR) enabled?
>>>
>>
>> Yes. These two options are enabled.
>>
>> CONFIG_RELOCATABLE=y
>> CONFIG_RANDOMIZE_BASE=y
> 
> So KASLR kills the boot for me. So does KASAN.
> 
On my side, after adding the 'nokaslr' option to the kernel command line, I still hit
the previously mentioned problem; in the end, the kernel failed to boot.

Thanks.

> John, do you see any of these problems on your test machine?
> 
> 	-ss
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  9:00           ` John Ogness
  2020-02-05  9:28             ` Sergey Senozhatsky
@ 2020-02-05 10:19             ` lijiang
  2020-02-05 16:12               ` John Ogness
  2020-02-05 11:07             ` Sergey Senozhatsky
  2 siblings, 1 reply; 58+ messages in thread
From: lijiang @ 2020-02-05 10:19 UTC (permalink / raw)
  To: John Ogness, Sergey Senozhatsky
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>>>>> So there is a General protection fault. That's the type of a
>>>>> problem that kills the boot for me as well (different backtrace,
>>>>> tho).
>>>>
>>>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR)
>>>> enabled?
>>>
>>> Yes. These two options are enabled.
>>>
>>> CONFIG_RELOCATABLE=y
>>> CONFIG_RANDOMIZE_BASE=y
>>
>> So KASLR kills the boot for me. So does KASAN.
> 
> Sergey, thanks for looking into this already!
> 
>> John, do you see any of these problems on your test machine?
> 
> For x86 I have only been using qemu. (For hardware tests I use arm64-smp
> in order to verify memory barriers.) With qemu-x86_64 I am unable to
> reproduce the problem.
> 
> Lianbo, thanks for the report. Can you share your boot args? Anything
> special in there (like log_buf_len=, earlyprintk, etc)?
> 
Thanks for your response. Here is my kernel command line:

Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.5.0-rc7+ root=/dev/mapper/intel--wildcatpass--07-root ro crashkernel=512M resume=/dev/mapper/intel--wildcatpass--07-swap rd.lvm.lv=intel-wildcatpass-07/root rd.lvm.lv=intel-wildcatpass-07/swap console=ttyS0,115200n81

BTW: I put the complete kernel log in my last email reply; you can check the attachment if needed.

> Also, could you share your CONFIG_LOG_* and CONFIG_PRINTK_* options?
> 
Sure. Please refer to the following:

[root@intel-wildcatpass-07 linux]# grep -nr "CONFIG_LOG_" .config 
134:CONFIG_LOG_BUF_SHIFT=20
135:CONFIG_LOG_CPU_MAX_BUF_SHIFT=12

[root@intel-wildcatpass-07 linux]# grep -nr "CONFIG_PRINTK_" .config 
136:CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
207:CONFIG_PRINTK_NMI=y
7758:CONFIG_PRINTK_TIME=y
7759:# CONFIG_PRINTK_CALLER is not set

Do you have any suggestions about the default sizes for the CONFIG_LOG_* and CONFIG_PRINTK_* options?

Thanks.
Lianbo

> I will move to bare metal x86_64 and hopefully see it as well.
> 
> John
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05  9:00           ` John Ogness
  2020-02-05  9:28             ` Sergey Senozhatsky
  2020-02-05 10:19             ` lijiang
@ 2020-02-05 11:07             ` Sergey Senozhatsky
  2020-02-05 15:48               ` John Ogness
  2 siblings, 1 reply; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-05 11:07 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, lijiang, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 10:00), John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
> >>>> So there is a General protection fault. That's the type of a
> >>>> problem that kills the boot for me as well (different backtrace,
> >>>> tho).
> >>> 
> >>> Do you have CONFIG_RELOCATABLE and CONFIG_RANDOMIZE_BASE (KASLR)
> >>> enabled?
> >> 
> >> Yes. These two options are enabled.
> >> 
> >> CONFIG_RELOCATABLE=y
> >> CONFIG_RANDOMIZE_BASE=y
> >
> > So KASLR kills the boot for me. So does KASAN.
> 
> Sergey, thanks for looking into this already!
> 

So I hacked the system a bit.

3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
Call Trace:>
 dump_stack+0x76/0xa0>
 ? copy_data+0x129/0x220>
 __kasan_report.cold+0x5/0x3b>
 ? get_page_from_freelist+0x1224/0x1490>
 ? copy_data+0x129/0x220>
 copy_data+0x129/0x220>
 _prb_read_valid+0x1a0/0x330>
 ? prb_first_seq+0xe0/0xe0>
 ? __might_sleep+0x2f/0xd0>
 ? __zone_watermark_ok+0x180/0x180>
 ? ___might_sleep+0xbe/0xe0>
 prb_read_valid+0x4f/0x60>
 ? _prb_read_valid+0x330/0x330>
 devkmsg_read+0x12e/0x3d0>
 ? __mod_node_page_state+0x1a/0xa0>
 ? info_print_ext_header.constprop.0+0x120/0x120>
 ? __lru_cache_add+0x16c/0x190>
 ? __handle_mm_fault+0x1097/0x1f60>
 vfs_read+0xdc/0x200>
 ksys_read+0xa0/0x130>
 ? kernel_write+0xb0/0xb0>
 ? up_read+0x56/0x130>
 do_syscall_64+0xa0/0x520>
 ? syscall_return_slowpath+0x210/0x210>
 ? do_page_fault+0x399/0x4fa>
 entry_SYSCALL_64_after_hwframe+0x44/0xa9>
RIP: 0033:0x7ff5f39813f2>
Code: c0 e9 c2 fe ff ff 50 48 8d 3d 9a 0d 0a 00 e8 95 ed 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24>
RSP: 002b:00007ffc47b3ee58 EFLAGS: 0000024c ORIG_RAX: 0000000000000000>
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007ff5f39813f2>
RDX: 0000000000020000 RSI: 00007ff5f3588000 RDI: 0000000000000003>
RBP: 00007ff5f3588000 R08: 00007ff5f3587010 R09: 0000000000000000>
R10: 0000000000000022 R11: 0000000000000246 R12: 000055f9c8a81c00>
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000>

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 11:07             ` Sergey Senozhatsky
@ 2020-02-05 15:48               ` John Ogness
  2020-02-05 19:29                 ` Joe Perches
                                   ` (4 more replies)
  0 siblings, 5 replies; 58+ messages in thread
From: John Ogness @ 2020-02-05 15:48 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, lijiang, Petr Mladek, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
> 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
> 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>

The problem was due to an uninitialized pointer.

Very recently the ringbuffer API was expanded so that it could
optionally count lines in a record. This made it possible for me to
implement record_print_text_inline(), which can do all the kmsg_dump
multi-line madness without requiring a temporary buffer. Rather than
passing an extra argument around for the optional line count, I added
the text_line_count pointer to the printk_record struct. And since line
counting is rarely needed, it is only performed if text_line_count is
non-NULL.

I overlooked that devkmsg_open() sets up a printk_record, and so I
missed adding the NULL initialization of text_line_count there. There
should be an initializer function/macro to avoid this danger.

John Ogness

The quick fixup:

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index d0d24ee1d1f4..5ad67ff60cd9 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 	user->record.text_buf_size = sizeof(user->text_buf);
 	user->record.dict_buf = &user->dict_buf[0];
 	user->record.dict_buf_size = sizeof(user->dict_buf);
+	user->record.text_line_count = NULL;
 
 	logbuf_lock_irq();
 	user->seq = prb_first_seq(prb);

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 10:19             ` lijiang
@ 2020-02-05 16:12               ` John Ogness
  2020-02-06  9:12                 ` lijiang
  2020-02-13 13:07                 ` Petr Mladek
  0 siblings, 2 replies; 58+ messages in thread
From: John Ogness @ 2020-02-05 16:12 UTC (permalink / raw)
  To: lijiang
  Cc: Sergey Senozhatsky, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On 2020-02-05, lijiang <lijiang@redhat.com> wrote:
> Do you have any suggestions about the size of CONFIG_LOG_* and
> CONFIG_PRINTK_* options by default?

The new printk implementation consumes more than double the memory that
the current printk implementation requires. This is because dictionaries
and meta-data are now stored separately.

If the old defaults (LOG_BUF_SHIFT=17 LOG_CPU_MAX_BUF_SHIFT=12) were
chosen because they are maximally acceptable defaults, then the defaults
should be reduced by 1 so that the final size is "similar" to the
current implementation.

If instead the defaults are left as-is, a machine with fewer than 64 CPUs
will reserve 336KiB for printk information (128KiB text, 128KiB
dictionary, 80KiB meta-data).

It might also be desirable to reduce the dictionary size (maybe 1/4 the
size of text?). However, since the new printk implementation allows for
non-intrusive dictionaries, we might see their usage increase and start
to be as large as the messages themselves.

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 15:48               ` John Ogness
@ 2020-02-05 19:29                 ` Joe Perches
  2020-02-06  6:31                 ` Sergey Senozhatsky
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 58+ messages in thread
From: Joe Perches @ 2020-02-05 19:29 UTC (permalink / raw)
  To: John Ogness, Sergey Senozhatsky
  Cc: Sergey Senozhatsky, lijiang, Petr Mladek, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Wed, 2020-02-05 at 16:48 +0100, John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
> > 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
> > 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
> 
> The problem was due to an uninitialized pointer.
> 
> Very recently the ringbuffer API was expanded so that it could
> optionally count lines in a record. This made it possible for me to
> implement record_print_text_inline(), which can do all the kmsg_dump
> multi-line madness without requiring a temporary buffer. Rather than
> passing an extra argument around for the optional line count, I added
> the text_line_count pointer to the printk_record struct. And since line
> counting is rarely needed, it is only performed if text_line_count is
> non-NULL.
> 
> I oversaw that devkmsg_open() setup a printk_record and so I did not see
> to add the extra NULL initialization of text_line_count. There should be
> be an initializer function/macro to avoid this danger.
> 
> John Ogness
> 
> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
[]
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  	user->record.text_buf_size = sizeof(user->text_buf);
>  	user->record.dict_buf = &user->dict_buf[0];
>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> +	user->record.text_line_count = NULL;

Probably better to change the kmalloc to kzalloc.

 	user = kzalloc(sizeof(struct devkmsg_user), GFP_KERNEL);



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 15:48               ` John Ogness
  2020-02-05 19:29                 ` Joe Perches
@ 2020-02-06  6:31                 ` Sergey Senozhatsky
  2020-02-06  7:30                 ` lijiang
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-06  6:31 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Petr Mladek,
	Peter Zijlstra, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/05 16:48), John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
> > 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
> > 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
> 
> The problem was due to an uninitialized pointer.
> 
> Very recently the ringbuffer API was expanded so that it could
> optionally count lines in a record. This made it possible for me to
> implement record_print_text_inline(), which can do all the kmsg_dump
> multi-line madness without requiring a temporary buffer. Rather than
> passing an extra argument around for the optional line count, I added
> the text_line_count pointer to the printk_record struct. And since line
> counting is rarely needed, it is only performed if text_line_count is
> non-NULL.
> 
> I oversaw that devkmsg_open() setup a printk_record and so I did not see
> to add the extra NULL initialization of text_line_count. There should be
> be an initializer function/macro to avoid this danger.
> 
> John Ogness
> 
> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d0d24ee1d1f4..5ad67ff60cd9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  	user->record.text_buf_size = sizeof(user->text_buf);
>  	user->record.dict_buf = &user->dict_buf[0];
>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> +	user->record.text_line_count = NULL;
>  
>  	logbuf_lock_irq();
>  	user->seq = prb_first_seq(prb);

Yes. That should do. It seems that /dev/kmsg reads/writes happen very early in
my system, and all the backtraces I saw were from completely unrelated paths -
either a NULL deref at sys_clone()->do_fork()->copy_creds()->prepare_creds(),
or a general protection fault in sys_keyctl()->join_session_keyring()->prepare_creds(),
or some weird crashes in ext4. And so on.

I see some more unexplainable lockups on one of my test boards, but I
can't provide more details at this time. They might not be related to the
patch set. I need to investigate further.

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 15:48               ` John Ogness
  2020-02-05 19:29                 ` Joe Perches
  2020-02-06  6:31                 ` Sergey Senozhatsky
@ 2020-02-06  7:30                 ` lijiang
  2020-02-07  1:40                 ` Steven Rostedt
  2020-02-14 15:56                 ` Petr Mladek
  4 siblings, 0 replies; 58+ messages in thread
From: lijiang @ 2020-02-06  7:30 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Petr Mladek,
	Peter Zijlstra, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On 2020-02-05 23:48, John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
>> 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
>> 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
> 
> The problem was due to an uninitialized pointer.
> 
> Very recently the ringbuffer API was expanded so that it could
> optionally count lines in a record. This made it possible for me to
> implement record_print_text_inline(), which can do all the kmsg_dump
> multi-line madness without requiring a temporary buffer. Rather than
> passing an extra argument around for the optional line count, I added
> the text_line_count pointer to the printk_record struct. And since line
> counting is rarely needed, it is only performed if text_line_count is
> non-NULL.
> 
> I oversaw that devkmsg_open() setup a printk_record and so I did not see
> to add the extra NULL initialization of text_line_count. There should be
> be an initializer function/macro to avoid this danger.
> 
Good findings. Thanks for the quick fixup, it works well.

Lianbo

> John Ogness
> 
> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d0d24ee1d1f4..5ad67ff60cd9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  	user->record.text_buf_size = sizeof(user->text_buf);
>  	user->record.dict_buf = &user->dict_buf[0];
>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> +	user->record.text_line_count = NULL;
>  
>  	logbuf_lock_irq();
>  	user->seq = prb_first_seq(prb);
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 16:12               ` John Ogness
@ 2020-02-06  9:12                 ` lijiang
  2020-02-13 13:07                 ` Petr Mladek
  1 sibling, 0 replies; 58+ messages in thread
From: lijiang @ 2020-02-06  9:12 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

> On 2020-02-05, lijiang <lijiang@redhat.com> wrote:
>> Do you have any suggestions about the size of CONFIG_LOG_* and
>> CONFIG_PRINTK_* options by default?
> 
> The new printk implementation consumes more than double the memory that
> the current printk implementation requires. This is because dictionaries
> and meta-data are now stored separately.
> 
> If the old defaults (LOG_BUF_SHIFT=17 LOG_CPU_MAX_BUF_SHIFT=12) were
> chosen because they are maximally acceptable defaults, then the defaults
> should be reduced by 1 so that the final size is "similar" to the
> current implementation.
> 
> If instead the defaults are left as-is, a machine with less than 64 CPUs
> will reserve 336KiB for printk information (128KiB text, 128KiB
> dictionary, 80KiB meta-data).
> 
> It might also be desirable to reduce the dictionary size (maybe 1/4 the
> size of text?). However, since the new printk implementation allows for
> non-intrusive dictionaries, we might see their usage increase and start
> to be as large as the messages themselves.
> 
> John Ogness
> 

Thanks for the explanation in detail.

Lianbo


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-01-28 16:19 [PATCH 0/2] printk: replace ringbuffer John Ogness
                   ` (2 preceding siblings ...)
  2020-02-05  4:25 ` [PATCH 0/2] printk: replace ringbuffer lijiang
@ 2020-02-06  9:21 ` lijiang
  3 siblings, 0 replies; 58+ messages in thread
From: lijiang @ 2020-02-06  9:21 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On 2020-01-29 00:19, John Ogness wrote:
> Hello,
> 
> After several RFC series [0][1][2][3][4], here is the first set of
> patches to rework the printk subsystem. This first set of patches
> only replaces the existing ringbuffer implementation. No locking is
> removed. No semantics/behavior of printk are changed.
> 
> The VMCOREINFO is updated, which will require changes to the
> external crash [5] tool. I will be preparing a patch to add support
> for the new VMCOREINFO.
> 
In addition to changing the crash utility, I would think that the
kexec-tools (such as vmcore-dmesg and makedumpfile) also need to
be modified accordingly.

Thanks
Lianbo

> This series is in line with the agreements [6] made at the meeting
> during LPC2019 in Lisbon, with 1 exception: support for dictionaries
> will _not_ be discontinued [7]. Dictionaries are stored in a separate
> buffer so that they cannot interfere with the human-readable buffer.
> 
> John Ogness
> 
> [0] https://lkml.kernel.org/r/20190212143003.48446-1-john.ogness@linutronix.de
> [1] https://lkml.kernel.org/r/20190607162349.18199-1-john.ogness@linutronix.de
> [2] https://lkml.kernel.org/r/20190727013333.11260-1-john.ogness@linutronix.de
> [3] https://lkml.kernel.org/r/20190807222634.1723-1-john.ogness@linutronix.de
> [4] https://lkml.kernel.org/r/20191128015235.12940-1-john.ogness@linutronix.de
> [5] https://github.com/crash-utility/crash
> [6] https://lkml.kernel.org/r/87k1acz5rx.fsf@linutronix.de
> [7] https://lkml.kernel.org/r/20191007120134.ciywr3wale4gxa6v@pathway.suse.cz
> 
> John Ogness (2):
>   printk: add lockless buffer
>   printk: use the lockless ringbuffer
> 
>  include/linux/kmsg_dump.h         |    2 -
>  kernel/printk/Makefile            |    1 +
>  kernel/printk/printk.c            |  836 +++++++++---------
>  kernel/printk/printk_ringbuffer.c | 1370 +++++++++++++++++++++++++++++
>  kernel/printk/printk_ringbuffer.h |  328 +++++++
>  5 files changed, 2114 insertions(+), 423 deletions(-)
>  create mode 100644 kernel/printk/printk_ringbuffer.c
>  create mode 100644 kernel/printk/printk_ringbuffer.h
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 15:48               ` John Ogness
                                   ` (2 preceding siblings ...)
  2020-02-06  7:30                 ` lijiang
@ 2020-02-07  1:40                 ` Steven Rostedt
  2020-02-07  7:43                   ` John Ogness
  2020-02-14 15:56                 ` Petr Mladek
  4 siblings, 1 reply; 58+ messages in thread
From: Steven Rostedt @ 2020-02-07  1:40 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Petr Mladek,
	Peter Zijlstra, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Wed, 05 Feb 2020 16:48:32 +0100
John Ogness <john.ogness@linutronix.de> wrote:

> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d0d24ee1d1f4..5ad67ff60cd9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  	user->record.text_buf_size = sizeof(user->text_buf);
>  	user->record.dict_buf = &user->dict_buf[0];
>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> +	user->record.text_line_count = NULL;
>  
>  	logbuf_lock_irq();
>  	user->seq = prb_first_seq(prb);

FYI, I used your patch set to test out Konstantin's new get-lore-mbox
script, and then applied them. It locked up on boot up as well, and
applying this appears to fix it.

-- Steve

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-07  1:40                 ` Steven Rostedt
@ 2020-02-07  7:43                   ` John Ogness
  0 siblings, 0 replies; 58+ messages in thread
From: John Ogness @ 2020-02-07  7:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Petr Mladek,
	Peter Zijlstra, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-07, Steven Rostedt <rostedt@goodmis.org> wrote:
>> The quick fixup:
>> 
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index d0d24ee1d1f4..5ad67ff60cd9 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>>  	user->record.text_buf_size = sizeof(user->text_buf);
>>  	user->record.dict_buf = &user->dict_buf[0];
>>  	user->record.dict_buf_size = sizeof(user->dict_buf);
>> +	user->record.text_line_count = NULL;
>>  
>>  	logbuf_lock_irq();
>>  	user->seq = prb_first_seq(prb);
>
> FYI, I used your patch set to test out Konstantin's new get-lore-mbox
> script, and then applied them. It locked up on boot up as well, and
> applying this appears to fix it.

Yes, this is a horrible bug. In preparation for my v2 I implemented:

    prb_rec_init_rd()
    prb_rec_init_wr()

as static inline functions to initialize the records. There is a reader
and writer variant because they initialize the records differently:
readers provide buffers, writers request buffers. This eliminates the
manual twiddling with the record struct and ensures that the struct is
always properly initialized.
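For readers following along, here is a minimal userspace sketch of what
such init helpers might look like (simplified struct and hypothetical
names; the actual prb_rec_init_rd()/prb_rec_init_wr() in the v2 series
may differ in signature and fields):

```c
#include <stddef.h>

/* Simplified stand-in for the printk_record discussed in this thread. */
struct printk_record {
	char *text_buf;
	unsigned int text_buf_size;
	char *dict_buf;
	unsigned int dict_buf_size;
	unsigned int *text_line_count;	/* optional; NULL disables counting */
};

/* Reader variant: the reader provides the buffers to be filled. */
static inline void rec_init_rd(struct printk_record *r,
			       char *text_buf, unsigned int text_buf_size,
			       char *dict_buf, unsigned int dict_buf_size)
{
	r->text_buf = text_buf;
	r->text_buf_size = text_buf_size;
	r->dict_buf = dict_buf;
	r->dict_buf_size = dict_buf_size;
	r->text_line_count = NULL;	/* never left uninitialized */
}

/* Writer variant: the writer only requests buffer sizes. */
static inline void rec_init_wr(struct printk_record *r,
			       unsigned int text_buf_size,
			       unsigned int dict_buf_size)
{
	r->text_buf = NULL;
	r->text_buf_size = text_buf_size;
	r->dict_buf = NULL;
	r->dict_buf_size = dict_buf_size;
	r->text_line_count = NULL;
}
```

Because every field is assigned in one place, the devkmsg_open() bug
above (a forgotten text_line_count) cannot recur.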

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-01-28 16:19 ` [PATCH 2/2] printk: use the lockless ringbuffer John Ogness
@ 2020-02-13  9:07   ` Sergey Senozhatsky
  2020-02-13  9:42     ` John Ogness
  2020-02-14 13:29   ` lijiang
  2020-02-17 14:41   ` misc details: " Petr Mladek
  2 siblings, 1 reply; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-13  9:07 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/01/28 17:25), John Ogness wrote:
[..]
> -	while (user->seq == log_next_seq) {
> +	if (!prb_read_valid(prb, user->seq, r)) {
>  		if (file->f_flags & O_NONBLOCK) {
>  			ret = -EAGAIN;
>  			logbuf_unlock_irq();
> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>  
>  		logbuf_unlock_irq();
>  		ret = wait_event_interruptible(log_wait,
> -					       user->seq != log_next_seq);
> +					prb_read_valid(prb, user->seq, r));
>  		if (ret)
>  			goto out;
>  		logbuf_lock_irq();
>  	}
>  
> -	if (user->seq < log_first_seq) {
> -		/* our last seen message is gone, return error and reset */
> -		user->idx = log_first_idx;
> -		user->seq = log_first_seq;
> +	if (user->seq < r->info->seq) {
> +		/* the expected message is gone, return error and reset */
> +		user->seq = r->info->seq;
>  		ret = -EPIPE;
>  		logbuf_unlock_irq();
>  		goto out;
>  	}

Sorry, why doesn't this do something like

	if (user->seq < prb_first_seq(prb)) {
		/* the expected message is gone, return error and reset */
		user->seq = prb_first_seq(prb);
		ret = -EPIPE;
		...
	}

?

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-13  9:07   ` Sergey Senozhatsky
@ 2020-02-13  9:42     ` John Ogness
  2020-02-13 11:59       ` Sergey Senozhatsky
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-02-13  9:42 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-13, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>> -	while (user->seq == log_next_seq) {
>> +	if (!prb_read_valid(prb, user->seq, r)) {
>>  		if (file->f_flags & O_NONBLOCK) {
>>  			ret = -EAGAIN;
>>  			logbuf_unlock_irq();
>> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>>  
>>  		logbuf_unlock_irq();
>>  		ret = wait_event_interruptible(log_wait,
>> -					       user->seq != log_next_seq);
>> +					prb_read_valid(prb, user->seq, r));
>>  		if (ret)
>>  			goto out;
>>  		logbuf_lock_irq();
>>  	}
>>  
>> -	if (user->seq < log_first_seq) {
>> -		/* our last seen message is gone, return error and reset */
>> -		user->idx = log_first_idx;
>> -		user->seq = log_first_seq;
>> +	if (user->seq < r->info->seq) {
>> +		/* the expected message is gone, return error and reset */
>> +		user->seq = r->info->seq;
>>  		ret = -EPIPE;
>>  		logbuf_unlock_irq();
>>  		goto out;
>>  	}
>
> Sorry, why doesn't this do something like
>
> 	if (user->seq < prb_first_seq(prb)) {
> 		/* the expected message is gone, return error and reset */
> 		user->seq = prb_first_seq(prb);
> 		ret = -EPIPE;
> 		...
> 	}

Here prb_read_valid() was successful, so a record _was_ read. The
kerneldoc for prb_read_valid() says:

 * On success, the reader must check r->info.seq to see which record was
 * actually read.

The value will be either the requested user->seq or some higher value
if the record at user->seq is not available.

There are 2 reasons why user->seq is not available (and a later record
_is_ available):

1. The ringbuffer overtook user->seq. In this case, comparing and then
   setting using prb_first_seq() could be appropriate. And r->info->seq
   might even already be what prb_first_seq() would return. (More on
   this below.)

2. The record with user->seq has no data because the writer failed to
   allocate dataring space. In this case, resetting back to
   prb_first_seq() would be incorrect. And since r->info->seq is the
   next valid record, it is appropriate that the next devkmsg_read()
   starts there.

Rather than checking these cases separately, it is enough just to check
for the 2nd case. For the 1st case, prb_first_seq() could be less than
r->info->seq if all the preceding records have no data. But this just
means the whole set of records with missing data is skipped, which
matches existing behavior. (For example, currently when devkmsg is
behind 10 messages, there are not 10 -EPIPE returns. Instead it
immediately catches up to the next available record.)
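The catch-up behavior described above can be modeled with a toy example
(a hypothetical helper, much simpler than the real prb_read_valid(),
just to show that a whole run of data-less records is skipped in one
step):

```c
#include <stdbool.h>

/* Toy model: 10 records indexed by seq; the first three are data-less. */
#define NRECS 10
static const bool has_data[NRECS] = {
	false, false, false, true, true, true, true, true, true, true
};

/*
 * Mimics the described prb_read_valid() contract: return the first
 * readable record at or after the requested seq via *read_seq.
 */
static bool toy_read_valid(unsigned long seq, unsigned long *read_seq)
{
	while (seq < NRECS) {
		if (has_data[seq]) {
			*read_seq = seq;
			return true;
		}
		seq++;	/* data-less record: skip over it */
	}
	return false;
}
```

A reader starting at seq 0 lands directly on seq 3, i.e. one catch-up
(and one -EPIPE) rather than one per missing record.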

Perhaps the new comment should be:

/*
 * The expected message is gone, return error and
 * reset to the next available message.
 */

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-13  9:42     ` John Ogness
@ 2020-02-13 11:59       ` Sergey Senozhatsky
  2020-02-13 22:36         ` John Ogness
  0 siblings, 1 reply; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-13 11:59 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/13 10:42), John Ogness wrote:
> On 2020-02-13, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
> >> -	while (user->seq == log_next_seq) {
> >> +	if (!prb_read_valid(prb, user->seq, r)) {
> >>  		if (file->f_flags & O_NONBLOCK) {
> >>  			ret = -EAGAIN;
> >>  			logbuf_unlock_irq();
> >> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
> >>  
> >>  		logbuf_unlock_irq();
> >>  		ret = wait_event_interruptible(log_wait,
> >> -					       user->seq != log_next_seq);
> >> +					prb_read_valid(prb, user->seq, r));
> >>  		if (ret)
> >>  			goto out;
> >>  		logbuf_lock_irq();
> >>  	}
> >>  
> >> -	if (user->seq < log_first_seq) {
> >> -		/* our last seen message is gone, return error and reset */
> >> -		user->idx = log_first_idx;
> >> -		user->seq = log_first_seq;
> >> +	if (user->seq < r->info->seq) {
> >> +		/* the expected message is gone, return error and reset */
> >> +		user->seq = r->info->seq;
> >>  		ret = -EPIPE;
> >>  		logbuf_unlock_irq();
> >>  		goto out;
> >>  	}
> >
> > Sorry, why doesn't this do something like
> >
> > 	if (user->seq < prb_first_seq(prb)) {
> > 		/* the expected message is gone, return error and reset */
> > 		user->seq = prb_first_seq(prb);
> > 		ret = -EPIPE;
> > 		...
> > 	}
> 
> Here prb_read_valid() was successful, so a record _was_ read. The
> kerneldoc for the prb_read_valid() says:

Hmm, yeah. That's true.

OK, something weird...

I ran some random printk-pressure test (mostly printks from IRQs;
+ some NMI printk-s, but they are routed through nmi printk-safe
buffers; + some limited number of printk-safe printk-s, routed
via printk-safe buffer (so, once again, IRQ); + user-space
journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
is terminally broken

[..]
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
cat /dev/kmsg
cat: /dev/kmsg: Broken pipe
[..]

dmesg works; reading from /dev/kmsg doesn't. It did work, however,
before the test.

So I printed seq numbers from devkmsg_read() to a seq buffer and dumped
it via procfs, just seq numbers before we adjust user->seq (set to
r->seq) and after

+                       offt += snprintf(BUF + offt,
+                                       sizeof(BUF) - offt,
+                                       "%s: devkmsg_read() error %llu %llu %llu\n",
+                                       current->comm,
+                                       user->seq,
+                                       r->info->seq,
+                                       prb_first_seq(prb));


...
systemd-journal: devkmsg_read() error 1979235 1979236 1979236
systemd-journal: corrected seq 1979236 1979236
systemd-journal: devkmsg_read() error 1979237 1979243 1979243
systemd-journal: corrected seq 1979243 1979243
systemd-journal: devkmsg_read() error 1979244 1979250 1979250
systemd-journal: corrected seq 1979250 1979250
systemd-journal: devkmsg_read() error 1979251 1979257 1979257
systemd-journal: corrected seq 1979257 1979257
systemd-journal: devkmsg_read() error 1979258 1979265 1979265
systemd-journal: corrected seq 1979265 1979265
systemd-journal: devkmsg_read() error 1979266 1979272 1979272
systemd-journal: corrected seq 1979272 1979272
systemd-journal: devkmsg_read() error 1979272 1979273 1979273
systemd-journal: corrected seq 1979273 1979273
systemd-journal: devkmsg_read() error 1979274 1979280 1979280
systemd-journal: corrected seq 1979280 1979280
systemd-journal: devkmsg_read() error 1979281 1982465 1980933
systemd-journal: corrected seq 1982465 1982465
cat: devkmsg_read() error 1980987 1982531 1980987
cat: corrected seq 1982531 1982531
cat: devkmsg_read() error 1981015 1982563 1981015
cat: corrected seq 1982563 1982563
...
cat: devkmsg_read() error 1981080 1982633 1981080
cat: corrected seq 1982633 1982633
...
cat: devkmsg_read() error 1981095 1982652 1981095
cat: corrected seq 1982652 1982652
...


What's up with that user->seq counter?

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 16:12               ` John Ogness
  2020-02-06  9:12                 ` lijiang
@ 2020-02-13 13:07                 ` Petr Mladek
  2020-02-14  1:07                   ` Sergey Senozhatsky
  1 sibling, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-13 13:07 UTC (permalink / raw)
  To: John Ogness
  Cc: lijiang, Sergey Senozhatsky, Peter Zijlstra, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Wed 2020-02-05 17:12:12, John Ogness wrote:
> On 2020-02-05, lijiang <lijiang@redhat.com> wrote:
> > Do you have any suggestions about the size of CONFIG_LOG_* and
> > CONFIG_PRINTK_* options by default?
> 
> The new printk implementation consumes more than double the memory that
> the current printk implementation requires. This is because dictionaries
> and meta-data are now stored separately.
> 
> If the old defaults (LOG_BUF_SHIFT=17 LOG_CPU_MAX_BUF_SHIFT=12) were
> chosen because they are maximally acceptable defaults, then the defaults
> should be reduced by 1 so that the final size is "similar" to the
> current implementation.
>
> If instead the defaults are left as-is, a machine with less than 64 CPUs
> will reserve 336KiB for printk information (128KiB text, 128KiB
> dictionary, 80KiB meta-data).
> 
> It might also be desirable to reduce the dictionary size (maybe 1/4 the
> size of text?).

Good questions. It would be great to check the usage on some real
systems.

In each case, we should inform users when messages and/or dictionaries
were lost.

Also, it would be great to have a way (a function) to show how much
of the two ring buffers is occupied by valid data. It might also be
useful for detecting problems with the ring buffer:

   + too much space reserved but not committed

   + too many records invalidated because of different ordering
     in the desc ring and the data ring.
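A rough sketch of how such an occupancy metric could be computed,
assuming the ring tracks monotonically growing logical positions
(hypothetical struct and field names, not the actual printk_ringbuffer
API):

```c
/*
 * Hypothetical sketch: if head/tail are monotonically growing logical
 * positions (lpos), the bytes occupied by valid data are head - tail,
 * regardless of how often the ring has wrapped.
 */
struct ring_stat {
	unsigned long size;		/* ring size in bytes */
	unsigned long tail_lpos;	/* start of oldest valid data */
	unsigned long head_lpos;	/* next position to be reserved */
};

static unsigned long ring_used(const struct ring_stat *r)
{
	return r->head_lpos - r->tail_lpos;	/* unsigned wrap is fine */
}

static unsigned long ring_used_percent(const struct ring_stat *r)
{
	return (ring_used(r) * 100) / r->size;
}
```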


> However, since the new printk implementation allows for
> non-intrusive dictionaries, we might see their usage increase and start
> to be as large as the messages themselves.

I wish the dictionaries were never added ;-) They complicate the code
and nobody knows how many people actually use the information.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-13 11:59       ` Sergey Senozhatsky
@ 2020-02-13 22:36         ` John Ogness
  2020-02-14  1:41           ` Sergey Senozhatsky
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-02-13 22:36 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-13, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>>>> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>>>>  
>>>>  		logbuf_unlock_irq();
>>>>  		ret = wait_event_interruptible(log_wait,
>>>> -					       user->seq != log_next_seq);
>>>> +					prb_read_valid(prb, user->seq, r));
>>>>  		if (ret)
>>>>  			goto out;
>>>>  		logbuf_lock_irq();
>>>>  	}
>>>>  
>>>> -	if (user->seq < log_first_seq) {
>>>> -		/* our last seen message is gone, return error and reset */
>>>> -		user->idx = log_first_idx;
>>>> -		user->seq = log_first_seq;
>>>> +	if (user->seq < r->info->seq) {
>>>> +		/* the expected message is gone, return error and reset */
>>>> +		user->seq = r->info->seq;
>>>>  		ret = -EPIPE;
>>>>  		logbuf_unlock_irq();
>>>>  		goto out;
>>>>  	}
>>>
>>> Sorry, why doesn't this do something like
>>>
>>> 	if (user->seq < prb_first_seq(prb)) {
>>> 		/* the expected message is gone, return error and reset */
>>> 		user->seq = prb_first_seq(prb);
>>> 		ret = -EPIPE;
>>> 		...
>>> 	}
>> 
>> Here prb_read_valid() was successful, so a record _was_ read. The
>> kerneldoc for the prb_read_valid() says:
>
> Hmm, yeah. That's true.
>
> OK, something weird...
>
> I ran some random printk-pressure test (mostly printks from IRQs;
> + some NMI printk-s, but they are routed through nmi printk-safe
> buffers; + some limited number of printk-safe printk-s, routed
> via printk-safe buffer (so, once again, IRQ); + user-space
> journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
> is terminally broken
>
> [..]
> cat /dev/kmsg
> cat: /dev/kmsg: Broken pipe

In mainline you can have this "problem" as well. Once the ringbuffer has
wrapped, any read from a newly opened /dev/kmsg after a new message has
arrived will result in an EPIPE. This happens quite easily once the
ringbuffer has wrapped because each new message overwrites the oldest
message.

Although it can be convenient, cat(1) is actually a poor tool for
viewing the ringbuffer for this reason. Unfortunately dmesg(1) is
sub-optimal as well because it does not show the sequence numbers. So
with dmesg(1) you cannot see if a message was dropped. :-/

> So I printed seq numbers from devksmg read to a seq buffer and dumped
> it via procfs, just seq numbers before we adjust user->seq (set to
> r->seq) and after
>
> +                       offt += snprintf(BUF + offt,
> +                                       sizeof(BUF) - offt,
> +                                       "%s: devkmsg_read() error %llu %llu %llu\n",
> +                                       current->comm,
> +                                       user->seq,
> +                                       r->info->seq,
> +                                       prb_first_seq(prb));
>
>
> ...
> systemd-journal: devkmsg_read() error 1979281 1982465 1980933
> systemd-journal: corrected seq 1982465 1982465
> cat: devkmsg_read() error 1980987 1982531 1980987
> cat: corrected seq 1982531 1982531
> cat: devkmsg_read() error 1981015 1982563 1981015
> cat: corrected seq 1982563 1982563

The situation with a data-less record is the same as when the ringbuffer
wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
not going to help because it will not be able to get past that data-less
record.

We could implement it such that devkmsg_read() will skip over data-less
records instead of issuing an EPIPE. (That is what dmesg does.) But then
do we need EPIPE at all? The reader can see that it has missed records
by tracking the sequence number, so could we just get rid of EPIPE? Then
cat(1) would be a great tool to view the raw ringbuffer. Please share
your thoughts on this.
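For reference, a sketch of how a reader could detect drops purely from
sequence numbers (illustrative only; this is not existing /dev/kmsg or
kernel code):

```c
/*
 * Illustrative only: a /dev/kmsg reader can detect dropped records by
 * tracking sequence numbers itself, without any -EPIPE from the kernel.
 */
static unsigned long expected_seq;	/* next seq the reader expects */

/* Call with each record's seq; returns how many records were missed. */
static unsigned long note_record(unsigned long seq)
{
	unsigned long missed = seq - expected_seq;

	expected_seq = seq + 1;
	return missed;
}
```

With this, a tool like cat(1) could stream every available record and
still report gaps, rather than aborting on the first one.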


On a side note (but related to data-less records): I hacked the
ringbuffer code to inject data-less records at various times in order to
verify your report. And I stumbled upon a bug in the ringbuffer, which
can lead to an infinite loop in console_unlock(). The problem occurs at:

    retry = prb_read_valid(prb, console_seq, NULL);

which will erroneously return true if console_seq is pointing to a
data-less record but there are no valid records after it. The following
patch fixes the bug. And yes, for v2 I have added comments to the
desc_read_committed() code.

I now have 2 bugfixes queued up for v2. The first one is here[0].

[0] https://lkml.kernel.org/r/87wo919grz.fsf@linutronix.de

John Ogness


diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 796257f226ee..31893051ad6b 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -1074,6 +1071,7 @@ static int desc_read_committed(struct prb_desc_ring *desc_ring,
 			       unsigned long id, u64 seq,
 			       struct prb_desc *desc)
 {
+	struct prb_data_blk_lpos *blk_lpos = &desc->text_blk_lpos;
 	enum desc_state d_state;
 
 	d_state = desc_read(desc_ring, id, desc);
@@ -1084,6 +1082,11 @@ static int desc_read_committed(struct prb_desc_ring *desc_ring,
 	else if (d_state != desc_committed)
 		return -EINVAL;
 
+	if (blk_lpos->begin == INVALID_LPOS &&
+	    blk_lpos->next == INVALID_LPOS) {
+		return -ENOENT;
+	}
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-13 13:07                 ` Petr Mladek
@ 2020-02-14  1:07                   ` Sergey Senozhatsky
  0 siblings, 0 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-14  1:07 UTC (permalink / raw)
  To: Petr Mladek
  Cc: John Ogness, lijiang, Sergey Senozhatsky, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/13 14:07), Petr Mladek wrote:
> On Wed 2020-02-05 17:12:12, John Ogness wrote:
> > On 2020-02-05, lijiang <lijiang@redhat.com> wrote:
> > > Do you have any suggestions about the size of CONFIG_LOG_* and
> > > CONFIG_PRINTK_* options by default?
> > 
> > The new printk implementation consumes more than double the memory that
> > the current printk implementation requires. This is because dictionaries
> > and meta-data are now stored separately.
> > 
> > If the old defaults (LOG_BUF_SHIFT=17 LOG_CPU_MAX_BUF_SHIFT=12) were
> > chosen because they are maximally acceptable defaults, then the defaults
> > should be reduced by 1 so that the final size is "similar" to the
> > current implementation.
> >
> > If instead the defaults are left as-is, a machine with less than 64 CPUs
> > will reserve 336KiB for printk information (128KiB text, 128KiB
> > dictionary, 80KiB meta-data).
> > 
> > It might also be desirable to reduce the dictionary size (maybe 1/4 the
> > size of text?).
> 
> Good questions. It would be great to check the usage on some real
> systems.

[..]

> I wish the dictionaries were never added ;-) They complicate the code
> and nobody knows how many people actually use the information.

Maybe we can have CONFIG_PRINTK_EXTRA_PAYLOAD [for dicts] so people can
compile it out if it's not needed. This can save several bytes here and
there.

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-13 22:36         ` John Ogness
@ 2020-02-14  1:41           ` Sergey Senozhatsky
  2020-02-14  2:09             ` Sergey Senozhatsky
  2020-02-14  9:48             ` John Ogness
  0 siblings, 2 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-14  1:41 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Petr Mladek, Peter Zijlstra,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On (20/02/13 23:36), John Ogness wrote:
> >> Here prb_read_valid() was successful, so a record _was_ read. The
> >> kerneldoc for the prb_read_valid() says:
> >
> > Hmm, yeah. That's true.
> >
> > OK, something weird...
> >
> > I ran some random printk-pressure test (mostly printks from IRQs;
> > + some NMI printk-s, but they are routed through nmi printk-safe
> > buffers; + some limited number of printk-safe printk-s, routed
> > via printk-safe buffer (so, once again, IRQ); + user-space
> > journalctl -f syslog reader), and after the test 'cat /dev/kmsg'
> > is terminally broken
> >
> > [..]
> > cat /dev/kmsg
> > cat: /dev/kmsg: Broken pipe
>
> In mainline you can have this "problem" as well. Once the ringbuffer has
> wrapped, any read from a newly opened /dev/kmsg after a new message has
> arrived will result in an EPIPE. This happens quite easily once the ringbuffer
> has wrapped because each new message is overwriting the oldest message.

Hmm. Something doesn't add up.

Looking at the numbers, both r->info->seq and prb_first_seq(prb)
do increase, so there are new messages in the ring buffer

                           u->seq    r->seq    prb_first_seq
[..]
cat: devkmsg_read() error 1981080   1982633   1981080
cat: devkmsg_read() error 1981080   1982633   1981080
cat: devkmsg_read() error 1981095   1982652   1981095
cat: devkmsg_read() error 1981095   1982652   1981095
cat: devkmsg_read() error 1981095   1982652   1981095
[..]

but 'cat' still wouldn't read anything from the logbuf - EPIPE.

NOTE: I don't run 'cat /dev/kmsg' during the test. I run the test first,
then I run 'cat /dev/kmsg', after the test, when printk-pressure is gone.

I can't reproduce it with current logbuf. 'cat' reads from /dev/kmsg after
heavy printk-pressure test. So chances are some loggers can also experience
problems. This might be a regression.

> > ...
> > systemd-journal: devkmsg_read() error 1979281 1982465 1980933
> > systemd-journal: corrected seq 1982465 1982465
> > cat: devkmsg_read() error 1980987 1982531 1980987
> > cat: corrected seq 1982531 1982531
> > cat: devkmsg_read() error 1981015 1982563 1981015
> > cat: corrected seq 1982563 1982563
>
> The situation with a data-less record is the same as when the ringbuffer
> wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
> not going to help because it will not be able to get past that data-less
> record.

So maybe this is the case with broken 'cat' on my system?

> We could implement it such that devkmsg_read() will skip over data-less
> records instead of issuing an EPIPE. (That is what dmesg does.) But then
> do we need EPIPE at all? The reader can see that it has missed records
> by tracking the sequence number, so could we just get rid of EPIPE? Then
> cat(1) would be a great tool to view the raw ringbuffer. Please share
> your thoughts on this.

Looking at systemd/src/journal/journald-kmsg.c : server_read_dev_kmsg()
-EPIPE is just one of the errnos they handle, nothing special. Could it
be the case that some other loggers would have special handling for EPIPE?
I'm not sure, let's look around.

I'd say that EPIPE removal looks OK to me. But before we do that, I'm
not sure that we have a clear understanding of the 'cat /dev/kmsg' behaviour
change.

> On a side note (but related to data-less records): I hacked the
> ringbuffer code to inject data-less records at various times in order to
> verify your report. And I stumbled upon a bug in the ringbuffer, which
> can lead to an infinite loop in console_unlock(). The problem occurs at:
> 
>     retry = prb_read_valid(prb, console_seq, NULL);
> 
> which will erroneously return true if console_seq is pointing to a
> data-less record but there are no valid records after it. The following
> patch fixes the bug. And yes, for v2 I have added comments to the
> desc_read_committed() code.

That's great to know!

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-14  1:41           ` Sergey Senozhatsky
@ 2020-02-14  2:09             ` Sergey Senozhatsky
  2020-02-14  9:48             ` John Ogness
  1 sibling, 0 replies; 58+ messages in thread
From: Sergey Senozhatsky @ 2020-02-14  2:09 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel, Sergey Senozhatsky

On (20/02/14 10:41), Sergey Senozhatsky wrote:
> On (20/02/13 23:36), John Ogness wrote:
[..]
> > We could implement it such that devkmsg_read() will skip over data-less
> > records instead of issuing an EPIPE. (That is what dmesg does.) But then
> > do we need EPIPE at all? The reader can see that it has missed records
> > by tracking the sequence number, so could we just get rid of EPIPE? Then
> > cat(1) would be a great tool to view the raw ringbuffer. Please share
> > your thoughts on this.
> 
> Looking at systemd/src/journal/journald-kmsg.c : server_read_dev_kmsg()
> -EPIPE is just one of the errnos they handle, nothing special. Could it
> be the case that some other loggers would have special handling for EPIPE?
> I'm not sure, let's look around.

rsyslog

static void
readkmsg(void)
{
	int i;
	uchar pRcv[8192+1];
	char errmsg[2048];

	for (;;) {
		dbgprintf("imkmsg waiting for kernel log line\n");

		/* every read() from the opened device node receives one record of the printk buffer */
		i = read(fklog, pRcv, 8192);

		if (i > 0) {
			/* successful read of message of nonzero length */
			pRcv[i] = '\0';
		} else if (i == -EPIPE) {
			imkmsgLogIntMsg(LOG_WARNING,
					"imkmsg: some messages in circular buffer got overwritten");
			continue;
		} else {
			/* something went wrong - error or zero length message */
			if (i < 0 && errno != EINTR && errno != EAGAIN) {
				/* error occurred */
				imkmsgLogIntMsg(LOG_ERR,
				       "imkmsg: error reading kernel log - shutting down: %s",
					rs_strerror_r(errno, errmsg, sizeof(errmsg)));
				fklog = -1;
			}
			break;
		}

		submitSyslog(pRcv);
	}
}


So EPIPE errno better stay around.

	-ss

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-14  1:41           ` Sergey Senozhatsky
  2020-02-14  2:09             ` Sergey Senozhatsky
@ 2020-02-14  9:48             ` John Ogness
  1 sibling, 0 replies; 58+ messages in thread
From: John Ogness @ 2020-02-14  9:48 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky, Steven Rostedt,
	Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-14, Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:
>>> cat /dev/kmsg
>>> cat: /dev/kmsg: Broken pipe
>>
>> In mainline you can have this "problem" as well. Once the ringbuffer
>> has wrapped, any read from a newly opened /dev/kmsg after a new message
>> has arrived will result in an EPIPE. This happens quite easily once the
>> ringbuffer has wrapped because each new message is overwriting the
>> oldest message.
>
> Hmm. Something doesn't add up.
>
> Looking at the numbers, both r->info->seq and prb_first_seq(prb)
> do increase, so there are new messages in the ring buffer
>
>                            u->seq    r->seq    prb_first_seq
> [..]
> cat: devkmsg_read() error 1981080   1982633   1981080
> cat: devkmsg_read() error 1981080   1982633   1981080
> cat: devkmsg_read() error 1981095   1982652   1981095
> cat: devkmsg_read() error 1981095   1982652   1981095
> cat: devkmsg_read() error 1981095   1982652   1981095
> [..]
>
> but 'cat' still wouldn't read anything from the logbuf - EPIPE.
>
> NOTE: I don't run 'cat /dev/kmsg' during the test. I run the test
> first, then I run 'cat /dev/kmsg', after the test, when
> printk-pressure is gone.

Sure. The problem is not the printk-pressure. The problem is you have
data-less records in your ringbuffer (from your previous
printk-pressure). If you used your own program that continued to read
after EPIPE, then you would see the sequence numbers jumping over the
data-less records.

> I can't reproduce it with current logbuf. 'cat' reads from /dev/kmsg
> after heavy printk-pressure test. So chances are some loggers can also
> experience problems. This might be a regression.

Mainline doesn't have data-less records. In mainline such failed
printks are silently ignored (after attempting truncation). So for
mainline you can only reproduce the overflow case.

1. Boot 5.6.0-rc1 (without any console= slowing down printk)

2. Fill the ringbuffer and let it overflow with:

   $ while true; do echo filling buffer > /dev/kmsg; done &

3. Once you can see the ringbuffer has overflowed (and continues to
   overflow), try to read from /dev/kmsg

   $ strace head /dev/kmsg

In most cases you will see:

read(3, 0x7f7307ac1000, 4096)           = -1 EPIPE (Broken pipe)

Current readers need to be able to handle EPIPE. cat(1) does not and so
(unfortunately) is not a good candidate for reading the ringbuffer.

>>> ...
>>> systemd-journal: devkmsg_read() error 1979281 1982465 1980933
>>> systemd-journal: corrected seq 1982465 1982465
>>> cat: devkmsg_read() error 1980987 1982531 1980987
>>> cat: corrected seq 1982531 1982531
>>> cat: devkmsg_read() error 1981015 1982563 1981015
>>> cat: corrected seq 1982563 1982563
>>
>> The situation with a data-less record is the same as when the ringbuffer
>> wraps: cat is hitting that EPIPE. But re-opening the file descriptor is
>> not going to help because it will not be able to get past that data-less
>> record.
>
> So maybe this is the case with broken 'cat' on my system?

I think it is appropriate for an application to close the descriptor
after an EPIPE. /dev/kmsg is special because the reader should continue
reading anyway.

>> We could implement it such that devkmsg_read() will skip over data-less
>> records instead of issuing an EPIPE. (That is what dmesg does.) But then
> >> do we need EPIPE at all? The reader can see that it has missed records
>> by tracking the sequence number, so could we just get rid of EPIPE? Then
>> cat(1) would be a great tool to view the raw ringbuffer. Please share
>> your thoughts on this.
>
> Looking at systemd/src/journal/journald-kmsg.c :
> server_read_dev_kmsg() -EPIPE is just one of the errnos they handle,
> nothing special.

Yes, but what does systemd-journald do when the EPIPE is _not_ returned
and instead there is a jump in the sequence number? Looking at
dev_kmsg_record(), systemd actually does it the way I would hope. It
tracks the sequence number correctly.

    /* Did we lose any? */
    if (serial > *s->kernel_seqnum)
         server_driver_message(s, 0,
                               "MESSAGE_ID="
                               SD_MESSAGE_JOURNAL_MISSED_STR,
                               LOG_MESSAGE("Missed %"PRIu64" kernel messages",
                               serial - *s->kernel_seqnum),
                               NULL);

> Could it be the case that some other loggers would have special
> handling for EPIPE?  I'm not sure, let's look around.
>
> I'd say that EPIPE removal looks OK to me. But before we do that, I'm
> not sure that we have a clear understanding of the 'cat /dev/kmsg' behaviour
> change.

In mainline, with regard to /dev/kmsg, sequence numbers will never
jump. If there would be a jump (due to lost messages) then EPIPE is
issued. The reader can either:

1. continue reading and see the jump

2. reopen the file descriptor, possibly having missed a ton more
   messages due to reopening, and then start from the oldest available
   message

With my series, #2 is no longer an option because the lost messages
could exist in a part of the ringbuffer not yet overwritten.

If we remove EPIPE, then readers will need to track the sequence number
to identify jumps. systemd-journald does this already. And tools like
cat(1) would "just work" because cat does not care if messages were
lost.

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-01-28 16:19 ` [PATCH 2/2] printk: use the lockless ringbuffer John Ogness
  2020-02-13  9:07   ` Sergey Senozhatsky
@ 2020-02-14 13:29   ` lijiang
  2020-02-14 13:50     ` John Ogness
  2020-02-17 14:41   ` misc details: " Petr Mladek
  2 siblings, 1 reply; 58+ messages in thread
From: lijiang @ 2020-02-14 13:29 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

在 2020年01月29日 00:19, John Ogness 写道:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
> 
> Changes that exist due to the ringbuffer replacement:
> 
> - The VMCOREINFO has been updated for the new structures.
> 
> - Dictionary data is now stored in a separate data buffer from the
>   human-readable messages. The dictionary data buffer is set to the
>   same size as the message buffer. Therefore, the total reserved
>   memory for messages is 2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the
>   initial static buffer and 2x the specified size in the log_buf_len
>   kernel parameter.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>   the dynamic array.
> 
> Signed-off-by: John Ogness <john.ogness@linutronix.de>
> ---
>  include/linux/kmsg_dump.h |   2 -
>  kernel/printk/Makefile    |   1 +
>  kernel/printk/printk.c    | 836 +++++++++++++++++++-------------------
>  3 files changed, 416 insertions(+), 423 deletions(-)
> 
> diff --git a/include/linux/kmsg_dump.h b/include/linux/kmsg_dump.h
> index 2e7a1e032c71..ae6265033e31 100644
> --- a/include/linux/kmsg_dump.h
> +++ b/include/linux/kmsg_dump.h
> @@ -46,8 +46,6 @@ struct kmsg_dumper {
>  	bool registered;
>  
>  	/* private state of the kmsg iterator */
> -	u32 cur_idx;
> -	u32 next_idx;
>  	u64 cur_seq;
>  	u64 next_seq;
>  };
> diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
> index 4d052fc6bcde..eee3dc9b60a9 100644
> --- a/kernel/printk/Makefile
> +++ b/kernel/printk/Makefile
> @@ -2,3 +2,4 @@
>  obj-y	= printk.o
>  obj-$(CONFIG_PRINTK)	+= printk_safe.o
>  obj-$(CONFIG_A11Y_BRAILLE_CONSOLE)	+= braille.o
> +obj-$(CONFIG_PRINTK)	+= printk_ringbuffer.o
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 1ef6f75d92f1..d0d24ee1d1f4 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -56,6 +56,7 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/printk.h>
>  
> +#include "printk_ringbuffer.h"
>  #include "console_cmdline.h"
>  #include "braille.h"
>  #include "internal.h"
> @@ -294,30 +295,22 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> - *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> - *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> + * Every record meta-data carries the monotonic timestamp in microseconds, as
> + * well as the standard userspace syslog level and syslog facility. The usual
>   * kernel messages use LOG_KERN; userspace-injected messages always carry
>   * a matching syslog facility, by default LOG_USER. The origin of every
>   * message can be reliably determined that way.
>   *
> - * The human readable log message directly follows the message header. The
> - * length of the message text is stored in the header, the stored message
> - * is not terminated.
> + * The human readable log message of a record is available in @text, the length
> + * of the message text in @text_len. The stored message is not terminated.
>   *
> - * Optionally, a message can carry a dictionary of properties (key/value pairs),
> - * to provide userspace with a machine-readable message context.
> + * Optionally, a record can carry a dictionary of properties (key/value pairs),
> + * to provide userspace with a machine-readable message context. The length of
> + * the dictionary is available in @dict_len. The dictionary is not terminated.
>   *
>   * Examples for well-defined, commonly used property names are:
>   *   DEVICE=b12:8               device identifier
> @@ -331,21 +324,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT;
>   * follows directly after a '=' character. Every property is terminated by
>   * a '\0' character. The last property is not terminated.
>   *
> - * Example of a message structure:
> - *   0000  ff 8f 00 00 00 00 00 00      monotonic time in nsec
> - *   0008  34 00                        record is 52 bytes long
> - *   000a        0b 00                  text is 11 bytes long
> - *   000c              1f 00            dictionary is 23 bytes long
> - *   000e                    03 00      LOG_KERN (facility) LOG_ERR (level)
> - *   0010  69 74 27 73 20 61 20 6c      "it's a l"
> - *         69 6e 65                     "ine"
> - *   001b           44 45 56 49 43      "DEVIC"
> - *         45 3d 62 38 3a 32 00 44      "E=b8:2\0D"
> - *         52 49 56 45 52 3d 62 75      "RIVER=bu"
> - *         67                           "g"
> - *   0032     00 00 00                  padding to next message header
> - *
> - * The 'struct printk_log' buffer header must never be directly exported to
> + * Example of record values:
> + *   record.text_buf       = "it's a line" (unterminated)
> + *   record.dict_buf       = "DEVICE=b8:2\0DRIVER=bug" (unterminated)
> + *   record.info.seq       = 56
> + *   record.info.ts_sec    = 36863
> + *   record.info.text_len  = 11
> + *   record.info.dict_len  = 22
> + *   record.info.facility  = 0 (LOG_KERN)
> + *   record.info.flags     = 0
> + *   record.info.level     = 3 (LOG_ERR)
> + *   record.info.caller_id = 299 (task 299)
> + *
> + * The 'struct printk_info' buffer must never be directly exported to
>   * userspace, it is a kernel-private implementation detail that might
>   * need to be changed in the future, when the requirements change.
>   *
> @@ -365,23 +356,6 @@ enum log_flags {
>  	LOG_CONT	= 8,	/* text is a fragment of a continuation line */
>  };
>  
> -struct printk_log {
> -	u64 ts_nsec;		/* timestamp in nanoseconds */
> -	u16 len;		/* length of entire record */
> -	u16 text_len;		/* length of text buffer */
> -	u16 dict_len;		/* length of dictionary buffer */
> -	u8 facility;		/* syslog facility */
> -	u8 flags:5;		/* internal record flags */
> -	u8 level:3;		/* syslog level */
> -#ifdef CONFIG_PRINTK_CALLER
> -	u32 caller_id;            /* thread id or processor id */
> -#endif
> -}
> -#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> -__packed __aligned(4)
> -#endif
> -;
> -
>  /*
>   * The logbuf_lock protects kmsg buffer, indices, counters.  This can be taken
>   * within the scheduler's rq lock. It must be released before calling
> @@ -421,26 +395,17 @@ DEFINE_RAW_SPINLOCK(logbuf_lock);
>  DECLARE_WAIT_QUEUE_HEAD(log_wait);
>  /* the next printk record to read by syslog(READ) or /proc/kmsg */
>  static u64 syslog_seq;
> -static u32 syslog_idx;
>  static size_t syslog_partial;
>  static bool syslog_time;
> -
> -/* index and sequence number of the first record stored in the buffer */
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -
> -/* index and sequence number of the next record to store in the buffer */
> -static u64 log_next_seq;
> -static u32 log_next_idx;
> +DECLARE_PRINTKRB_RECORD(syslog_record, CONSOLE_EXT_LOG_MAX);
>  
>  /* the next printk record to write to the console */
>  static u64 console_seq;
> -static u32 console_idx;
>  static u64 exclusive_console_stop_seq;
> +DECLARE_PRINTKRB_RECORD(console_record, CONSOLE_EXT_LOG_MAX);
>  
>  /* the next printk record to read after the last 'clear' command */
>  static u64 clear_seq;
> -static u32 clear_idx;
>  
>  #ifdef CONFIG_PRINTK_CALLER
>  #define PREFIX_MAX		48
> @@ -453,13 +418,28 @@ static u32 clear_idx;
>  #define LOG_FACILITY(v)		((v) >> 3 & 0xff)
>  
>  /* record buffer */
> -#define LOG_ALIGN __alignof__(struct printk_log)
> +#define LOG_ALIGN __alignof__(unsigned long)
>  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
>  #define LOG_BUF_LEN_MAX (u32)(1 << 31)
>  static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
>  static char *log_buf = __log_buf;
>  static u32 log_buf_len = __LOG_BUF_LEN;
>  
> +/*
> + * Define the average message size. This only affects the number of
> + * descriptors that will be available. Underestimating is better than
> + * overestimating (too many available descriptors is better than not enough).
> + * The dictionary buffer will be the same size as the text buffer.
> + */
> +#define PRB_AVGBITS 6
> +
> +_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> +		  PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
> +
> +static struct printk_ringbuffer printk_rb_dynamic;
> +
> +static struct printk_ringbuffer *prb = &printk_rb_static;
> +
>  /* Return log buffer address */
>  char *log_buf_addr_get(void)
>  {
> @@ -472,108 +452,6 @@ u32 log_buf_len_get(void)
>  	return log_buf_len;
>  }
>  
> -/* human readable text of the record */
> -static char *log_text(const struct printk_log *msg)
> -{
> -	return (char *)msg + sizeof(struct printk_log);
> -}
> -
> -/* optional key/value pair dictionary attached to the record */
> -static char *log_dict(const struct printk_log *msg)
> -{
> -	return (char *)msg + sizeof(struct printk_log) + msg->text_len;
> -}
> -
> -/* get record by index; idx must point to valid msg */
> -static struct printk_log *log_from_idx(u32 idx)
> -{
> -	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> -	/*
> -	 * A length == 0 record is the end of buffer marker. Wrap around and
> -	 * read the message at the start of the buffer.
> -	 */
> -	if (!msg->len)
> -		return (struct printk_log *)log_buf;
> -	return msg;
> -}
> -
> -/* get next record; idx must point to valid msg */
> -static u32 log_next(u32 idx)
> -{
> -	struct printk_log *msg = (struct printk_log *)(log_buf + idx);
> -
> -	/* length == 0 indicates the end of the buffer; wrap */
> -	/*
> -	 * A length == 0 record is the end of buffer marker. Wrap around and
> -	 * read the message at the start of the buffer as *this* one, and
> -	 * return the one after that.
> -	 */
> -	if (!msg->len) {
> -		msg = (struct printk_log *)log_buf;
> -		return msg->len;
> -	}
> -	return idx + msg->len;
> -}
> -
> -/*
> - * Check whether there is enough free space for the given message.
> - *
> - * The same values of first_idx and next_idx mean that the buffer
> - * is either empty or full.
> - *
> - * If the buffer is empty, we must respect the position of the indexes.
> - * They cannot be reset to the beginning of the buffer.
> - */
> -static int logbuf_has_space(u32 msg_size, bool empty)
> -{
> -	u32 free;
> -
> -	if (log_next_idx > log_first_idx || empty)
> -		free = max(log_buf_len - log_next_idx, log_first_idx);
> -	else
> -		free = log_first_idx - log_next_idx;
> -
> -	/*
> -	 * We need space also for an empty header that signalizes wrapping
> -	 * of the buffer.
> -	 */
> -	return free >= msg_size + sizeof(struct printk_log);
> -}
> -
> -static int log_make_free_space(u32 msg_size)
> -{
> -	while (log_first_seq < log_next_seq &&
> -	       !logbuf_has_space(msg_size, false)) {
> -		/* drop old messages until we have enough contiguous space */
> -		log_first_idx = log_next(log_first_idx);
> -		log_first_seq++;
> -	}
> -
> -	if (clear_seq < log_first_seq) {
> -		clear_seq = log_first_seq;
> -		clear_idx = log_first_idx;
> -	}
> -
> -	/* sequence numbers are equal, so the log buffer is empty */
> -	if (logbuf_has_space(msg_size, log_first_seq == log_next_seq))
> -		return 0;
> -
> -	return -ENOMEM;
> -}
> -
> -/* compute the message size including the padding bytes */
> -static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
> -{
> -	u32 size;
> -
> -	size = sizeof(struct printk_log) + text_len + dict_len;
> -	*pad_len = (-size) & (LOG_ALIGN - 1);
> -	size += *pad_len;
> -
> -	return size;
> -}
> -
>  /*
>   * Define how much of the log buffer we could take at maximum. The value
>   * must be greater than two. Note that only half of the buffer is available
> @@ -582,22 +460,26 @@ static u32 msg_used_size(u16 text_len, u16 dict_len, u32 *pad_len)
>  #define MAX_LOG_TAKE_PART 4
>  static const char trunc_msg[] = "<truncated>";
>  
> -static u32 truncate_msg(u16 *text_len, u16 *trunc_msg_len,
> -			u16 *dict_len, u32 *pad_len)
> +static void truncate_msg(u16 *text_len, u16 *trunc_msg_len, u16 *dict_len)
>  {
>  	/*
>  	 * The message should not take the whole buffer. Otherwise, it might
>  	 * get removed too soon.
>  	 */
>  	u32 max_text_len = log_buf_len / MAX_LOG_TAKE_PART;
> +
>  	if (*text_len > max_text_len)
>  		*text_len = max_text_len;
> -	/* enable the warning message */
> +
> +	/* enable the warning message (if there is room) */
>  	*trunc_msg_len = strlen(trunc_msg);
> +	if (*text_len >= *trunc_msg_len)
> +		*text_len -= *trunc_msg_len;
> +	else
> +		*trunc_msg_len = 0;
> +
>  	/* disable the "dict" completely */
>  	*dict_len = 0;
> -	/* compute the size again, count also the warning message */
> -	return msg_used_size(*text_len + *trunc_msg_len, 0, pad_len);
>  }
>  
>  /* insert record into the buffer, discard old ones, update heads */
> @@ -606,60 +488,42 @@ static int log_store(u32 caller_id, int facility, int level,
>  		     const char *dict, u16 dict_len,
>  		     const char *text, u16 text_len)
>  {
> -	struct printk_log *msg;
> -	u32 size, pad_len;
> +	struct prb_reserved_entry e;
> +	struct printk_record r;
>  	u16 trunc_msg_len = 0;
>  
> -	/* number of '\0' padding bytes to next message */
> -	size = msg_used_size(text_len, dict_len, &pad_len);
> +	r.text_buf_size = text_len;
> +	r.dict_buf_size = dict_len;
>  
> -	if (log_make_free_space(size)) {
> +	if (!prb_reserve(&e, prb, &r)) {
>  		/* truncate the message if it is too long for empty buffer */
> -		size = truncate_msg(&text_len, &trunc_msg_len,
> -				    &dict_len, &pad_len);
> +		truncate_msg(&text_len, &trunc_msg_len, &dict_len);
> +		r.text_buf_size = text_len + trunc_msg_len;
> +		r.dict_buf_size = dict_len;
>  		/* survive when the log buffer is too small for trunc_msg */
> -		if (log_make_free_space(size))
> +		if (!prb_reserve(&e, prb, &r))
>  			return 0;
>  	}
>  
> -	if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
> -		/*
> -		 * This message + an additional empty header does not fit
> -		 * at the end of the buffer. Add an empty header with len == 0
> -		 * to signify a wrap around.
> -		 */
> -		memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
> -		log_next_idx = 0;
> -	}
> -
>  	/* fill message */
> -	msg = (struct printk_log *)(log_buf + log_next_idx);
> -	memcpy(log_text(msg), text, text_len);
> -	msg->text_len = text_len;
> -	if (trunc_msg_len) {
> -		memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
> -		msg->text_len += trunc_msg_len;
> -	}
> -	memcpy(log_dict(msg), dict, dict_len);
> -	msg->dict_len = dict_len;
> -	msg->facility = facility;
> -	msg->level = level & 7;
> -	msg->flags = flags & 0x1f;
> +	memcpy(&r.text_buf[0], text, text_len);
> +	if (trunc_msg_len)
> +		memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
> +	if (r.dict_buf)
> +		memcpy(&r.dict_buf[0], dict, dict_len);
> +	r.info->facility = facility;
> +	r.info->level = level & 7;
> +	r.info->flags = flags & 0x1f;
>  	if (ts_nsec > 0)
> -		msg->ts_nsec = ts_nsec;
> +		r.info->ts_nsec = ts_nsec;
>  	else
> -		msg->ts_nsec = local_clock();
> -#ifdef CONFIG_PRINTK_CALLER
> -	msg->caller_id = caller_id;
> -#endif
> -	memset(log_dict(msg) + dict_len, 0, pad_len);
> -	msg->len = size;
> +		r.info->ts_nsec = local_clock();
> +	r.info->caller_id = caller_id;
>  
>  	/* insert message */
> -	log_next_idx += msg->len;
> -	log_next_seq++;
> +	prb_commit(&e);
>  
> -	return msg->text_len;
> +	return text_len;
>  }
>  
>  int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
> @@ -711,13 +575,13 @@ static void append_char(char **pp, char *e, char c)
>  		*(*pp)++ = c;
>  }
>  
> -static ssize_t msg_print_ext_header(char *buf, size_t size,
> -				    struct printk_log *msg, u64 seq)
> +static ssize_t info_print_ext_header(char *buf, size_t size,
> +				     struct printk_info *info)
>  {
> -	u64 ts_usec = msg->ts_nsec;
> +	u64 ts_usec = info->ts_nsec;
>  	char caller[20];
>  #ifdef CONFIG_PRINTK_CALLER
> -	u32 id = msg->caller_id;
> +	u32 id = info->caller_id;
>  
>  	snprintf(caller, sizeof(caller), ",caller=%c%u",
>  		 id & 0x80000000 ? 'C' : 'T', id & ~0x80000000);
> @@ -728,8 +592,8 @@ static ssize_t msg_print_ext_header(char *buf, size_t size,
>  	do_div(ts_usec, 1000);
>  
>  	return scnprintf(buf, size, "%u,%llu,%llu,%c%s;",
> -			 (msg->facility << 3) | msg->level, seq, ts_usec,
> -			 msg->flags & LOG_CONT ? 'c' : '-', caller);
> +			 (info->facility << 3) | info->level, info->seq,
> +			 ts_usec, info->flags & LOG_CONT ? 'c' : '-', caller);
>  }
>  
>  static ssize_t msg_print_ext_body(char *buf, size_t size,
> @@ -783,10 +647,14 @@ static ssize_t msg_print_ext_body(char *buf, size_t size,
>  /* /dev/kmsg - userspace message inject/listen interface */
>  struct devkmsg_user {
>  	u64 seq;
> -	u32 idx;
>  	struct ratelimit_state rs;
>  	struct mutex lock;
>  	char buf[CONSOLE_EXT_LOG_MAX];
> +
> +	struct printk_info info;
> +	char text_buf[CONSOLE_EXT_LOG_MAX];
> +	char dict_buf[CONSOLE_EXT_LOG_MAX];
> +	struct printk_record record;
>  };
>  
>  static __printf(3, 4) __cold
> @@ -869,7 +737,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>  			    size_t count, loff_t *ppos)
>  {
>  	struct devkmsg_user *user = file->private_data;
> -	struct printk_log *msg;
> +	struct printk_record *r = &user->record;
>  	size_t len;
>  	ssize_t ret;
>  
> @@ -881,7 +749,7 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>  		return ret;
>  
>  	logbuf_lock_irq();
> -	while (user->seq == log_next_seq) {
> +	if (!prb_read_valid(prb, user->seq, r)) {
>  		if (file->f_flags & O_NONBLOCK) {
>  			ret = -EAGAIN;
>  			logbuf_unlock_irq();
> @@ -890,30 +758,26 @@ static ssize_t devkmsg_read(struct file *file, char __user *buf,
>  
>  		logbuf_unlock_irq();
>  		ret = wait_event_interruptible(log_wait,
> -					       user->seq != log_next_seq);
> +					prb_read_valid(prb, user->seq, r));
>  		if (ret)
>  			goto out;
>  		logbuf_lock_irq();
>  	}
>  
> -	if (user->seq < log_first_seq) {
> -		/* our last seen message is gone, return error and reset */
> -		user->idx = log_first_idx;
> -		user->seq = log_first_seq;
> +	if (user->seq < r->info->seq) {
> +		/* the expected message is gone, return error and reset */
> +		user->seq = r->info->seq;
>  		ret = -EPIPE;
>  		logbuf_unlock_irq();
>  		goto out;
>  	}
>  
> -	msg = log_from_idx(user->idx);
> -	len = msg_print_ext_header(user->buf, sizeof(user->buf),
> -				   msg, user->seq);
> +	len = info_print_ext_header(user->buf, sizeof(user->buf), r->info);
>  	len += msg_print_ext_body(user->buf + len, sizeof(user->buf) - len,
> -				  log_dict(msg), msg->dict_len,
> -				  log_text(msg), msg->text_len);
> +				  &r->dict_buf[0], r->info->dict_len,
> +				  &r->text_buf[0], r->info->text_len);
>  
> -	user->idx = log_next(user->idx);
> -	user->seq++;
> +	user->seq = r->info->seq + 1;
>  	logbuf_unlock_irq();
>  
>  	if (len > count) {
> @@ -945,8 +809,7 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
>  	switch (whence) {
>  	case SEEK_SET:
>  		/* the first record */
> -		user->idx = log_first_idx;
> -		user->seq = log_first_seq;
> +		user->seq = prb_first_seq(prb);
>  		break;
>  	case SEEK_DATA:
>  		/*
> @@ -954,13 +817,11 @@ static loff_t devkmsg_llseek(struct file *file, loff_t offset, int whence)
>  		 * like issued by 'dmesg -c'. Reading /dev/kmsg itself
>  		 * changes no global state, and does not clear anything.
>  		 */
> -		user->idx = clear_idx;
>  		user->seq = clear_seq;
>  		break;
>  	case SEEK_END:
>  		/* after the last record */
> -		user->idx = log_next_idx;
> -		user->seq = log_next_seq;
> +		user->seq = prb_next_seq(prb);
>  		break;
>  	default:
>  		ret = -EINVAL;
> @@ -980,9 +841,9 @@ static __poll_t devkmsg_poll(struct file *file, poll_table *wait)
>  	poll_wait(file, &log_wait, wait);
>  
>  	logbuf_lock_irq();
> -	if (user->seq < log_next_seq) {
> +	if (prb_read_valid(prb, user->seq, NULL)) {
>  		/* return error when data has vanished underneath us */
> -		if (user->seq < log_first_seq)
> +		if (user->seq < prb_first_seq(prb))
>  			ret = EPOLLIN|EPOLLRDNORM|EPOLLERR|EPOLLPRI;
>  		else
>  			ret = EPOLLIN|EPOLLRDNORM;
> @@ -1017,9 +878,14 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  
>  	mutex_init(&user->lock);
>  
> +	user->record.info = &user->info;
> +	user->record.text_buf = &user->text_buf[0];
> +	user->record.text_buf_size = sizeof(user->text_buf);
> +	user->record.dict_buf = &user->dict_buf[0];
> +	user->record.dict_buf_size = sizeof(user->dict_buf);
> +
>  	logbuf_lock_irq();
> -	user->idx = log_first_idx;
> -	user->seq = log_first_seq;
> +	user->seq = prb_first_seq(prb);
>  	logbuf_unlock_irq();
>  
>  	file->private_data = user;
> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>  {
>  	VMCOREINFO_SYMBOL(log_buf);
>  	VMCOREINFO_SYMBOL(log_buf_len);

Hi, John Ogness

I notice that the "prb" (printk_rb_static) symbol is not exported into vmcoreinfo, e.g. as follows:

+	VMCOREINFO_SYMBOL(prb);

Should the "prb" (printk_rb_static) symbol be exported into vmcoreinfo? Otherwise, do you
happen to know how to walk through the log_buf and get all kernel logs from a vmcore?

Thanks.
Lianbo

> -	VMCOREINFO_SYMBOL(log_first_idx);
> -	VMCOREINFO_SYMBOL(clear_idx);
> -	VMCOREINFO_SYMBOL(log_next_idx);
>  	/*
> -	 * Export struct printk_log size and field offsets. User space tools can
> -	 * parse it and detect any changes to structure down the line.
> +	 * Export struct printk_info size and field offsets. User space tools
> +	 * can parse it and detect any changes to structure down the line.
>  	 */
> -	VMCOREINFO_STRUCT_SIZE(printk_log);
> -	VMCOREINFO_OFFSET(printk_log, ts_nsec);
> -	VMCOREINFO_OFFSET(printk_log, len);
> -	VMCOREINFO_OFFSET(printk_log, text_len);
> -	VMCOREINFO_OFFSET(printk_log, dict_len);
> -#ifdef CONFIG_PRINTK_CALLER
> -	VMCOREINFO_OFFSET(printk_log, caller_id);
> -#endif
> +	VMCOREINFO_STRUCT_SIZE(printk_info);
> +	VMCOREINFO_OFFSET(printk_info, seq);
> +	VMCOREINFO_OFFSET(printk_info, ts_nsec);
> +	VMCOREINFO_OFFSET(printk_info, text_len);
> +	VMCOREINFO_OFFSET(printk_info, dict_len);
> +	VMCOREINFO_OFFSET(printk_info, caller_id);
>  }
>  #endif
>  
> @@ -1146,11 +1007,55 @@ static void __init log_buf_add_cpu(void)
>  static inline void log_buf_add_cpu(void) {}
>  #endif /* CONFIG_SMP */
>  
> +static unsigned int __init add_to_rb(struct printk_ringbuffer *rb,
> +				     struct printk_record *r)
> +{
> +	struct printk_info info;
> +	struct printk_record dest_r = {
> +		.info = &info,
> +		.text_buf_size = r->info->text_len,
> +		.dict_buf_size = r->info->dict_len,
> +	};
> +	struct prb_reserved_entry e;
> +
> +	if (!prb_reserve(&e, rb, &dest_r))
> +		return 0;
> +
> +	memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size);
> +	if (dest_r.dict_buf) {
> +		memcpy(&dest_r.dict_buf[0], &r->dict_buf[0],
> +		       dest_r.dict_buf_size);
> +	}
> +	dest_r.info->facility = r->info->facility;
> +	dest_r.info->level = r->info->level;
> +	dest_r.info->flags = r->info->flags;
> +	dest_r.info->ts_nsec = r->info->ts_nsec;
> +	dest_r.info->caller_id = r->info->caller_id;
> +
> +	prb_commit(&e);
> +
> +	return prb_record_text_space(&e);
> +}
> +
> +static char setup_text_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
> +
>  void __init setup_log_buf(int early)
>  {
> +	struct prb_desc *new_descs;
> +	struct printk_info info;
> +	struct printk_record r = {
> +		.info = &info,
> +		.text_buf = &setup_text_buf[0],
> +		.dict_buf = &setup_dict_buf[0],
> +		.text_buf_size = sizeof(setup_text_buf),
> +		.dict_buf_size = sizeof(setup_dict_buf),
> +	};
>  	unsigned long flags;
> +	char *new_dict_buf;
>  	char *new_log_buf;
>  	unsigned int free;
> +	u64 seq;
>  
>  	if (log_buf != __log_buf)
>  		return;
> @@ -1163,17 +1068,46 @@ void __init setup_log_buf(int early)
>  
>  	new_log_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
>  	if (unlikely(!new_log_buf)) {
> -		pr_err("log_buf_len: %lu bytes not available\n",
> +		pr_err("log_buf_len: %lu text bytes not available\n",
>  			new_log_buf_len);
>  		return;
>  	}
>  
> +	new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN);
> +	if (unlikely(!new_dict_buf)) {
> +		/* dictionary failure is allowed */
> +		pr_err("log_buf_len: %lu dict bytes not available\n",
> +			new_log_buf_len);
> +	}
> +
> +	new_descs = memblock_alloc((new_log_buf_len >> PRB_AVGBITS) *
> +				   sizeof(struct prb_desc), LOG_ALIGN);
> +	if (unlikely(!new_descs)) {
> +		pr_err("log_buf_len: %lu desc bytes not available\n",
> +			new_log_buf_len >> PRB_AVGBITS);
> +		if (new_dict_buf)
> +			memblock_free(__pa(new_dict_buf), new_log_buf_len);
> +		memblock_free(__pa(new_log_buf), new_log_buf_len);
> +		return;
> +	}
> +
>  	logbuf_lock_irqsave(flags);
> +
> +	prb_init(&printk_rb_dynamic,
> +		 new_log_buf, bits_per(new_log_buf_len) - 1,
> +		 new_dict_buf, bits_per(new_log_buf_len) - 1,
> +		 new_descs, (bits_per(new_log_buf_len) - 1) - PRB_AVGBITS);
> +
>  	log_buf_len = new_log_buf_len;
>  	log_buf = new_log_buf;
>  	new_log_buf_len = 0;
> -	free = __LOG_BUF_LEN - log_next_idx;
> -	memcpy(log_buf, __log_buf, __LOG_BUF_LEN);
> +
> +	free = __LOG_BUF_LEN;
> +	prb_for_each_record(0, &printk_rb_static, seq, &r)
> +		free -= add_to_rb(&printk_rb_dynamic, &r);
> +
> +	prb = &printk_rb_dynamic;
> +
>  	logbuf_unlock_irqrestore(flags);
>  
>  	pr_info("log_buf_len: %u bytes\n", log_buf_len);
> @@ -1285,18 +1219,18 @@ static size_t print_caller(u32 id, char *buf)
>  #define print_caller(id, buf) 0
>  #endif
>  
> -static size_t print_prefix(const struct printk_log *msg, bool syslog,
> -			   bool time, char *buf)
> +static size_t info_print_prefix(const struct printk_info *info, bool syslog,
> +				bool time, char *buf)
>  {
>  	size_t len = 0;
>  
>  	if (syslog)
> -		len = print_syslog((msg->facility << 3) | msg->level, buf);
> +		len = print_syslog((info->facility << 3) | info->level, buf);
>  
>  	if (time)
> -		len += print_time(msg->ts_nsec, buf + len);
> +		len += print_time(info->ts_nsec, buf + len);
>  
> -	len += print_caller(msg->caller_id, buf + len);
> +	len += print_caller(info->caller_id, buf + len);
>  
>  	if (IS_ENABLED(CONFIG_PRINTK_CALLER) || time) {
>  		buf[len++] = ' ';
> @@ -1306,14 +1240,15 @@ static size_t print_prefix(const struct printk_log *msg, bool syslog,
>  	return len;
>  }
>  
> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> -			     bool time, char *buf, size_t size)
> +static size_t record_print_text(const struct printk_record *r, bool syslog,
> +				bool time, char *buf, size_t size)
>  {
> -	const char *text = log_text(msg);
> -	size_t text_size = msg->text_len;
> +	const char *text = &r->text_buf[0];
> +	size_t text_size = r->info->text_len;
>  	size_t len = 0;
>  	char prefix[PREFIX_MAX];
> -	const size_t prefix_len = print_prefix(msg, syslog, time, prefix);
> +	const size_t prefix_len = info_print_prefix(r->info, syslog, time,
> +						    prefix);
>  
>  	do {
>  		const char *next = memchr(text, '\n', text_size);
> @@ -1347,10 +1282,94 @@ static size_t msg_print_text(const struct printk_log *msg, bool syslog,
>  	return len;
>  }
>  
> +static size_t record_print_text_inline(struct printk_record *r, bool syslog,
> +				       bool time)
> +{
> +	size_t text_len = r->info->text_len;
> +	size_t buf_size = r->text_buf_size;
> +	char *text = r->text_buf;
> +	char prefix[PREFIX_MAX];
> +	bool truncated = false;
> +	size_t prefix_len;
> +	size_t len = 0;
> +
> +	prefix_len = info_print_prefix(r->info, syslog, time, prefix);
> +
> +	if (!text) {
> +		/* SYSLOG_ACTION_* buffer size only calculation */
> +		unsigned int line_count = 1;
> +
> +		if (r->text_line_count)
> +			line_count = *(r->text_line_count);
> +		/*
> +		 * Each line will be preceded with a prefix. The intermediate
> +		 * newlines are already within the text, but a final trailing
> +		 * newline will be added.
> +		 */
> +		return ((prefix_len * line_count) + r->info->text_len + 1);
> +	}
> +
> +	/*
> +	 * Add the prefix for each line by shifting the rest of the text to
> +	 * make room for the prefix. If the buffer is not large enough for all
> +	 * the prefixes, then drop the trailing text and report the largest
> +	 * length that includes full lines with their prefixes.
> +	 */
> +	while (text_len) {
> +		size_t line_len;
> +		char *next;
> +
> +		next = memchr(text, '\n', text_len);
> +		if (next) {
> +			line_len = next - text;
> +		} else {
> +			/*
> +			 * If the text has been truncated, assume this line
> +			 * was truncated and do not include this text.
> +			 */
> +			if (truncated)
> +				break;
> +			line_len = text_len;
> +		}
> +
> +		/*
> +		 * Is there enough buffer available to shift this line
> +		 * (and add a newline at the end)?
> +		 */
> +		if (len + prefix_len + line_len >= buf_size)
> +			break;
> +
> +		/*
> +		 * Is there enough buffer available to shift all remaining
> +		 * text (and add a newline at the end)?
> +		 */
> +		if (len + prefix_len + text_len >= buf_size) {
> +			text_len = (buf_size - len) - prefix_len;
> +			truncated = true;
> +		}
> +
> +		memmove(text + prefix_len, text, text_len);
> +		memcpy(text, prefix, prefix_len);
> +
> +		text += prefix_len + line_len;
> +		text_len -= line_len;
> +
> +		if (text_len) {
> +			text_len--;
> +			text++;
> +		} else {
> +			*text = '\n';
> +		}
> +
> +		len += prefix_len + line_len + 1;
> +	}
> +
> +	return len;
> +}
> +
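[Editorial note: the shifting loop in record_print_text_inline() above can be modeled in userspace. This is a simplified sketch (illustrative name; the truncation path for text that exceeds the buffer is omitted), which also demonstrates the size-only estimate (prefix_len * line_count) + text_len + 1 used earlier in the function:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified model of the in-place prefix insertion: for each line,
 * shift the remaining text right by prefix_len, write the prefix in
 * front of the line, and terminate the final line with '\n'. Lines
 * that no longer fit are dropped. Returns the resulting length.
 */
static size_t prefix_lines_inline(char *text, size_t text_len,
				  size_t buf_size,
				  const char *prefix, size_t prefix_len)
{
	size_t len = 0;

	while (text_len) {
		char *next = memchr(text, '\n', text_len);
		size_t line_len = next ? (size_t)(next - text) : text_len;

		/* enough room to shift this line and append '\n'? */
		if (len + prefix_len + line_len >= buf_size)
			break;

		memmove(text + prefix_len, text, text_len);
		memcpy(text, prefix, prefix_len);

		text += prefix_len + line_len;
		text_len -= line_len;

		if (text_len) {
			text_len--;	/* step over the existing '\n' */
			text++;
		} else {
			*text = '\n';	/* add the trailing newline */
		}

		len += prefix_len + line_len + 1;
	}

	return len;
}
```

With text "a\nbb" (4 bytes, 2 lines) and prefix "P: ", the buffer becomes "P: a\nP: bb\n" and the returned length is 11, matching the estimate (3 * 2) + 4 + 1.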
>  static int syslog_print(char __user *buf, int size)
>  {
>  	char *text;
> -	struct printk_log *msg;
>  	int len = 0;
>  
>  	text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> @@ -1362,16 +1381,15 @@ static int syslog_print(char __user *buf, int size)
>  		size_t skip;
>  
>  		logbuf_lock_irq();
> -		if (syslog_seq < log_first_seq) {
> -			/* messages are gone, move to first one */
> -			syslog_seq = log_first_seq;
> -			syslog_idx = log_first_idx;
> -			syslog_partial = 0;
> -		}
> -		if (syslog_seq == log_next_seq) {
> +		if (!prb_read_valid(prb, syslog_seq, &syslog_record)) {
>  			logbuf_unlock_irq();
>  			break;
>  		}
> +		if (syslog_record.info->seq != syslog_seq) {
> +			/* messages are gone, move to first one */
> +			syslog_seq = syslog_record.info->seq;
> +			syslog_partial = 0;
> +		}
>  
>  		/*
>  		 * To keep reading/counting partial line consistent,
> @@ -1381,13 +1399,11 @@ static int syslog_print(char __user *buf, int size)
>  			syslog_time = printk_time;
>  
>  		skip = syslog_partial;
> -		msg = log_from_idx(syslog_idx);
> -		n = msg_print_text(msg, true, syslog_time, text,
> -				   LOG_LINE_MAX + PREFIX_MAX);
> +		n = record_print_text(&syslog_record, true, syslog_time, text,
> +				      LOG_LINE_MAX + PREFIX_MAX);
>  		if (n - syslog_partial <= size) {
>  			/* message fits into buffer, move forward */
> -			syslog_idx = log_next(syslog_idx);
> -			syslog_seq++;
> +			syslog_seq = syslog_record.info->seq + 1;
>  			n -= syslog_partial;
>  			syslog_partial = 0;
>  		} else if (!len){
> @@ -1420,9 +1436,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
>  {
>  	char *text;
>  	int len = 0;
> -	u64 next_seq;
>  	u64 seq;
> -	u32 idx;
>  	bool time;
>  
>  	text = kmalloc(LOG_LINE_MAX + PREFIX_MAX, GFP_KERNEL);
> @@ -1435,38 +1449,30 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
>  	 * Find first record that fits, including all following records,
>  	 * into the user-provided buffer for this dump.
>  	 */
> -	seq = clear_seq;
> -	idx = clear_idx;
> -	while (seq < log_next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> -
> -		len += msg_print_text(msg, true, time, NULL, 0);
> -		idx = log_next(idx);
> -		seq++;
> -	}
> +	prb_for_each_record(clear_seq, prb, seq, &syslog_record)
> +		len += record_print_text(&syslog_record, true, time, NULL, 0);
>  
>  	/* move first record forward until length fits into the buffer */
> -	seq = clear_seq;
> -	idx = clear_idx;
> -	while (len > size && seq < log_next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> -
> -		len -= msg_print_text(msg, true, time, NULL, 0);
> -		idx = log_next(idx);
> -		seq++;
> +	prb_for_each_record(clear_seq, prb, seq, &syslog_record) {
> +		if (len <= size)
> +			break;
> +		len -= record_print_text(&syslog_record, true, time, NULL, 0);
>  	}
>  
> -	/* last message fitting into this dump */
> -	next_seq = log_next_seq;
> -
>  	len = 0;
> -	while (len >= 0 && seq < next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> -		int textlen = msg_print_text(msg, true, time, text,
> -					     LOG_LINE_MAX + PREFIX_MAX);
> +	prb_for_each_record(seq, prb, seq, &syslog_record) {
> +		int textlen;
>  
> -		idx = log_next(idx);
> -		seq++;
> +		if (len < 0)
> +			break;
> +
> +		textlen = record_print_text(&syslog_record, true, time, text,
> +					    LOG_LINE_MAX + PREFIX_MAX);
> +
> +		if (len + textlen > size) {
> +			seq--;
> +			break;
> +		}
>  
>  		logbuf_unlock_irq();
>  		if (copy_to_user(buf + len, text, textlen))
> @@ -1474,18 +1480,10 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
>  		else
>  			len += textlen;
>  		logbuf_lock_irq();
> -
> -		if (seq < log_first_seq) {
> -			/* messages are gone, move to next one */
> -			seq = log_first_seq;
> -			idx = log_first_idx;
> -		}
>  	}
>  
> -	if (clear) {
> -		clear_seq = log_next_seq;
> -		clear_idx = log_next_idx;
> -	}
> +	if (clear)
> +		clear_seq = seq;
>  	logbuf_unlock_irq();
>  
>  	kfree(text);
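[Editorial note: the two sizing passes in syslog_print_all() above — sum everything from clear_seq, then drop the oldest records until the rest fits — reduce to this small model (illustrative names; the real code walks ringbuffer records rather than an array of lengths):]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pass 1 sums the printed length of every record; pass 2 drops the
 * oldest records until the remaining length fits the user buffer.
 * Returns the index of the first record to copy out; *out_len is the
 * length that will actually be copied.
 */
static size_t first_record_that_fits(const size_t *lens, size_t nr,
				     size_t buf_size, size_t *out_len)
{
	size_t len = 0, start = 0;
	size_t i;

	for (i = 0; i < nr; i++)		/* pass 1: total length */
		len += lens[i];

	while (len > buf_size && start < nr)	/* pass 2: drop oldest */
		len -= lens[start++];

	*out_len = len;
	return start;
}
```

For record lengths {10, 20, 30} and a 45-byte buffer, the two oldest records are dropped and 30 bytes remain to be copied.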
> @@ -1495,8 +1493,7 @@ static int syslog_print_all(char __user *buf, int size, bool clear)
>  static void syslog_clear(void)
>  {
>  	logbuf_lock_irq();
> -	clear_seq = log_next_seq;
> -	clear_idx = log_next_idx;
> +	clear_seq = prb_next_seq(prb);
>  	logbuf_unlock_irq();
>  }
>  
> @@ -1523,7 +1520,7 @@ int do_syslog(int type, char __user *buf, int len, int source)
>  		if (!access_ok(buf, len))
>  			return -EFAULT;
>  		error = wait_event_interruptible(log_wait,
> -						 syslog_seq != log_next_seq);
> +				prb_read_valid(prb, syslog_seq, NULL));
>  		if (error)
>  			return error;
>  		error = syslog_print(buf, len);
> @@ -1572,10 +1569,9 @@ int do_syslog(int type, char __user *buf, int len, int source)
>  	/* Number of chars in the log buffer */
>  	case SYSLOG_ACTION_SIZE_UNREAD:
>  		logbuf_lock_irq();
> -		if (syslog_seq < log_first_seq) {
> +		if (syslog_seq < prb_first_seq(prb)) {
>  			/* messages are gone, move to first one */
> -			syslog_seq = log_first_seq;
> -			syslog_idx = log_first_idx;
> +			syslog_seq = prb_first_seq(prb);
>  			syslog_partial = 0;
>  		}
>  		if (source == SYSLOG_FROM_PROC) {
> @@ -1584,20 +1580,17 @@ int do_syslog(int type, char __user *buf, int len, int source)
>  			 * for pending data, not the size; return the count of
>  			 * records, not the length.
>  			 */
> -			error = log_next_seq - syslog_seq;
> +			error = prb_next_seq(prb) - syslog_seq;
>  		} else {
> -			u64 seq = syslog_seq;
> -			u32 idx = syslog_idx;
>  			bool time = syslog_partial ? syslog_time : printk_time;
> +			u64 seq;
>  
> -			while (seq < log_next_seq) {
> -				struct printk_log *msg = log_from_idx(idx);
> -
> -				error += msg_print_text(msg, true, time, NULL,
> -							0);
> +			prb_for_each_record(syslog_seq, prb, seq,
> +					    &syslog_record) {
> +				error += record_print_text(&syslog_record,
> +							   true, time,
> +							   NULL, 0);
>  				time = printk_time;
> -				idx = log_next(idx);
> -				seq++;
>  			}
>  			error -= syslog_partial;
>  		}
> @@ -1958,7 +1951,6 @@ asmlinkage int vprintk_emit(int facility, int level,
>  	int printed_len;
>  	bool in_sched = false, pending_output;
>  	unsigned long flags;
> -	u64 curr_log_seq;
>  
>  	/* Suppress unimportant messages after panic happens */
>  	if (unlikely(suppress_printk))
> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>  
>  	/* This stops the holder of console_sem just where we want him */
>  	logbuf_lock_irqsave(flags);
> -	curr_log_seq = log_next_seq;
> +	pending_output = !prb_read_valid(prb, console_seq, NULL);
>  	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> -	pending_output = (curr_log_seq != log_next_seq);
> +	pending_output &= prb_read_valid(prb, console_seq, NULL);
>  	logbuf_unlock_irqrestore(flags);
>  
>  	/* If called from the scheduler, we can not call up(). */
> @@ -2066,21 +2058,30 @@ EXPORT_SYMBOL(printk);
>  #define PREFIX_MAX		0
>  #define printk_time		false
>  
> +#define prb_read_valid(rb, seq, r)	false
> +#define prb_first_seq(rb)		0
> +
>  static u64 syslog_seq;
> -static u32 syslog_idx;
>  static u64 console_seq;
> -static u32 console_idx;
>  static u64 exclusive_console_stop_seq;
> -static u64 log_first_seq;
> -static u32 log_first_idx;
> -static u64 log_next_seq;
> -static char *log_text(const struct printk_log *msg) { return NULL; }
> -static char *log_dict(const struct printk_log *msg) { return NULL; }
> -static struct printk_log *log_from_idx(u32 idx) { return NULL; }
> -static u32 log_next(u32 idx) { return 0; }
> -static ssize_t msg_print_ext_header(char *buf, size_t size,
> -				    struct printk_log *msg,
> -				    u64 seq) { return 0; }
> +struct printk_record console_record;
> +
> +static size_t record_print_text(const struct printk_record *r, bool syslog,
> +				bool time, char *buf,
> +				size_t size)
> +{
> +	return 0;
> +}
> +static size_t record_print_text_inline(const struct printk_record *r,
> +				       bool syslog, bool time)
> +{
> +	return 0;
> +}
> +static ssize_t info_print_ext_header(char *buf, size_t size,
> +				     struct printk_info *info)
> +{
> +	return 0;
> +}
>  static ssize_t msg_print_ext_body(char *buf, size_t size,
>  				  char *dict, size_t dict_len,
>  				  char *text, size_t text_len) { return 0; }
> @@ -2088,8 +2089,6 @@ static void console_lock_spinning_enable(void) { }
>  static int console_lock_spinning_disable_and_check(void) { return 0; }
>  static void call_console_drivers(const char *ext_text, size_t ext_len,
>  				 const char *text, size_t len) {}
> -static size_t msg_print_text(const struct printk_log *msg, bool syslog,
> -			     bool time, char *buf, size_t size) { return 0; }
>  static bool suppress_message_printing(int level) { return false; }
>  
>  #endif /* CONFIG_PRINTK */
> @@ -2406,35 +2405,28 @@ void console_unlock(void)
>  	}
>  
>  	for (;;) {
> -		struct printk_log *msg;
>  		size_t ext_len = 0;
> -		size_t len;
> +		size_t len = 0;
>  
>  		printk_safe_enter_irqsave(flags);
>  		raw_spin_lock(&logbuf_lock);
> -		if (console_seq < log_first_seq) {
> +skip:
> +		if (!prb_read_valid(prb, console_seq, &console_record))
> +			break;
> +
> +		if (console_seq < console_record.info->seq) {
>  			len = sprintf(text,
>  				      "** %llu printk messages dropped **\n",
> -				      log_first_seq - console_seq);
> -
> -			/* messages are gone, move to first one */
> -			console_seq = log_first_seq;
> -			console_idx = log_first_idx;
> -		} else {
> -			len = 0;
> +				      console_record.info->seq - console_seq);
>  		}
> -skip:
> -		if (console_seq == log_next_seq)
> -			break;
> +		console_seq = console_record.info->seq;
>  
> -		msg = log_from_idx(console_idx);
> -		if (suppress_message_printing(msg->level)) {
> +		if (suppress_message_printing(console_record.info->level)) {
>  			/*
>  			 * Skip record we have buffered and already printed
>  			 * directly to the console when we received it, and
>  			 * record that has level above the console loglevel.
>  			 */
> -			console_idx = log_next(console_idx);
>  			console_seq++;
>  			goto skip;
>  		}
> @@ -2445,19 +2437,20 @@ void console_unlock(void)
>  			exclusive_console = NULL;
>  		}
>  
> -		len += msg_print_text(msg,
> +		len += record_print_text(&console_record,
>  				console_msg_format & MSG_FORMAT_SYSLOG,
>  				printk_time, text + len, sizeof(text) - len);
>  		if (nr_ext_console_drivers) {
> -			ext_len = msg_print_ext_header(ext_text,
> +			ext_len = info_print_ext_header(ext_text,
>  						sizeof(ext_text),
> -						msg, console_seq);
> +						console_record.info);
>  			ext_len += msg_print_ext_body(ext_text + ext_len,
>  						sizeof(ext_text) - ext_len,
> -						log_dict(msg), msg->dict_len,
> -						log_text(msg), msg->text_len);
> +						&console_record.dict_buf[0],
> +						console_record.info->dict_len,
> +						&console_record.text_buf[0],
> +						console_record.info->text_len);
>  		}
> -		console_idx = log_next(console_idx);
>  		console_seq++;
>  		raw_spin_unlock(&logbuf_lock);
>  
> @@ -2497,7 +2490,7 @@ void console_unlock(void)
>  	 * flush, no worries.
>  	 */
>  	raw_spin_lock(&logbuf_lock);
> -	retry = console_seq != log_next_seq;
> +	retry = prb_read_valid(prb, console_seq, NULL);
>  	raw_spin_unlock(&logbuf_lock);
>  	printk_safe_exit_irqrestore(flags);
>  
> @@ -2566,8 +2559,7 @@ void console_flush_on_panic(enum con_flush_mode mode)
>  		unsigned long flags;
>  
>  		logbuf_lock_irqsave(flags);
> -		console_seq = log_first_seq;
> -		console_idx = log_first_idx;
> +		console_seq = prb_first_seq(prb);
>  		logbuf_unlock_irqrestore(flags);
>  	}
>  	console_unlock();
> @@ -2770,8 +2762,6 @@ void register_console(struct console *newcon)
>  		 * for us.
>  		 */
>  		logbuf_lock_irqsave(flags);
> -		console_seq = syslog_seq;
> -		console_idx = syslog_idx;
>  		/*
>  		 * We're about to replay the log buffer.  Only do this to the
>  		 * just-registered console to avoid excessive message spam to
> @@ -2783,6 +2773,7 @@ void register_console(struct console *newcon)
>  		 */
>  		exclusive_console = newcon;
>  		exclusive_console_stop_seq = console_seq;
> +		console_seq = syslog_seq;
>  		logbuf_unlock_irqrestore(flags);
>  	}
>  	console_unlock();
> @@ -3127,9 +3118,7 @@ void kmsg_dump(enum kmsg_dump_reason reason)
>  
>  		logbuf_lock_irqsave(flags);
>  		dumper->cur_seq = clear_seq;
> -		dumper->cur_idx = clear_idx;
> -		dumper->next_seq = log_next_seq;
> -		dumper->next_idx = log_next_idx;
> +		dumper->next_seq = prb_next_seq(prb);
>  		logbuf_unlock_irqrestore(flags);
>  
>  		/* invoke dumper which will iterate over records */
> @@ -3163,28 +3152,29 @@ void kmsg_dump(enum kmsg_dump_reason reason)
>  bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
>  			       char *line, size_t size, size_t *len)
>  {
> -	struct printk_log *msg;
> +	struct printk_info info;
> +	struct printk_record r = {
> +		.info = &info,
> +		.text_buf = line,
> +		.text_buf_size = size,
> +	};
> +	unsigned int line_count;
>  	size_t l = 0;
>  	bool ret = false;
>  
>  	if (!dumper->active)
>  		goto out;
>  
> -	if (dumper->cur_seq < log_first_seq) {
> -		/* messages are gone, move to first available one */
> -		dumper->cur_seq = log_first_seq;
> -		dumper->cur_idx = log_first_idx;
> -	}
> +	/* Count text lines instead of reading text? */
> +	if (!line)
> +		r.text_line_count = &line_count;
>  
> -	/* last entry */
> -	if (dumper->cur_seq >= log_next_seq)
> +	if (!prb_read_valid(prb, dumper->cur_seq, &r))
>  		goto out;
>  
> -	msg = log_from_idx(dumper->cur_idx);
> -	l = msg_print_text(msg, syslog, printk_time, line, size);
> +	l = record_print_text_inline(&r, syslog, printk_time);
>  
> -	dumper->cur_idx = log_next(dumper->cur_idx);
> -	dumper->cur_seq++;
> +	dumper->cur_seq = r.info->seq + 1;
>  	ret = true;
>  out:
>  	if (len)
> @@ -3245,23 +3235,27 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_line);
>  bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
>  			  char *buf, size_t size, size_t *len)
>  {
> +	struct printk_info info;
> +	unsigned int line_count;
> +	/* initially, only count text lines */
> +	struct printk_record r = {
> +		.info = &info,
> +		.text_line_count = &line_count,
> +	};
>  	unsigned long flags;
>  	u64 seq;
> -	u32 idx;
>  	u64 next_seq;
> -	u32 next_idx;
>  	size_t l = 0;
>  	bool ret = false;
>  	bool time = printk_time;
>  
> -	if (!dumper->active)
> +	if (!dumper->active || !buf || !size)
>  		goto out;
>  
>  	logbuf_lock_irqsave(flags);
> -	if (dumper->cur_seq < log_first_seq) {
> +	if (dumper->cur_seq < prb_first_seq(prb)) {
>  		/* messages are gone, move to first available one */
> -		dumper->cur_seq = log_first_seq;
> -		dumper->cur_idx = log_first_idx;
> +		dumper->cur_seq = prb_first_seq(prb);
>  	}
>  
>  	/* last entry */
> @@ -3272,41 +3266,43 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
>  
>  	/* calculate length of entire buffer */
>  	seq = dumper->cur_seq;
> -	idx = dumper->cur_idx;
> -	while (seq < dumper->next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> -
> -		l += msg_print_text(msg, true, time, NULL, 0);
> -		idx = log_next(idx);
> -		seq++;
> +	while (prb_read_valid(prb, seq, &r)) {
> +		if (r.info->seq >= dumper->next_seq)
> +			break;
> +		l += record_print_text_inline(&r, true, time);
> +		seq = r.info->seq + 1;
>  	}
>  
>  	/* move first record forward until length fits into the buffer */
>  	seq = dumper->cur_seq;
> -	idx = dumper->cur_idx;
> -	while (l >= size && seq < dumper->next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> -
> -		l -= msg_print_text(msg, true, time, NULL, 0);
> -		idx = log_next(idx);
> -		seq++;
> +	while (l >= size && prb_read_valid(prb, seq, &r)) {
> +		if (r.info->seq >= dumper->next_seq)
> +			break;
> +		l -= record_print_text_inline(&r, true, time);
> +		seq = r.info->seq + 1;
>  	}
>  
>  	/* last message in next iteration */
>  	next_seq = seq;
> -	next_idx = idx;
> +
> +	/* actually read data into the buffer now */
> +	r.text_buf = buf;
> +	r.text_buf_size = size;
> +	r.text_line_count = NULL;
>  
>  	l = 0;
> -	while (seq < dumper->next_seq) {
> -		struct printk_log *msg = log_from_idx(idx);
> +	while (prb_read_valid(prb, seq, &r)) {
> +		if (r.info->seq >= dumper->next_seq)
> +			break;
> +
> +		l += record_print_text_inline(&r, syslog, time);
> +		r.text_buf = buf + l;
> +		r.text_buf_size = size - l;
>  
> -		l += msg_print_text(msg, syslog, time, buf + l, size - l);
> -		idx = log_next(idx);
> -		seq++;
> +		seq = r.info->seq + 1;
>  	}
>  
>  	dumper->next_seq = next_seq;
> -	dumper->next_idx = next_idx;
>  	ret = true;
>  	logbuf_unlock_irqrestore(flags);
>  out:
> @@ -3329,9 +3325,7 @@ EXPORT_SYMBOL_GPL(kmsg_dump_get_buffer);
>  void kmsg_dump_rewind_nolock(struct kmsg_dumper *dumper)
>  {
>  	dumper->cur_seq = clear_seq;
> -	dumper->cur_idx = clear_idx;
> -	dumper->next_seq = log_next_seq;
> -	dumper->next_idx = log_next_idx;
> +	dumper->next_seq = prb_next_seq(prb);
>  }
>  
>  /**
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-14 13:29   ` lijiang
@ 2020-02-14 13:50     ` John Ogness
  2020-02-15  4:15       ` lijiang
  2020-02-17 15:40       ` crashdump: " Petr Mladek
  0 siblings, 2 replies; 58+ messages in thread
From: John Ogness @ 2020-02-14 13:50 UTC (permalink / raw)
  To: lijiang
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

Hi Lianbo,

On 2020-02-14, lijiang <lijiang@redhat.com> wrote:
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index 1ef6f75d92f1..d0d24ee1d1f4 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>>  {
>>  	VMCOREINFO_SYMBOL(log_buf);
>>  	VMCOREINFO_SYMBOL(log_buf_len);
>
> I notice that the "prb" (printk_rb_static) symbol is not exported into
> vmcoreinfo as follows:
>
> +	VMCOREINFO_SYMBOL(prb);
>
> Should the "prb" (printk_rb_static) symbol be exported into vmcoreinfo?
> Otherwise, do you happen to know how to walk through the log_buf and
> get all kernel logs from a vmcore?

You are correct. This will need to be exported as well so that the
descriptors can be accessed. (log_buf is only the pure human-readable
text.) I am currently hacking the crash tool to see exactly what needs
to be made available in order to access all the data of the ringbuffer.

John Ogness


* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-05 15:48               ` John Ogness
                                   ` (3 preceding siblings ...)
  2020-02-07  1:40                 ` Steven Rostedt
@ 2020-02-14 15:56                 ` Petr Mladek
  2020-02-17 11:13                   ` John Ogness
  4 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-14 15:56 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Wed 2020-02-05 16:48:32, John Ogness wrote:
> On 2020-02-05, Sergey Senozhatsky <sergey.senozhatsky@gmail.com> wrote:
> > 3BUG: KASAN: wild-memory-access in copy_data+0x129/0x220>
> > 3Write of size 4 at addr 5a5a5a5a5a5a5a5a by task cat/474>
> 
> The problem was due to an uninitialized pointer.
> 
> Very recently the ringbuffer API was expanded so that it could
> optionally count lines in a record. This made it possible for me to
> implement record_print_text_inline(), which can do all the kmsg_dump
> multi-line madness without requiring a temporary buffer. Rather than
> passing an extra argument around for the optional line count, I added
> the text_line_count pointer to the printk_record struct. And since line
> counting is rarely needed, it is only performed if text_line_count is
> non-NULL.
> 
> I overlooked that devkmsg_open() sets up a printk_record and so I did
> not add the extra NULL initialization of text_line_count. There should
> be an initializer function/macro to avoid this danger.
> 
> John Ogness
> 
> The quick fixup:
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index d0d24ee1d1f4..5ad67ff60cd9 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>  	user->record.text_buf_size = sizeof(user->text_buf);
>  	user->record.dict_buf = &user->dict_buf[0];
>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> +	user->record.text_line_count = NULL;

The NULL pointer hidden in the structure also complicates reading the
code. It is less obvious when the same function is called only to get
the size/count and when it is called to get real data.

I played with it and created an extra function to get this information.

In addition, I had problems following the code in
record_print_text_inline(). So I tried to reuse the new function
and the existing record_print_text() there.

Please find below the patch that I ended up with. I booted a system
with this patch, but I suspect that I did not actually exercise
record_print_text_inline(). So it might be buggy.

Anyway, I wonder what you think about it:

From 383e608f41a2f44898e4cd0751c5ccc18c82f71e Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek@suse.com>
Date: Fri, 14 Feb 2020 16:14:18 +0100
Subject: [PATCH] printk: Alternative approach for inline dumping

line_count in struct printk_record looks a bit error-prone. It causes
a system crash when people forget to initialize it. It seems better
to read this information via a separate API, for example,
prb_read_valid_info().

record_print_text_inline() is really complicated[*]. It is yet
another variant of the tricky logic used in record_print_text().
It would be great to actually reuse the existing function.

[*] I know that you created it at my request.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/printk/printk.c            | 134 +++++++++++++-------------------------
 kernel/printk/printk_ringbuffer.c |  55 +++++++++-------
 kernel/printk/printk_ringbuffer.h |   7 +-
 3 files changed, 84 insertions(+), 112 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5ad67ff60cd9..6b7d6716b178 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -883,7 +883,6 @@ static int devkmsg_open(struct inode *inode, struct file *file)
 	user->record.text_buf_size = sizeof(user->text_buf);
 	user->record.dict_buf = &user->dict_buf[0];
 	user->record.dict_buf_size = sizeof(user->dict_buf);
-	user->record.text_line_count = NULL;
 
 	logbuf_lock_irq();
 	user->seq = prb_first_seq(prb);
@@ -1283,87 +1282,50 @@ static size_t record_print_text(const struct printk_record *r, bool syslog,
 	return len;
 }
 
-static size_t record_print_text_inline(struct printk_record *r, bool syslog,
-				       bool time)
+static size_t
+get_record_text_size(struct printk_info *info, unsigned int line_count,
+			   bool syslog, bool time)
 {
-	size_t text_len = r->info->text_len;
-	size_t buf_size = r->text_buf_size;
-	char *text = r->text_buf;
-	char prefix[PREFIX_MAX];
-	bool truncated = false;
 	size_t prefix_len;
-	size_t len = 0;
 
-	prefix_len = info_print_prefix(r->info, syslog, time, prefix);
-
-	if (!text) {
-		/* SYSLOG_ACTION_* buffer size only calculation */
-		unsigned int line_count = 1;
-
-		if (r->text_line_count)
-			line_count = *(r->text_line_count);
-		/*
-		 * Each line will be preceded with a prefix. The intermediate
-		 * newlines are already within the text, but a final trailing
-		 * newline will be added.
-		 */
-		return ((prefix_len * line_count) + r->info->text_len + 1);
-	}
+	prefix_len = info_print_prefix(info, syslog, time, NULL);
 
 	/*
-	 * Add the prefix for each line by shifting the rest of the text to
-	 * make room for the prefix. If the buffer is not large enough for all
-	 * the prefixes, then drop the trailing text and report the largest
-	 * length that includes full lines with their prefixes.
+	 * Each line will be preceded with a prefix. The intermediate
+	 * newlines are already within the text, but a final trailing
+	 * newline will be added.
 	 */
-	while (text_len) {
-		size_t line_len;
-		char *next;
-
-		next = memchr(text, '\n', text_len);
-		if (next) {
-			line_len = next - text;
-		} else {
-			/*
-			 * If the text has been truncated, assume this line
-			 * was truncated and do not include this text.
-			 */
-			if (truncated)
-				break;
-			line_len = text_len;
-		}
+	return ((prefix_len * line_count) + info->text_len + 1);
+}
 
-		/*
-		 * Is there enough buffer available to shift this line
-		 * (and add a newline at the end)?
-		 */
-		if (len + prefix_len + line_len >= buf_size)
-			break;
+static size_t record_print_text_inline(struct printk_record *r, bool syslog,
+				       bool time)
+{
+	size_t text_len = r->info->text_len;
+	size_t text_buf_size = r->text_buf_size;
+	struct printk_info *info = r->info;
+	size_t record_len;
+	char *text = r->text_buf;
+	char *text_moved;
+	unsigned int line_count;
+	size_t len = 0;
 
-		/*
-		 * Is there enough buffer available to shift all remaining
-		 * text (and add a newline at the end)?
-		 */
-		if (len + prefix_len + text_len >= buf_size) {
-			text_len = (buf_size - len) - prefix_len;
-			truncated = true;
-		}
+	if (!text)
+		return 0;
 
-		memmove(text + prefix_len, text, text_len);
-		memcpy(text, prefix, prefix_len);
+	line_count = prb_count_lines(text, text_len);
+	record_len = get_record_text_size(info, line_count, syslog, time);
 
-		text += prefix_len + line_len;
-		text_len -= line_len;
+	if (text_buf_size < record_len)
+		return 0;
 
-		if (text_len) {
-			text_len--;
-			text++;
-		} else {
-			*text = '\n';
-		}
+	/* Make space for timestamps */
+	text_moved = text + (record_len - text_len);
+	memmove(text_moved, text, text_len);
 
-		len += prefix_len + line_len + 1;
-	}
+	r->text_buf = text_moved;
+	len = record_print_text(r, syslog, time, text, text_buf_size);
+	r->text_buf = text;
 
 	return len;
 }
@@ -3167,13 +3129,15 @@ bool kmsg_dump_get_line_nolock(struct kmsg_dumper *dumper, bool syslog,
 		goto out;
 
 	/* Count text lines instead of reading text? */
-	if (!line)
-		r.text_line_count = &line_count;
-
-	if (!prb_read_valid(prb, dumper->cur_seq, &r))
-		goto out;
-
-	l = record_print_text_inline(&r, syslog, printk_time);
+	if (!line) {
+		if (!prb_read_valid_info(prb, dumper->cur_seq, &info, &line_count))
+			goto out;
+		l = get_record_text_size(&info, line_count, syslog, printk_time);
+	} else {
+		if (!prb_read_valid(prb, dumper->cur_seq, &r))
+			goto out;
+		l = record_print_text_inline(&r, syslog, printk_time);
+	}
 
 	dumper->cur_seq = r.info->seq + 1;
 	ret = true;
@@ -3241,7 +3205,8 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 	/* initially, only count text lines */
 	struct printk_record r = {
 		.info = &info,
-		.text_line_count = &line_count,
+		.text_buf = buf,
+		.text_buf_size = size,
 	};
 	unsigned long flags;
 	u64 seq;
@@ -3267,30 +3232,25 @@ bool kmsg_dump_get_buffer(struct kmsg_dumper *dumper, bool syslog,
 
 	/* calculate length of entire buffer */
 	seq = dumper->cur_seq;
-	while (prb_read_valid(prb, seq, &r)) {
+	while (prb_read_valid_info(prb, seq, &info, &line_count)) {
 		if (r.info->seq >= dumper->next_seq)
 			break;
-		l += record_print_text_inline(&r, true, time);
+		l += get_record_text_size(&info, line_count, true, time);
 		seq = r.info->seq + 1;
 	}
 
 	/* move first record forward until length fits into the buffer */
 	seq = dumper->cur_seq;
-	while (l >= size && prb_read_valid(prb, seq, &r)) {
+	while (l >= size && prb_read_valid_info(prb, seq, &info, &line_count)) {
 		if (r.info->seq >= dumper->next_seq)
 			break;
-		l -= record_print_text_inline(&r, true, time);
+		l -= get_record_text_size(&info, line_count, true, time);
 		seq = r.info->seq + 1;
 	}
 
 	/* last message in next interation */
 	next_seq = seq;
 
-	/* actually read data into the buffer now */
-	r.text_buf = buf;
-	r.text_buf_size = size;
-	r.text_line_count = NULL;
-
 	l = 0;
 	while (prb_read_valid(prb, seq, &r)) {
 		if (r.info->seq >= dumper->next_seq)
diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 796257f226ee..69976a49f828 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -893,7 +893,6 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
 		r->dict_buf_size = 0;
 
 	r->info = &d->info;
-	r->text_line_count = NULL;
 
 	/* Set default values for the sizes. */
 	d->info.text_len = r->text_buf_size;
@@ -1002,6 +1001,21 @@ static char *get_data(struct prb_data_ring *data_ring,
 	return &db->data[0];
 }
 
+unsigned long prb_count_lines(char *text, unsigned int text_size)
+{
+	unsigned int line_count;
+	char *next;
+
+	line_count = 1;
+	while ((next = memchr(text, '\n', text_size)) != NULL) {
+		text_size -= (next - text);
+		text = next;
+		line_count++;
+	}
+
+	return line_count;
+}
+
 /*
  * Given @blk_lpos, copy an expected @len of data into the provided buffer.
  * If @line_count is provided, count the number of lines in the data.
@@ -1034,21 +1048,8 @@ static bool copy_data(struct prb_data_ring *data_ring,
 	}
 
 	/* Caller interested in the line count? */
-	if (line_count) {
-		unsigned long next_size = data_size;
-		char *next = data;
-
-		*line_count = 0;
-
-		while (next_size) {
-			(*line_count)++;
-			next = memchr(next, '\n', next_size);
-			if (!next)
-				break;
-			next++;
-			next_size = data_size - (next - data);
-		}
-	}
+	if (line_count)
+		*line_count = prb_count_lines(data, data_size);
 
 	/* Caller interested in the data content? */
 	if (!buf || !buf_size)
@@ -1094,7 +1095,7 @@ static int desc_read_committed(struct prb_desc_ring *desc_ring,
  * See desc_read_committed() for error return values.
  */
 static int prb_read(struct printk_ringbuffer *rb, u64 seq,
-		    struct printk_record *r)
+		    struct printk_record *r, unsigned int *line_count)
 {
 	struct prb_desc_ring *desc_ring = &rb->desc_ring;
 	struct prb_desc *rdesc = to_desc(desc_ring, seq);
@@ -1121,7 +1122,7 @@ static int prb_read(struct printk_ringbuffer *rb, u64 seq,
 	/* Copy text data. If it fails, this is a data-less descriptor. */
 	if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos,
 		       desc.info.text_len, r->text_buf, r->text_buf_size,
-		       r->text_line_count)) {
+		       line_count)) {
 		return -ENOENT;
 	}
 
@@ -1212,12 +1213,12 @@ EXPORT_SYMBOL(prb_first_seq);
  * See the description of prb_read_valid() for details.
  */
 bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
-		     struct printk_record *r)
+		     struct printk_record *r, unsigned int *line_count)
 {
 	u64 tail_seq;
 	int err;
 
-	while ((err = prb_read(rb, *seq, r))) {
+	while ((err = prb_read(rb, *seq, r, line_count))) {
 		tail_seq = prb_first_seq(rb);
 
 		if (*seq < tail_seq) {
@@ -1264,10 +1265,20 @@ bool _prb_read_valid(struct printk_ringbuffer *rb, u64 *seq,
 bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
 		    struct printk_record *r)
 {
-	return _prb_read_valid(rb, &seq, r);
+	return _prb_read_valid(rb, &seq, r, NULL);
 }
 EXPORT_SYMBOL(prb_read_valid);
 
+bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
+			 struct printk_info *info, unsigned int *line_count)
+{
+	struct printk_record r = {
+		.info = info,
+	};
+
+	return _prb_read_valid(rb, &seq, &r, line_count);
+}
+
 /**
  * prb_next_seq() - Get the sequence number after the last available record.
  *
@@ -1287,7 +1298,7 @@ u64 prb_next_seq(struct printk_ringbuffer *rb)
 
 	do {
 		/* Search forward from the oldest descriptor. */
-		if (!_prb_read_valid(rb, &seq, NULL))
+		if (!_prb_read_valid(rb, &seq, NULL, NULL))
 			return seq;
 		seq++;
 	} while (seq);
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index 4dc428427e7f..005b000fdb5b 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -28,8 +28,6 @@ struct printk_info {
  * the reader provides the @info, @text_buf, @dict_buf buffers. On success,
  * the struct pointed to by @info will be filled and the char arrays pointed
  * to by @text_buf and @dict_buf will be filled with text and dict data.
- * If @text_line_count is provided, the number of lines in @text_buf will
- * be counted.
  */
 struct printk_record {
 	struct printk_info	*info;
@@ -37,7 +35,6 @@ struct printk_record {
 	char			*dict_buf;
 	unsigned int		text_buf_size;
 	unsigned int		dict_buf_size;
-	unsigned int		*text_line_count;
 };
 
 /* Specifies the position/span of a data block. */
@@ -288,6 +285,8 @@ struct printk_record name = {				\
 	.dict_buf_size	= buf_size,			\
 }
 
+unsigned long prb_count_lines(char *text, unsigned int text_size);
+
 /* Writer Interface */
 
 bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
@@ -304,6 +303,8 @@ unsigned int prb_record_text_space(struct prb_reserved_entry *e);
 
 bool prb_read_valid(struct printk_ringbuffer *rb, u64 seq,
 		    struct printk_record *r);
+bool prb_read_valid_info(struct printk_ringbuffer *rb, u64 seq,
+			 struct printk_info *info, unsigned int *line_count);
 
 u64 prb_first_seq(struct printk_ringbuffer *rb);
 u64 prb_next_seq(struct printk_ringbuffer *rb);
-- 
2.16.4



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-14 13:50     ` John Ogness
@ 2020-02-15  4:15       ` lijiang
  2020-02-17 15:40       ` crashdump: " Petr Mladek
  1 sibling, 0 replies; 58+ messages in thread
From: lijiang @ 2020-02-15  4:15 UTC (permalink / raw)
  To: John Ogness
  Cc: Petr Mladek, Peter Zijlstra, Sergey Senozhatsky,
	Sergey Senozhatsky, Steven Rostedt, Linus Torvalds,
	Greg Kroah-Hartman, Andrea Parri, Thomas Gleixner, kexec,
	linux-kernel

On 2020-02-14 21:50, John Ogness wrote:
> Hi Lianbo,
> 
> On 2020-02-14, lijiang <lijiang@redhat.com> wrote:
>>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>>> index 1ef6f75d92f1..d0d24ee1d1f4 100644
>>> --- a/kernel/printk/printk.c
>>> +++ b/kernel/printk/printk.c
>>> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
>>>  {
>>>  	VMCOREINFO_SYMBOL(log_buf);
>>>  	VMCOREINFO_SYMBOL(log_buf_len);
>>
>> I notice that the "prb" (printk_rb_static) symbol is not exported into
>> vmcoreinfo as follows:
>>
>> +	VMCOREINFO_SYMBOL(prb);
>>
>> Should the "prb" (printk_rb_static) symbol be exported into vmcoreinfo?
>> Otherwise, do you happen to know how to walk through the log_buf and
>> get all kernel logs from vmcore?
> 
> You are correct. This will need to be exported as well so that the
> descriptors can be accessed. (log_buf is only the pure human-readable

I really agree, and I guess that more structures and their offsets may
need to be exported, for example: struct prb_desc_ring,
struct prb_data_ring, struct prb_desc, etc.

This makes sure that tools (such as makedumpfile and crash) can
appropriately access them.
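
For illustration, the exports could grow along these lines (a sketch
only: the field names follow the struct definitions in patch 1, but the
exact set is an assumption until the crash tool work settles):

```c
/* Sketch: possible additions to log_buf_vmcoreinfo_setup(). */
VMCOREINFO_SYMBOL(prb);
VMCOREINFO_STRUCT_SIZE(printk_ringbuffer);
VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring);
VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring);
VMCOREINFO_STRUCT_SIZE(prb_desc_ring);
VMCOREINFO_OFFSET(prb_desc_ring, count_bits);
VMCOREINFO_OFFSET(prb_desc_ring, descs);
VMCOREINFO_STRUCT_SIZE(prb_desc);
VMCOREINFO_OFFSET(prb_desc, info);
VMCOREINFO_STRUCT_SIZE(prb_data_ring);
VMCOREINFO_OFFSET(prb_data_ring, size_bits);
VMCOREINFO_OFFSET(prb_data_ring, data);
```

With the sizes and offsets available, the dump tools would not need to
hard-code the struct layout and could follow future layout changes.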

> text.) I am currently hacking the crash tool to see exactly what needs
> to be made available in order to access all the data of the ringbuffer.
> 
It makes sense and avoids exporting unnecessary symbols and offsets.

Thanks.
Lianbo


> John Ogness
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-14 15:56                 ` Petr Mladek
@ 2020-02-17 11:13                   ` John Ogness
  2020-02-17 14:50                     ` Petr Mladek
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-02-17 11:13 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-14, Petr Mladek <pmladek@suse.com> wrote:
>> I overlooked that devkmsg_open() sets up a printk_record and so I did
>> not add the extra NULL initialization of text_line_count. There should
>> be an initializer function/macro to avoid this danger.
>> 
>> John Ogness
>> 
>> The quick fixup:
>> 
>> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> index d0d24ee1d1f4..5ad67ff60cd9 100644
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
>>  	user->record.text_buf_size = sizeof(user->text_buf);
>>  	user->record.dict_buf = &user->dict_buf[0];
>>  	user->record.dict_buf_size = sizeof(user->dict_buf);
>> +	user->record.text_line_count = NULL;
>
> The NULL pointer hidden in the structure also complicates reading the
> code. It is less obvious when the same function is called only to get
> the size/count and when it is called to get real data.

OK.

> I played with it and created an extra function to get this information.
>
> In addition, I had problems following the code in
> record_print_text_inline(). So I tried to reuse the new function
> and the existing record_print_text() there.
>
> Please find below the patch that I ended up with. I booted a system
> with this patch, but I suspect that I did not actually exercise
> record_print_text_inline(). So it might be buggy.

Yes, there are several bugs. But I see where you want to go with this:

- introduce prb_count_lines() to handle line counting

- introduce prb_read_valid_info() for only reading meta-data and getting
  the line count

- also use prb_count_lines() internally

I will include these changes in v2. I will still introduce the static
inlines to initialize records because readers and writers do it
differently.

Thanks.

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-01-28 16:19 ` [PATCH 2/2] printk: use the lockless ringbuffer John Ogness
  2020-02-13  9:07   ` Sergey Senozhatsky
  2020-02-14 13:29   ` lijiang
@ 2020-02-17 14:41   ` Petr Mladek
  2020-02-25 20:11     ` John Ogness
  2 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-17 14:41 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Tue 2020-01-28 17:25:48, John Ogness wrote:
> Replace the existing ringbuffer usage and implementation with
> lockless ringbuffer usage. Even though the new ringbuffer does not
> require locking, all existing locking is left in place. Therefore,
> this change is purely replacing the underlying ringbuffer.
> 
> - Record meta-data is now stored in a separate array of descriptors.
>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>   the dynamic array.

It might help to show some examples, i.e. to mention the concrete sizes
when CONFIG_LOG_BUF_SHIFT is 12 or so.


> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -294,30 +295,22 @@ enum con_msg_format_flags {
>  static int console_msg_format = MSG_FORMAT_DEFAULT;
>  
>  /*
> - * The printk log buffer consists of a chain of concatenated variable
> - * length records. Every record starts with a record header, containing
> - * the overall length of the record.
> + * The printk log buffer consists of a sequenced collection of records, each
> + * containing variable length message and dictionary text. Every record
> + * also contains its own meta-data (@info).
>   *
> - * The heads to the first and last entry in the buffer, as well as the
> - * sequence numbers of these entries are maintained when messages are
> - * stored.
> - *
> - * If the heads indicate available messages, the length in the header
> - * tells the start next message. A length == 0 for the next message
> - * indicates a wrap-around to the beginning of the buffer.
> - *
> - * Every record carries the monotonic timestamp in microseconds, as well as
> - * the standard userspace syslog level and syslog facility. The usual
> + * Every record meta-data carries the monotonic timestamp in microseconds, as

I am afraid that we cannot guarantee a monotonic timestamp because
the writers are not synchronized. I hope that it will not create
real problems and that we can just remove the word "monotonic" ;-)


>  /* record buffer */
> -#define LOG_ALIGN __alignof__(struct printk_log)
> +#define LOG_ALIGN __alignof__(unsigned long)
>  #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
>  #define LOG_BUF_LEN_MAX (u32)(1 << 31)
>  static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
>  static char *log_buf = __log_buf;
>  static u32 log_buf_len = __LOG_BUF_LEN;
>  
> +/*
> + * Define the average message size. This only affects the number of
> + * descriptors that will be available. Underestimating is better than
> + * overestimating (too many available descriptors is better than not enough).
> + * The dictionary buffer will be the same size as the text buffer.
> + */
> +#define PRB_AVGBITS 6

Do I get it correctly that '6' means 2^6 = 64 characters?

Some ugly counting on my test systems shows an average of 49 chars:

$> dmesg | cut -d ']' -f 2- | wc -c
30172
$> dmesg | cut -d ']' -f 2- | wc -l
612
$> echo $((30172 / 612))
49

If I get it correctly, then a lower number is the safer side.
So, a safer default would be 5?


> +
> +_DECLARE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS,
> +		  PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]);
> +


> @@ -606,60 +488,42 @@ static int log_store(u32 caller_id, int facility, int level,
>  		     const char *dict, u16 dict_len,
>  		     const char *text, u16 text_len)
>  {
> -	struct printk_log *msg;
> -	u32 size, pad_len;
> +	struct prb_reserved_entry e;
> +	struct printk_record r;
>  	u16 trunc_msg_len = 0;
>  
> -	/* number of '\0' padding bytes to next message */
> -	size = msg_used_size(text_len, dict_len, &pad_len);
> +	r.text_buf_size = text_len;
> +	r.dict_buf_size = dict_len;
>  
> -	if (log_make_free_space(size)) {
> +	if (!prb_reserve(&e, prb, &r)) {
>  		/* truncate the message if it is too long for empty buffer */
> -		size = truncate_msg(&text_len, &trunc_msg_len,
> -				    &dict_len, &pad_len);
> +		truncate_msg(&text_len, &trunc_msg_len, &dict_len);
> +		r.text_buf_size = text_len + trunc_msg_len;
> +		r.dict_buf_size = dict_len;
>  		/* survive when the log buffer is too small for trunc_msg */
> -		if (log_make_free_space(size))
> +		if (!prb_reserve(&e, prb, &r))
>  			return 0;
>  	}
>  
> -	if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
> -		/*
> -		 * This message + an additional empty header does not fit
> -		 * at the end of the buffer. Add an empty header with len == 0
> -		 * to signify a wrap around.
> -		 */
> -		memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
> -		log_next_idx = 0;
> -	}
> -
>  	/* fill message */
> -	msg = (struct printk_log *)(log_buf + log_next_idx);
> -	memcpy(log_text(msg), text, text_len);
> -	msg->text_len = text_len;
> -	if (trunc_msg_len) {
> -		memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
> -		msg->text_len += trunc_msg_len;

Note that the old code updates msg->text_len.


> -	}
> -	memcpy(log_dict(msg), dict, dict_len);
> -	msg->dict_len = dict_len;
> -	msg->facility = facility;
> -	msg->level = level & 7;
> -	msg->flags = flags & 0x1f;
> +	memcpy(&r.text_buf[0], text, text_len);
> +	if (trunc_msg_len)
> +		memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);

The new one just appends the string.


> +	if (r.dict_buf)
> +		memcpy(&r.dict_buf[0], dict, dict_len);
> +	r.info->facility = facility;
> +	r.info->level = level & 7;
> +	r.info->flags = flags & 0x1f;
>  	if (ts_nsec > 0)
> -		msg->ts_nsec = ts_nsec;
> +		r.info->ts_nsec = ts_nsec;
>  	else
> -		msg->ts_nsec = local_clock();
> -#ifdef CONFIG_PRINTK_CALLER
> -	msg->caller_id = caller_id;
> -#endif
> -	memset(log_dict(msg) + dict_len, 0, pad_len);
> -	msg->len = size;
> +		r.info->ts_nsec = local_clock();
> +	r.info->caller_id = caller_id;
>  
>  	/* insert message */
> -	log_next_idx += msg->len;
> -	log_next_seq++;
> +	prb_commit(&e);
>  
> -	return msg->text_len;
> +	return text_len;

So, this should be text_len + trunc_msg_len.


>  }
>  
>  int dmesg_restrict = IS_ENABLED(CONFIG_SECURITY_DMESG_RESTRICT);
> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>  
>  	/* This stops the holder of console_sem just where we want him */
>  	logbuf_lock_irqsave(flags);
> -	curr_log_seq = log_next_seq;
> +	pending_output = !prb_read_valid(prb, console_seq, NULL);
>  	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> -	pending_output = (curr_log_seq != log_next_seq);
> +	pending_output &= prb_read_valid(prb, console_seq, NULL);

The original code checked whether vprintk_store() stored the text
into the main log buffer or only into the cont buffer.

The new code checks whether the console is behind, which is something
different.

I prefer to call wake_up_klogd() directly from log_output() or
log_store() instead. It might later be used to wake up
printk kthreads as well.

It was done this way because consoles were historically preferred
over userspace loggers. But the difference will be smaller when
consoles are handled by a kthread.


>  	logbuf_unlock_irqrestore(flags);
>  
>  	/* If called from the scheduler, we can not call up(). */
> @@ -2406,35 +2405,28 @@ void console_unlock(void)
>  	}
>  
>  	for (;;) {
> -		struct printk_log *msg;
>  		size_t ext_len = 0;
> -		size_t len;
> +		size_t len = 0;
>  
>  		printk_safe_enter_irqsave(flags);
>  		raw_spin_lock(&logbuf_lock);
> -		if (console_seq < log_first_seq) {
> +skip:
> +		if (!prb_read_valid(prb, console_seq, &console_record))
> +			break;
> +
> +		if (console_seq < console_record.info->seq) {
>  			len = sprintf(text,
>  				      "** %llu printk messages dropped **\n",
> -				      log_first_seq - console_seq);
> -
> -			/* messages are gone, move to first one */
> -			console_seq = log_first_seq;
> -			console_idx = log_first_idx;
> -		} else {
> -			len = 0;
> +				      console_record.info->seq - console_seq);
>  		}
> -skip:
> -		if (console_seq == log_next_seq)
> -			break;
> +		console_seq = console_record.info->seq;

This code suggests that it might be possible to get
console_seq > console_record.info->seq and that we just
ignore it. I would prefer to make it explicit:

		if (console_seq != console_record.info->seq) {
			len = sprintf(text,
				      "** %llu printk messages dropped **\n",
			      console_record.info->seq - console_seq);
			console_seq = console_record.info->seq;
		}





> -		msg = log_from_idx(console_idx);
> -		if (suppress_message_printing(msg->level)) {
> +		if (suppress_message_printing(console_record.info->level)) {
>  			/*
>  			 * Skip record we have buffered and already printed
>  			 * directly to the console when we received it, and
>  			 * record that has level above the console loglevel.
>  			 */
> -			console_idx = log_next(console_idx);
>  			console_seq++;
>  			goto skip;
>  		}

Otherwise, it looks reasonable.

Best Regards,
Petr

PS: I still have to look at the VMCORE interface, do some testing,
and look at the changes in the 1st patch against the previous version.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-17 11:13                   ` John Ogness
@ 2020-02-17 14:50                     ` Petr Mladek
  2020-02-25 19:27                       ` John Ogness
  0 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-17 14:50 UTC (permalink / raw)
  To: John Ogness
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Mon 2020-02-17 12:13:25, John Ogness wrote:
> On 2020-02-14, Petr Mladek <pmladek@suse.com> wrote:
> >> I overlooked that devkmsg_open() sets up a printk_record and so I
> >> did not add the extra NULL initialization of text_line_count. There
> >> should be an initializer function/macro to avoid this danger.
> >> 
> >> John Ogness
> >> 
> >> The quick fixup:
> >> 
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index d0d24ee1d1f4..5ad67ff60cd9 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -883,6 +883,7 @@ static int devkmsg_open(struct inode *inode, struct file *file)
> >>  	user->record.text_buf_size = sizeof(user->text_buf);
> >>  	user->record.dict_buf = &user->dict_buf[0];
> >>  	user->record.dict_buf_size = sizeof(user->dict_buf);
> >> +	user->record.text_line_count = NULL;
> >
> > The NULL pointer hidden in the structure also complicates reading
> > the code. It is less obvious when the same function is called only
> > to get the size/count and when it is called to get real data.
> 
> OK.
> 
> > I played with it and created an extra function to get this information.
> >
> > In addition, I had problems following the code in
> > record_print_text_inline(). So I tried to reuse the new function
> > and the existing record_print_text() there.
> >
> > Please find below the patch that I ended up with. I booted a system
> > with this patch, but I suspect that I did not actually exercise
> > record_print_text_inline(). So it might be buggy.
> 
> Yes, there are several bugs. But I see where you want to go with this:
> 
> - introduce prb_count_lines() to handle line counting
> 
> - introduce prb_read_valid_info() for only reading meta-data and getting
>   the line count
> 
> - also use prb_count_lines() internally

In addition, I would like to share the code between
record_print_text_inline() and record_print_text().

They both do a very similar thing and the logic is far from
trivial.

An alternative solution would be to get rid of record_print_text()
and use record_print_text_inline() everywhere. It would have some
advantages:

  + the _inline() variant will get real testing
  + no code duplication
  + saving the extra buffer also in the console, sysfs, and devkmsg
    interfaces


> I will include these changes in v2. I will still introduce the static
> inlines to initialize records because readers and writers do it
> differently.

Sounds good.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 58+ messages in thread

* crashdump: Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-14 13:50     ` John Ogness
  2020-02-15  4:15       ` lijiang
@ 2020-02-17 15:40       ` Petr Mladek
  2020-02-17 16:14         ` John Ogness
  1 sibling, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-17 15:40 UTC (permalink / raw)
  To: John Ogness
  Cc: lijiang, Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Fri 2020-02-14 14:50:02, John Ogness wrote:
> Hi Lianbo,
> 
> On 2020-02-14, lijiang <lijiang@redhat.com> wrote:
> >> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> >> index 1ef6f75d92f1..d0d24ee1d1f4 100644
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> @@ -1062,21 +928,16 @@ void log_buf_vmcoreinfo_setup(void)
> >>  {
> >>  	VMCOREINFO_SYMBOL(log_buf);
> >>  	VMCOREINFO_SYMBOL(log_buf_len);
> >
> > I notice that the "prb"(printk tb static) symbol is not exported into
> > vmcoreinfo as follows:
> >
> > +	VMCOREINFO_SYMBOL(prb);
> >
> > Should the "prb"(printk tb static) symbol be exported into vmcoreinfo?
> > Otherwise, do you happen to know how to walk through the log_buf and
> > get all kernel logs from vmcore?
> 
> You are correct. This will need to be exported as well so that the
> descriptors can be accessed. (log_buf is only the pure human-readable
> text.) I am currently hacking the crash tool to see exactly what needs
> to be made available in order to access all the data of the ringbuffer.

I am not sure which parts you are working on. Are you also going to
provide a patch for makedumpfile, please? I get the following failure
when creating the crashdump using:

    echo c >/proc/sysrq-trigger


The kernel version is not supported.
The makedumpfile operation may be incomplete.
dump_dmesg: Can't find variable-length record symbols
makedumpfile Failed.
Running makedumpfile --dump-dmesg /proc/vmcore failed (1).


Best Regards,
Petr


* Re: crashdump: Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-17 15:40       ` crashdump: " Petr Mladek
@ 2020-02-17 16:14         ` John Ogness
  0 siblings, 0 replies; 58+ messages in thread
From: John Ogness @ 2020-02-17 16:14 UTC (permalink / raw)
  To: Petr Mladek
  Cc: lijiang, Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-17, Petr Mladek <pmladek@suse.com> wrote:
>>> Should the "prb"(printk tb static) symbol be exported into
>>> vmcoreinfo?  Otherwise, do you happen to know how to walk through
>>> the log_buf and get all kernel logs from vmcore?
>> 
>> You are correct. This will need to be exported as well so that the
>> descriptors can be accessed. (log_buf is only the pure human-readable
>> text.) I am currently hacking the crash tool to see exactly what
>> needs to be made available in order to access all the data of the
>> ringbuffer.
>
> I am not sure which parts you are working on. Are you also going to
> provide a patch for makedumpfile, please?

I'm working on crash first. makedumpfile is on my list as well.

> I get the following failure when creating the crashdump using:
>
>     echo c >/proc/sysrq-trigger
>
>
> The kernel version is not supported.
> The makedumpfile operation may be incomplete.
> dump_dmesg: Can't find variable-length record symbols
> makedumpfile Failed.
> Running makedumpfile --dump-dmesg /proc/vmcore failed (1).

Yes, the symbols have changed (and some are missing). I will get this
sorted out for v2. And I will provide some heavily hacked code for crash
and makedumpfile to show that the necessary symbols are there and it
works.

John Ogness


* more barriers: Re: [PATCH 1/2] printk: add lockless buffer
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
  2020-01-29  3:53   ` Steven Rostedt
@ 2020-02-21 11:54   ` Petr Mladek
  2020-02-27 12:04     ` John Ogness
  2020-02-21 12:05   ` misc nits " Petr Mladek
  2 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-21 11:54 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Hi,

the new full barrier in desc_reserve() made me think more about
the existing ones.

If I get it correctly, the cmpxchg_relaxed() variants used here do
not provide full barriers. They only prevent parallel
manipulation of the modified variable.

Because of this, I think that we need some more barriers to synchronize
reads and writes of the tail/head values of the three ring buffers.
See below for more details.

It is possible that some of the barriers are superfluous because
some read barriers are hidden in desc_read(). But I think that
barriers are sometimes needed even before the first read or
after the last read in desc_read().


On Tue 2020-01-28 17:25:47, John Ogness wrote:
> Introduce a multi-reader multi-writer lockless ringbuffer for storing
> the kernel log messages. Readers and writers may use their API from
> any context (including scheduler and NMI). This ringbuffer will make
> it possible to decouple printk() callers from any context, locking,
> or console constraints. It also makes it possible for readers to have
> full access to the ringbuffer contents at any time and context (for
> example from any panic situation).
> 
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index 000000000000..796257f226ee
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> +/*
> + * Take a given descriptor out of the committed state by attempting
> + * the transition from committed to reusable. Either this task or some
> + * other task will have been successful.
> + */
> +static void desc_make_reusable(struct prb_desc_ring *desc_ring,
> +			       unsigned long id)
> +{
> +	struct prb_desc *desc = to_desc(desc_ring, id);
> +	atomic_long_t *state_var = &desc->state_var;
> +	unsigned long val_committed = id | DESC_COMMITTED_MASK;
> +	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
> +
> +	atomic_long_cmpxchg_relaxed(state_var, val_committed,
> val_reusable);

IMHO, we should add smp_wmb() here to make sure that the reusable
state is written before we shuffle the desc_ring->tail_id/head_id.

It would pair with the read part of smp_mb() in desc_reserve()
before the extra check if the descriptor is really in reusable state.


> +}
> +
> +/*
> + * For a given data ring (text or dict) and its current tail lpos:
> + * for each data block up until @lpos, make the associated descriptor
> + * reusable.
> + *
> + * If there is any problem making the associated descriptor reusable,
> + * either the descriptor has not yet been committed or another writer
> + * task has already pushed the tail lpos past the problematic data
> + * block. Regardless, on error the caller can re-load the tail lpos
> + * to determine the situation.
> + */
> +static bool data_make_reusable(struct printk_ringbuffer *rb,
> +			       struct prb_data_ring *data_ring,
> +			       unsigned long tail_lpos, unsigned long lpos,
> +			       unsigned long *lpos_out)
> +{
> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> +	struct prb_data_blk_lpos *blk_lpos;
> +	struct prb_data_block *blk;
> +	enum desc_state d_state;
> +	struct prb_desc desc;
> +	unsigned long id;
> +
> +	/*
> +	 * Using the provided @data_ring, point @blk_lpos to the correct
> +	 * blk_lpos within the local copy of the descriptor.
> +	 */
> +	if (data_ring == &rb->text_data_ring)
> +		blk_lpos = &desc.text_blk_lpos;
> +	else
> +		blk_lpos = &desc.dict_blk_lpos;
> +
> +	/* Loop until @tail_lpos has advanced to or beyond @lpos. */
> +	while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
> +		blk = to_block(data_ring, tail_lpos);

IMHO, we need smp_rmb() here to make sure that we read the blk->id
that was written after pushing tail_lpos.

It would pair with the write barrier in data_alloc() before
writing blk->id. It is there after updating head_lpos.
But head_lpos can be updated only after updating tail_lpos.
See the comment in data_alloc() below.

> +		id = READ_ONCE(blk->id);


> +
> +		d_state = desc_read(desc_ring, id,
> +				    &desc); /* LMM(data_make_reusable:A) */
> +
> +		switch (d_state) {
> +		case desc_miss:
> +			return false;
> +		case desc_reserved:
> +			return false;
> +		case desc_committed:
> +			/*
> +			 * This data block is invalid if the descriptor
> +			 * does not point back to it.
> +			 */
> +			if (blk_lpos->begin != tail_lpos)
> +				return false;
> +			desc_make_reusable(desc_ring, id);
> +			break;
> +		case desc_reusable:
> +			/*
> +			 * This data block is invalid if the descriptor
> +			 * does not point back to it.
> +			 */
> +			if (blk_lpos->begin != tail_lpos)
> +				return false;
> +			break;
> +		}
> +
> +		/* Advance @tail_lpos to the next data block. */
> +		tail_lpos = blk_lpos->next;
> +	}
> +
> +	*lpos_out = tail_lpos;
> +
> +	return true;
> +}
> +
> +/*
> + * Advance the data ring tail to at least @lpos. This function puts all
> + * descriptors into the reusable state if the tail will be pushed beyond
> + * their associated data block.
> + */
> +static bool data_push_tail(struct printk_ringbuffer *rb,
> +			   struct prb_data_ring *data_ring,
> +			   unsigned long lpos)
> +{
> +	unsigned long tail_lpos;
> +	unsigned long next_lpos;
> +
> +	/* If @lpos is not valid, there is nothing to do. */
> +	if (lpos == INVALID_LPOS)
> +		return true;
> +
> +	tail_lpos = atomic_long_read(&data_ring->tail_lpos);
> +
> +	do {
> +		/* If @lpos is no longer valid, there is nothing to do. */
> +		if (lpos - tail_lpos >= DATA_SIZE(data_ring))
> +			break;
> +
> +		/*
> +		 * Make all descriptors reusable that are associated with
> +		 * data blocks before @lpos.
> +		 */
> +		if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
> +					&next_lpos)) {
> +			/*
> +			 * data_make_reusable() performed state loads. Make
> +			 * sure they are loaded before reloading the tail lpos
> +			 * in order to see a new tail in the case that the
> +			 * descriptor has been recycled. This pairs with
> +			 * desc_reserve:A.
> +			 */
> +			smp_rmb(); /* LMM(data_push_tail:A) */
> +
> +			/*
> +			 * Reload the tail lpos.
> +			 *
> +			 * Memory barrier involvement:
> +			 *
> +			 * No possibility of missing a recycled descriptor.
> +			 * If data_make_reusable:A reads from desc_reserve:B,
> +			 * then data_push_tail:B reads from desc_push_tail:A.
> +			 *
> +			 * Relies on:
> +			 *
> +			 * MB from desc_push_tail:A to desc_reserve:B
> +			 *    matching
> +			 * RMB from data_make_reusable:A to data_push_tail:B
> +			 */
> +			next_lpos = atomic_long_read(&data_ring->tail_lpos
> +						); /* LMM(data_push_tail:B) */
> +			if (next_lpos == tail_lpos)
> +				return false;
> +
> +			/* Another task pushed the tail. Try again. */
> +			tail_lpos = next_lpos;
> +		}
> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->tail_lpos,
> +			&tail_lpos, next_lpos)); /* can be relaxed? */

IMHO, we need smp_wmb() here so that others see the updated
data_ring->tail_lpos before this thread allocates the space
by pushing head_lpos.

It would be paired with a read barrier in data_alloc() between
reading head_lpos and tail_lpos, see below.

> +
> +	return true;
> +}
> +
> +/*
> + * Advance the desc ring tail. This function advances the tail by one
> + * descriptor, thus invalidating the oldest descriptor. Before advancing
> + * the tail, the tail descriptor is made reusable and all data blocks up to
> + * and including the descriptor's data block are invalidated (i.e. the data
> + * ring tail is pushed past the data block of the descriptor being made
> + * reusable).
> + */
> +static bool desc_push_tail(struct printk_ringbuffer *rb,
> +			   unsigned long tail_id)
> +{
> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> +	enum desc_state d_state;
> +	struct prb_desc desc;
> +
> +	d_state = desc_read(desc_ring, tail_id, &desc);
> +
> +	switch (d_state) {
> +	case desc_miss:
> +		/*
> +		 * If the ID is exactly 1 wrap behind the expected, it is
> +		 * in the process of being reserved by another writer and
> +		 * must be considered reserved.
> +		 */
> +		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
> +		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
> +			return false;
> +		}
> +		return true;
> +	case desc_reserved:
> +		return false;
> +	case desc_committed:
> +		desc_make_reusable(desc_ring, tail_id);
> +		break;
> +	case desc_reusable:
> +		break;
> +	}
> +
> +	/*
> +	 * Data blocks must be invalidated before their associated
> +	 * descriptor can be made available for recycling. Invalidating
> +	 * them later is not possible because there is no way to trust
> +	 * data blocks once their associated descriptor is gone.
> +	 */
> +
> +	if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
> +		return false;
> +	if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
> +		return false;
> +
> +	/* The data ring tail(s) were pushed: LMM(desc_push_tail:A) */
> +
> +	/*
> +	 * Check the next descriptor after @tail_id before pushing the tail to
> +	 * it because the tail must always be in a committed or reusable
> +	 * state. The implementation of prb_first_seq() relies on this.
> +	 *
> +	 * A successful read implies that the next descriptor is less than or
> +	 * equal to @head_id so there is no risk of pushing the tail past the
> +	 * head.
> +	 */
> +	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1),
> +			    &desc); /* LMM(desc_push_tail:B) */
> +	if (d_state == desc_committed || d_state == desc_reusable) {
> +		atomic_long_cmpxchg_relaxed(&desc_ring->tail_id, tail_id,
> +			DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:C) */

IMHO, we need smp_wmb() here so that everyone sees the updated
desc_ring->tail_id before we push the head as well.

It would pair with the read barrier in desc_reserve() between reading
tail_id and head_id.

> +	} else {
> +		/*
> +		 * Guarantee the last state load from desc_read() is before
> +		 * reloading @tail_id in order to see a new tail in the case
> +		 * that the descriptor has been recycled. This pairs with
> +		 * desc_reserve:A.
> +		 */
> +		smp_rmb(); /* LMM(desc_push_tail:D) */
> +
> +		/*
> +		 * Re-check the tail ID. The descriptor following @tail_id is
> +		 * not in an allowed tail state. But if the tail has since
> +		 * been moved by another task, then it does not matter.
> +		 *
> +		 * Memory barrier involvement:
> +		 *
> +		 * No possibility of missing a pushed tail.
> +		 * If desc_push_tail:B reads from desc_reserve:B, then
> +		 * desc_push_tail:E reads from desc_push_tail:C.
> +		 *
> +		 * Relies on:
> +		 *
> +		 * MB from desc_push_tail:C to desc_reserve:B
> +		 *    matching
> +		 * RMB from desc_push_tail:B to desc_push_tail:E
> +		 */
> +		if (atomic_long_read(&desc_ring->tail_id) ==
> +					tail_id) { /* LMM(desc_push_tail:E) */
> +			return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> +static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
> +{
> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> +	unsigned long prev_state_val;
> +	unsigned long id_prev_wrap;
> +	struct prb_desc *desc;
> +	unsigned long head_id;
> +	unsigned long id;
> +
> +	head_id = atomic_long_read(&desc_ring->head_id);
> +
> +	do {
> +		desc = to_desc(desc_ring, head_id);
> +
> +		id = DESC_ID(head_id + 1);
> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);

IMHO, we need smp_rmb() here to guarantee reading head_id before
desc_ring->tail_id.

It would pair with the write barrier in desc_push_tail() after updating
tail_id, see above.

> +
> +		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
> +			/*
> +			 * Make space for the new descriptor by
> +			 * advancing the tail.
> +			 */
> +			if (!desc_push_tail(rb, id_prev_wrap))
> +				return false;
> +		}
> +	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
> +						  &head_id, id));
> +
> +	/*
> +	 * Guarantee any data ring tail changes are stored before recycling
> +	 * the descriptor. A full memory barrier is needed since another
> +	 * task may have pushed the data ring tails. This pairs with
> +	 * data_push_tail:A.
> +	 *
> +	 * Guarantee a new tail ID is stored before recycling the descriptor.
> +	 * A full memory barrier is needed since another task may have pushed
> +	 * the tail ID. This pairs with desc_push_tail:D and prb_first_seq:C.
> +	 */
> +	smp_mb(); /* LMM(desc_reserve:A) */

I am a bit confused by the full barrier here. The description is not
clear. All three tags (data_push_tail:A, desc_push_tail:D and
prb_first_seq:C) refer to read barriers. This would suggest that a
write barrier would be enough here.

OK, this barrier is between writing desc_ring->head_id and
reading/writing desc->state_var.

A write barrier here would require code that reads
desc->state_var before reading the head_id or tail_id of the desc
or data rings when checking whether the descriptor was
reused. It seems that all the mentioned pairing
read barriers are correct. So the above description of
the write barrier part looks correct.

Now, the question is why a read barrier would be needed
here. The only reason might be the check of desc->state_var.
The pairing write barrier should allow reusing of the descriptor.
For this, we might need to add a write barrier either into
prb_commit() or desc_make_reusable() after updating
the state variable.

We check here if the descriptor is really reusable. So it should
be enough to add a write barrier into desc_make_reusable().


> +
> +	desc = to_desc(desc_ring, id);
> +
> +	/* If the descriptor has been recycled, verify the old state val. */
> +	prev_state_val = atomic_long_read(&desc->state_var);
> +	if (prev_state_val && prev_state_val != (id_prev_wrap |
> +						 DESC_COMMITTED_MASK |
> +						 DESC_REUSE_MASK)) {
> +		WARN_ON_ONCE(1);
> +		return false;
> +	}
> +
> +	/* Assign the descriptor a new ID and set its state to reserved. */
> +	if (!atomic_long_try_cmpxchg_relaxed(&desc->state_var,
> +			&prev_state_val, id | 0)) { /* LMM(desc_reserve:B) */
> +		WARN_ON_ONCE(1);
> +		return false;
> +	}
> +
> +	/*
> +	 * Guarantee the new descriptor ID and state is stored before making
> +	 * any other changes. This pairs with desc_read:D.
> +	 */
> +	smp_wmb(); /* LMM(desc_reserve:C) */
> +
> +	/* Now data in @desc can be modified: LMM(desc_reserve:D) */
> +
> +	*id_out = id;
> +	return true;
> +}
> +
> +/*
> + * Allocate a new data block, invalidating the oldest data block(s)
> + * if necessary. This function also associates the data block with
> + * a specified descriptor.
> + */
> +static char *data_alloc(struct printk_ringbuffer *rb,
> +			struct prb_data_ring *data_ring, unsigned long size,
> +			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> +{
> +	struct prb_data_block *blk;
> +	unsigned long begin_lpos;
> +	unsigned long next_lpos;
> +
> +	if (!data_ring->data || size == 0) {
> +		/* Specify a data-less block. */
> +		blk_lpos->begin = INVALID_LPOS;
> +		blk_lpos->next = INVALID_LPOS;
> +		return NULL;
> +	}
> +
> +	size = to_blk_size(size);
> +
> +	begin_lpos = atomic_long_read(&data_ring->head_lpos);
> +
> +	do {
> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> +

IMHO, we need smp_rmb() here to read begin_lpos before we read
tail_lpos in data_push_tail().

It would pair with a write barrier in data_push_tail() after
updating data_ring->tail_lpos.

> +		if (!data_push_tail(rb, data_ring,
> +				    next_lpos - DATA_SIZE(data_ring))) {
> +			/* Failed to allocate, specify a data-less block. */
> +			blk_lpos->begin = INVALID_LPOS;
> +			blk_lpos->next = INVALID_LPOS;
> +			return NULL;
> +		}
> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
> +						  &begin_lpos, next_lpos));
> +

IMHO, we need smp_wmb() here to guarantee that others see the updated
data_ring->head_lpos before we write anything into the data buffer.

It would pair with a read barrier in data_make_reusable()
between reading tail_lpos and blk->id.


> +	blk = to_block(data_ring, begin_lpos);
> +	blk->id = id;
> +
> +	if (DATA_WRAPS(data_ring, begin_lpos) !=
> +	    DATA_WRAPS(data_ring, next_lpos)) {
> +		/* Wrapping data blocks store their data at the beginning. */
> +		blk = to_block(data_ring, 0);
> +		blk->id = id;
> +	}
> +
> +	blk_lpos->begin = begin_lpos;
> +	blk_lpos->next = next_lpos;
> +
> +	return &blk->data[0];
> +}

Best Regards,
Petr


* misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-01-28 16:19 ` [PATCH 1/2] printk: add lockless buffer John Ogness
  2020-01-29  3:53   ` Steven Rostedt
  2020-02-21 11:54   ` more barriers: " Petr Mladek
@ 2020-02-21 12:05   ` Petr Mladek
  2020-03-02 10:38     ` John Ogness
  2 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-02-21 12:05 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Hi,

there are a few more small things that caught my eye during review.
They are from the nits department.

On Tue 2020-01-28 17:25:47, John Ogness wrote:
> Introduce a multi-reader multi-writer lockless ringbuffer for storing
> the kernel log messages. Readers and writers may use their API from
> any context (including scheduler and NMI). This ringbuffer will make
> it possible to decouple printk() callers from any context, locking,
> or console constraints. It also makes it possible for readers to have
> full access to the ringbuffer contents at any time and context (for
> example from any panic situation).
> 
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> new file mode 100644
> index 000000000000..796257f226ee
> --- /dev/null
> +++ b/kernel/printk/printk_ringbuffer.c
> +/**
> + * DOC: printk_ringbuffer overview

I really like the overview.

> +/* A data block: maps to the raw data within the data ring. */
> +struct prb_data_block {
> +	unsigned long	id;
> +	char		data[0];
> +};
> +
> +
> +static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
> +				       unsigned long begin_lpos)
> +{
> +	char *data = &data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
> +
> +	return (struct prb_data_block *)data;

Nit: Please, use "blk" instead of "data". I was slightly confused
because "data" is also one member of struct prb_data_block.


> +/* The possible responses of a descriptor state-query. */
> +enum desc_state {
> +	desc_miss,	/* ID mismatch */
> +	desc_reserved,	/* reserved, but still in use by writer */
> +	desc_committed, /* committed, writer is done */
> +	desc_reusable,	/* free, not used by any writer */

s/not used/not yet used/

> +};

[...]

> +EXPORT_SYMBOL(prb_reserve);

Please, do not export symbols if there are no plans to actually
use them from modules. It will be easier to rework the code
in the future. Nobody would need to worry about external
users.

Please, do so everywhere in the patchset.

> +/*
> + * Given @blk_lpos, return a pointer to the raw data from the data block
> + * and calculate the size of the data part. A NULL pointer is returned
> + * if @blk_lpos specifies values that could never be legal.
> + *
> + * This function (used by readers) performs strict validation on the lpos
> + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is
> + * triggered if an internal error is detected.
> + */
> +static char *get_data(struct prb_data_ring *data_ring,
> +		      struct prb_data_blk_lpos *blk_lpos,
> +		      unsigned long *data_size)
> +{
> +	struct prb_data_block *db;
> +
> +	/* Data-less data block description. */
> +	if (blk_lpos->begin == INVALID_LPOS &&
> +	    blk_lpos->next == INVALID_LPOS) {
> +		return NULL;

Nit: There is no need for "else" after return. checkpatch.pl usually
complains about it ;-)

> +
> +	/* Regular data block: @begin less than @next and in same wrap. */
> +	} else if (DATA_WRAPS(data_ring, blk_lpos->begin) ==
> +		   DATA_WRAPS(data_ring, blk_lpos->next) &&
> +		   blk_lpos->begin < blk_lpos->next) {
> +		db = to_block(data_ring, blk_lpos->begin);
> +		*data_size = blk_lpos->next - blk_lpos->begin;
> +
> +	/* Wrapping data block: @begin is one wrap behind @next. */
> +	} else if (DATA_WRAPS(data_ring,
> +			      blk_lpos->begin + DATA_SIZE(data_ring)) ==
> +		   DATA_WRAPS(data_ring, blk_lpos->next)) {
> +		db = to_block(data_ring, 0);
> +		*data_size = DATA_INDEX(data_ring, blk_lpos->next);
> +
> +	/* Illegal block description. */
> +	} else {
> +		WARN_ON_ONCE(1);
> +		return NULL;
> +	}
> +
> +	/* A valid data block will always be aligned to the ID size. */
> +	if (WARN_ON_ONCE(blk_lpos->begin !=
> +			 ALIGN(blk_lpos->begin, sizeof(db->id))) ||
> +	    WARN_ON_ONCE(blk_lpos->next !=
> +			 ALIGN(blk_lpos->next, sizeof(db->id)))) {
> +		return NULL;
> +	}
> +
> +	/* A valid data block will always have at least an ID. */
> +	if (WARN_ON_ONCE(*data_size < sizeof(db->id)))
> +		return NULL;
> +
> +	/* Subtract descriptor ID space from size to reflect data size. */
> +	*data_size -= sizeof(db->id);
> +
> +	return &db->data[0];
> +}
> +
> +/*
> + * Read the record @id and verify that it is committed and has the sequence
> + * number @seq. On success, 0 is returned.
> + *
> + * Error return values:
> + * -EINVAL: A committed record @seq does not exist.
> + * -ENOENT: The record @seq exists, but its data is not available. This is a
> + *          valid record, so readers should continue with the next seq.
> + */
> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
> +			       unsigned long id, u64 seq,
> +			       struct prb_desc *desc)
> +{

I was confused a few times about whether this function reads the
descriptor in a safe way or not.

Please, rename it to make it clear that it does only a check.
For example, check_state_commited().

> +	enum desc_state d_state;
> +
> +	d_state = desc_read(desc_ring, id, desc);
> +	if (desc->info.seq != seq)
> +		return -EINVAL;
> +	else if (d_state == desc_reusable)
> +		return -ENOENT;
> +	else if (d_state != desc_committed)
> +		return -EINVAL;
> +
> +	return 0;
> +}

Best Regards,
Petr

PS: I am sorry that the review took me so much time. I was sick and
had some other work. And I wanted to have a free mind when thinking
about this lockless stuff. I think that it actually helped me to
realize the need for more barriers discussed in the other thread.


* Re: [PATCH 0/2] printk: replace ringbuffer
  2020-02-17 14:50                     ` Petr Mladek
@ 2020-02-25 19:27                       ` John Ogness
  0 siblings, 0 replies; 58+ messages in thread
From: John Ogness @ 2020-02-25 19:27 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, lijiang, Peter Zijlstra,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-17, Petr Mladek <pmladek@suse.com> wrote:
> An alternative solution would be to get rid of record_print_text()
> and use record_print_text_inline() everywhere. It would have some
> advantages:
>
>   + _inline() variant will get real testing
>   + no code duplication
>   + saving the extra buffer also in console, sysfs, and devkmsg
>     interface.

In preparation for my v2, I implemented this alternate approach. Rather
than introducing record_print_text_inline(), I changed
record_print_text() to work inline and also it will no longer handle the
counting case. The callers of record_print_text() for counting will now
call the new counting functions. IMHO it is a nice cleanup and also
removes the static printk_record structs for console and syslog.

Thanks.

John Ogness


* Re: misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-17 14:41   ` misc details: " Petr Mladek
@ 2020-02-25 20:11     ` John Ogness
  2020-02-26  9:54       ` Petr Mladek
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-02-25 20:11 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

>> - Record meta-data is now stored in a separate array of descriptors.
>>   This is an additional 72 * (2 ^ ((CONFIG_LOG_BUF_SHIFT - 6))) bytes
>>   for the static array and 72 * (2 ^ ((log_buf_len - 6))) bytes for
>>   the dynamic array.
>
> It might help to show some examples. I mean to mention the sizes
> when CONFIG_LOG_BUF_SHIFT is 12 or so.

OK.

>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> - * Every record carries the monotonic timestamp in microseconds, as well as
>> - * the standard userspace syslog level and syslog facility. The usual
>> + * Every record meta-data carries the monotonic timestamp in microseconds, as
>
> I am afraid that we could not guarantee monotonic timestamp because
> the writers are not synchronized. I hope that it will not create
> real problems and we could just remove the word "monotonic" ;-)

I removed "monotonic". I hope userspace doesn't require the ringbuffer
to be chronologically sorted. That would explain why the safe buffers
use bogus timestamps. :-/

>> +/*
>> + * Define the average message size. This only affects the number of
>> + * descriptors that will be available. Underestimating is better than
>> + * overestimating (too many available descriptors is better than not enough).
>> + * The dictionary buffer will be the same size as the text buffer.
>> + */
>> +#define PRB_AVGBITS 6
>
> Do I get it correctly that '6' means 2^6 = 64 characters?

Correct.

> Some ugly counting on my test systems shows the average 49 chars:
>
> $> dmesg | cut -d ']' -f 2- | wc -c
> 30172
> $> dmesg | cut -d ']' -f 2- | wc -l
> 612
> $> echo $((30172 / 612))
> 49
>
> If I get it correctly then lower number is the more safe side.
> So, a more safe default should be 5?

For v2 the value will be lowered to 5.

>> -	if (log_make_free_space(size)) {
>> +	if (!prb_reserve(&e, prb, &r)) {
>>  		/* truncate the message if it is too long for empty buffer */
>> -		size = truncate_msg(&text_len, &trunc_msg_len,
>> -				    &dict_len, &pad_len);
>> +		truncate_msg(&text_len, &trunc_msg_len, &dict_len);
>> +		r.text_buf_size = text_len + trunc_msg_len;

Note that the additional space for the trunc_msg_len is being reserved.

>> +		r.dict_buf_size = dict_len;
>>  		/* survive when the log buffer is too small for trunc_msg */
>> -		if (log_make_free_space(size))
>> +		if (!prb_reserve(&e, prb, &r))
>>  			return 0;
>>  	}
>>  
>> -	if (log_next_idx + size + sizeof(struct printk_log) > log_buf_len) {
>> -		/*
>> -		 * This message + an additional empty header does not fit
>> -		 * at the end of the buffer. Add an empty header with len == 0
>> -		 * to signify a wrap around.
>> -		 */
>> -		memset(log_buf + log_next_idx, 0, sizeof(struct printk_log));
>> -		log_next_idx = 0;
>> -	}
>> -
>>  	/* fill message */
>> -	msg = (struct printk_log *)(log_buf + log_next_idx);
>> -	memcpy(log_text(msg), text, text_len);
>> -	msg->text_len = text_len;
>> -	if (trunc_msg_len) {
>> -		memcpy(log_text(msg) + text_len, trunc_msg, trunc_msg_len);
>> -		msg->text_len += trunc_msg_len;
>
> Note that the old code updates msg->text_len.

msg->text_len is equivalent to r.info->text_len, which was already set
by the prb_reserve() (and already includes the trunc_msg_len).

>> -	}
>> -	memcpy(log_dict(msg), dict, dict_len);
>> -	msg->dict_len = dict_len;
>> -	msg->facility = facility;
>> -	msg->level = level & 7;
>> -	msg->flags = flags & 0x1f;
>> +	memcpy(&r.text_buf[0], text, text_len);
>> +	if (trunc_msg_len)
>> +		memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len);
>
> The new one just appends the string.

That is all it needs to do here.

>> +	if (r.dict_buf)
>> +		memcpy(&r.dict_buf[0], dict, dict_len);
>> +	r.info->facility = facility;
>> +	r.info->level = level & 7;
>> +	r.info->flags = flags & 0x1f;
>>  	if (ts_nsec > 0)
>> -		msg->ts_nsec = ts_nsec;
>> +		r.info->ts_nsec = ts_nsec;
>>  	else
>> -		msg->ts_nsec = local_clock();
>> -#ifdef CONFIG_PRINTK_CALLER
>> -	msg->caller_id = caller_id;
>> -#endif
>> -	memset(log_dict(msg) + dict_len, 0, pad_len);
>> -	msg->len = size;
>> +		r.info->ts_nsec = local_clock();
>> +	r.info->caller_id = caller_id;
>>  
>>  	/* insert message */
>> -	log_next_idx += msg->len;
>> -	log_next_seq++;
>> +	prb_commit(&e);
>>  
>> -	return msg->text_len;
>> +	return text_len;
>
> So, this should be text_len + trunc_msg_len.

Good catch! Yes. Fixed for v2. Thank you.

(Note that simply returning r.info->text_len is not allowed because the
writer must not access that data after calling prb_commit()).

>> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
>>  
>>  	/* This stops the holder of console_sem just where we want him */
>>  	logbuf_lock_irqsave(flags);
>> -	curr_log_seq = log_next_seq;
>> +	pending_output = !prb_read_valid(prb, console_seq, NULL);
>>  	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
>> -	pending_output = (curr_log_seq != log_next_seq);
>> +	pending_output &= prb_read_valid(prb, console_seq, NULL);
>
> The original code checked whether vprintk_store() stored the text
> into the main log buffer or only into the cont buffer.
>
> The new code checks whether console is behind which is something
> different.

I would argue that they are the same thing in this context. Keep in mind
that we are under the logbuf_lock. If there was previously nothing
pending and now there is, this context is the only one that could have
added it.

This logic will change significantly when we remove the locks (and it
will disappear once we go to kthreads). But we aren't that far at this
stage and I'd like to keep the general logic somewhat close to the
current mainline implementation for now.

> I prefer to call wake_up_klogd() directly from log_output() or
> log_store() instead. It might later be used to wake up
> printk kthreads as well.
>
> It was done this way because consoles were historically  preferred
> over userspace loggers. But the difference will be lower when
> consoles are handled by kthread.

Agreed, but that is something I would like to save for a later
series. Right now I only want to replace the ringbuffer without
rearranging priorities.

>> -skip:
>> -		if (console_seq == log_next_seq)
>> -			break;
>> +		console_seq = console_record.info->seq;
>
> This code suggests that it might be possible to get
> console_seq > console_record.info->seq and we just
> ignore it. I prefer to make it clear by:
>
> 		if (console_seq != console_record.info->seq) {

OK.

Thanks for your help.

John Ogness

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: misc details: Re: [PATCH 2/2] printk: use the lockless ringbuffer
  2020-02-25 20:11     ` John Ogness
@ 2020-02-26  9:54       ` Petr Mladek
  0 siblings, 0 replies; 58+ messages in thread
From: Petr Mladek @ 2020-02-26  9:54 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Tue 2020-02-25 21:11:31, John Ogness wrote:
> >> --- a/kernel/printk/printk.c
> >> +++ b/kernel/printk/printk.c
> >> - * Every record carries the monotonic timestamp in microseconds, as well as
> >> - * the standard userspace syslog level and syslog facility. The usual
> >> + * Every record meta-data carries the monotonic timestamp in microseconds, as
> >
> > I am afraid that we could not guarantee a monotonic timestamp because
> > the writers are not synchronized. I hope that it will not create
> > real problems and we could just remove the word "monotonic" ;-)
> 
> I removed "monotonic". I hope userspace doesn't require the ringbuffer
> to be chronologically sorted. That would explain why the safe buffers
> use bogus timestamps. :-/

The timestamp was not stored into the safe buffers to keep the code
simple. And people request that proper timestamps be added from time
to time.

IMHO, the precise timestamps are more important than ordering. So
people should love the lockless ringbuffer from this POV ;-)


> >> @@ -1974,9 +1966,9 @@ asmlinkage int vprintk_emit(int facility, int level,
> >>  
> >>  	/* This stops the holder of console_sem just where we want him */
> >>  	logbuf_lock_irqsave(flags);
> >> -	curr_log_seq = log_next_seq;
> >> +	pending_output = !prb_read_valid(prb, console_seq, NULL);
> >>  	printed_len = vprintk_store(facility, level, dict, dictlen, fmt, args);
> >> -	pending_output = (curr_log_seq != log_next_seq);
> >> +	pending_output &= prb_read_valid(prb, console_seq, NULL);
> >
> > The original code checked whether vprintk_store() stored the text
> > into the main log buffer or only into the cont buffer.
> >
> > The new code checks whether console is behind which is something
> > different.
> 
> I would argue that they are the same thing in this context. Keep in mind
> that we are under the logbuf_lock. If there was previously nothing
> pending and now there is, this context is the only one that could have
> added it.

Right.

> This logic will change significantly when we remove the locks (and it
> will disappear once we go to kthreads). But we aren't that far at this
> stage and I'd like to keep the general logic somewhat close to the
> current mainline implementation for now.

OK, it is not a big deal from my POV. It is just an optimization.
It can be removed or improved later.

It caught my eye primarily because prb_read_valid() is a relatively
complex function. I was not sure if it was worth the effort. But
I am fine with keeping your code for now. It will help to reduce
unrelated behavior changes.

Best Regards,
Petr


* Re: more barriers: Re: [PATCH 1/2] printk: add lockless buffer
  2020-02-21 11:54   ` more barriers: " Petr Mladek
@ 2020-02-27 12:04     ` John Ogness
  2020-03-04 15:08       ` Petr Mladek
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-02-27 12:04 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-21, Petr Mladek <pmladek@suse.com> wrote:
> If I get it correctly, the used cmpxchg_relaxed() variants do not
> provide full barriers. They are just able to prevent parallel
> manipulation of the modified variable.

Correct.

I purposely avoided the full barriers of a successful cmpxchg() so that
we could clearly specify what we needed and why. As Andrea pointed out
[0], we need to understand if/when we require those memory barriers.

Once we've identified these, we may want to fold some of those barriers
back in, going from cmpxchg_relaxed() back to cmpxchg(). In particular
when we see patterns like:

    do {
        ....
    } while (!try_cmpxchg_relaxed());
    smp_mb();

or possibly:

    smp_mb();
    cmpxchg_relaxed(); /* no return value check */

> On Tue 2020-01-28 17:25:47, John Ogness wrote:
>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>> new file mode 100644
>> index 000000000000..796257f226ee
>> --- /dev/null
>> +++ b/kernel/printk/printk_ringbuffer.c
>> +/*
>> + * Take a given descriptor out of the committed state by attempting
>> + * the transition from committed to reusable. Either this task or some
>> + * other task will have been successful.
>> + */
>> +static void desc_make_reusable(struct prb_desc_ring *desc_ring,
>> +			       unsigned long id)
>> +{
>> +	struct prb_desc *desc = to_desc(desc_ring, id);
>> +	atomic_long_t *state_var = &desc->state_var;
>> +	unsigned long val_committed = id | DESC_COMMITTED_MASK;
>> +	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
>> +
>> +	atomic_long_cmpxchg_relaxed(state_var, val_committed,
>> val_reusable);
>
> IMHO, we should add smp_wmb() here to make sure that the reusable
> state is written before we shuffle the desc_ring->tail_id/head_id.
>
> It would pair with the read part of smp_mb() in desc_reserve()
> before the extra check if the descriptor is really in reusable state.

Yes. Now that we added the extra state checking in desc_reserve(), this
ordering has become important.

However, for this case I would prefer to instead place a full memory
barrier immediately before @tail_id is incremented (in
desc_push_tail()). The tail-incrementing-task must have seen the
reusable state (even if it is not the one that set it) and an
incremented @tail_id must be visible to the task recycling a descriptor.

>> +}
>> +
>> +/*
>> + * For a given data ring (text or dict) and its current tail lpos:
>> + * for each data block up until @lpos, make the associated descriptor
>> + * reusable.
>> + *
>> + * If there is any problem making the associated descriptor reusable,
>> + * either the descriptor has not yet been committed or another writer
>> + * task has already pushed the tail lpos past the problematic data
>> + * block. Regardless, on error the caller can re-load the tail lpos
>> + * to determine the situation.
>> + */
>> +static bool data_make_reusable(struct printk_ringbuffer *rb,
>> +			       struct prb_data_ring *data_ring,
>> +			       unsigned long tail_lpos, unsigned long lpos,
>> +			       unsigned long *lpos_out)
>> +{
>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> +	struct prb_data_blk_lpos *blk_lpos;
>> +	struct prb_data_block *blk;
>> +	enum desc_state d_state;
>> +	struct prb_desc desc;
>> +	unsigned long id;
>> +
>> +	/*
>> +	 * Using the provided @data_ring, point @blk_lpos to the correct
>> +	 * blk_lpos within the local copy of the descriptor.
>> +	 */
>> +	if (data_ring == &rb->text_data_ring)
>> +		blk_lpos = &desc.text_blk_lpos;
>> +	else
>> +		blk_lpos = &desc.dict_blk_lpos;
>> +
>> +	/* Loop until @tail_lpos has advanced to or beyond @lpos. */
>> +	while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
>> +		blk = to_block(data_ring, tail_lpos);
>
> IMHO, we need smp_rmb() here to make sure that we read the blk->id
> that was written after pushing the tail_lpos.
>
> It would pair with the write barrier in data_alloc() before
> writing blk->id. It is there after updating head_lpos.
> But head_lpos could be updated only after updating tail_lpos.
> See the comment in data_alloc() below.

I do not understand. @blk->id has a data dependency on the provided
@tail_lpos. A random @tail_lpos value could be passed to this function
and it will only make a descriptor state change if the associated
descriptor is in the committed state and points back to that @tail_lpos
value. That is always legal.

If the old @blk->id value is read (just before data_alloc() writes it),
then the following desc_read() will return with desc_miss. That is
correct. If the new @blk->id value is read (just after data_alloc()
writes it), desc_read() will return with desc_reserved. This is also
correct. Why would this code care about @head_lpos or @tail_lpos
ordering to @blk->id? Please explain.

>> +		id = READ_ONCE(blk->id);
>> +
>> +		d_state = desc_read(desc_ring, id,
>> +				    &desc); /* LMM(data_make_reusable:A) */
>> +
>> +		switch (d_state) {
>> +		case desc_miss:
>> +			return false;
>> +		case desc_reserved:
>> +			return false;
>> +		case desc_committed:
>> +			/*
>> +			 * This data block is invalid if the descriptor
>> +			 * does not point back to it.
>> +			 */
>> +			if (blk_lpos->begin != tail_lpos)
>> +				return false;
>> +			desc_make_reusable(desc_ring, id);
>> +			break;
>> +		case desc_reusable:
>> +			/*
>> +			 * This data block is invalid if the descriptor
>> +			 * does not point back to it.
>> +			 */
>> +			if (blk_lpos->begin != tail_lpos)
>> +				return false;
>> +			break;
>> +		}
>> +
>> +		/* Advance @tail_lpos to the next data block. */
>> +		tail_lpos = blk_lpos->next;
>> +	}
>> +
>> +	*lpos_out = tail_lpos;
>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Advance the data ring tail to at least @lpos. This function puts all
>> + * descriptors into the reusable state if the tail will be pushed beyond
>> + * their associated data block.
>> + */
>> +static bool data_push_tail(struct printk_ringbuffer *rb,
>> +			   struct prb_data_ring *data_ring,
>> +			   unsigned long lpos)
>> +{
>> +	unsigned long tail_lpos;
>> +	unsigned long next_lpos;
>> +
>> +	/* If @lpos is not valid, there is nothing to do. */
>> +	if (lpos == INVALID_LPOS)
>> +		return true;
>> +
>> +	tail_lpos = atomic_long_read(&data_ring->tail_lpos);
>> +
>> +	do {
>> +		/* If @lpos is no longer valid, there is nothing to do. */
>> +		if (lpos - tail_lpos >= DATA_SIZE(data_ring))
>> +			break;
>> +
>> +		/*
>> +		 * Make all descriptors reusable that are associated with
>> +		 * data blocks before @lpos.
>> +		 */
>> +		if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
>> +					&next_lpos)) {
>> +			/*
>> +			 * data_make_reusable() performed state loads. Make
>> +			 * sure they are loaded before reloading the tail lpos
>> +			 * in order to see a new tail in the case that the
>> +			 * descriptor has been recycled. This pairs with
>> +			 * desc_reserve:A.
>> +			 */
>> +			smp_rmb(); /* LMM(data_push_tail:A) */
>> +
>> +			/*
>> +			 * Reload the tail lpos.
>> +			 *
>> +			 * Memory barrier involvement:
>> +			 *
>> +			 * No possibility of missing a recycled descriptor.
>> +			 * If data_make_reusable:A reads from desc_reserve:B,
>> +			 * then data_push_tail:B reads from desc_push_tail:A.
>> +			 *
>> +			 * Relies on:
>> +			 *
>> +			 * MB from desc_push_tail:A to desc_reserve:B
>> +			 *    matching
>> +			 * RMB from data_make_reusable:A to data_push_tail:B
>> +			 */
>> +			next_lpos = atomic_long_read(&data_ring->tail_lpos
>> +						); /* LMM(data_push_tail:B) */
>> +			if (next_lpos == tail_lpos)
>> +				return false;
>> +
>> +			/* Another task pushed the tail. Try again. */
>> +			tail_lpos = next_lpos;
>> +		}
>> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->tail_lpos,
>> +			&tail_lpos, next_lpos)); /* can be relaxed? */
>
> IMHO, we need smp_wmb() here so that others see the updated
> data_ring->tail_lpos before this thread allocates the space
> by pushing head_pos.
>
> It would be paired with a read barrier in data_alloc() between
> reading head_lpos and tail_lpos, see below.

data_push_tail() is the only function that concerns itself with
@tail_lpos. Its cmpxchg-loop will prevent any unintended consequences.
And it uses the memory barrier pair data_push_tail:A/desc_reserve:A to
make sure that @tail_lpos reloads will successfully identify a changed
@tail_lpos due to descriptor recycling (which is the only reason that
@tail_lpos changes).

Why is it a problem if the movement of @head_lpos is seen before the
movement of @tail_lpos? Please explain.

>> +
>> +	return true;
>> +}
>> +
>> +/*
>> + * Advance the desc ring tail. This function advances the tail by one
>> + * descriptor, thus invalidating the oldest descriptor. Before advancing
>> + * the tail, the tail descriptor is made reusable and all data blocks up to
>> + * and including the descriptor's data block are invalidated (i.e. the data
>> + * ring tail is pushed past the data block of the descriptor being made
>> + * reusable).
>> + */
>> +static bool desc_push_tail(struct printk_ringbuffer *rb,
>> +			   unsigned long tail_id)
>> +{
>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> +	enum desc_state d_state;
>> +	struct prb_desc desc;
>> +
>> +	d_state = desc_read(desc_ring, tail_id, &desc);
>> +
>> +	switch (d_state) {
>> +	case desc_miss:
>> +		/*
>> +		 * If the ID is exactly 1 wrap behind the expected, it is
>> +		 * in the process of being reserved by another writer and
>> +		 * must be considered reserved.
>> +		 */
>> +		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
>> +		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
>> +			return false;
>> +		}
>> +		return true;
>> +	case desc_reserved:
>> +		return false;
>> +	case desc_committed:
>> +		desc_make_reusable(desc_ring, tail_id);
>> +		break;
>> +	case desc_reusable:
>> +		break;
>> +	}
>> +
>> +	/*
>> +	 * Data blocks must be invalidated before their associated
>> +	 * descriptor can be made available for recycling. Invalidating
>> +	 * them later is not possible because there is no way to trust
>> +	 * data blocks once their associated descriptor is gone.
>> +	 */
>> +
>> +	if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
>> +		return false;
>> +	if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
>> +		return false;
>> +
>> +	/* The data ring tail(s) were pushed: LMM(desc_push_tail:A) */
>> +
>> +	/*
>> +	 * Check the next descriptor after @tail_id before pushing the tail to
>> +	 * it because the tail must always be in a committed or reusable
>> +	 * state. The implementation of prb_first_seq() relies on this.
>> +	 *
>> +	 * A successful read implies that the next descriptor is less than or
>> +	 * equal to @head_id so there is no risk of pushing the tail past the
>> +	 * head.
>> +	 */
>> +	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1),
>> +			    &desc); /* LMM(desc_push_tail:B) */
>> +	if (d_state == desc_committed || d_state == desc_reusable) {
>> +		atomic_long_cmpxchg_relaxed(&desc_ring->tail_id, tail_id,
>> +			DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:C) */
>
> IMHO, we need smp_wmb() here so that everyone see updated
> desc_ring->tail_id before we push the head as well.
>
> It would pair with read barrier in desc_reserve() between reading
> tail_id and head_id.

Good catch! This secures what is probably the most critical point in
the design: when desc_reserve() recognizes that it needs to push the
descriptor tail.

>> +	} else {
>> +		/*
>> +		 * Guarantee the last state load from desc_read() is before
>> +		 * reloading @tail_id in order to see a new tail in the case
>> +		 * that the descriptor has been recycled. This pairs with
>> +		 * desc_reserve:A.
>> +		 */
>> +		smp_rmb(); /* LMM(desc_push_tail:D) */
>> +
>> +		/*
>> +		 * Re-check the tail ID. The descriptor following @tail_id is
>> +		 * not in an allowed tail state. But if the tail has since
>> +		 * been moved by another task, then it does not matter.
>> +		 *
>> +		 * Memory barrier involvement:
>> +		 *
>> +		 * No possibility of missing a pushed tail.
>> +		 * If desc_push_tail:B reads from desc_reserve:B, then
>> +		 * desc_push_tail:E reads from desc_push_tail:C.
>> +		 *
>> +		 * Relies on:
>> +		 *
>> +		 * MB from desc_push_tail:C to desc_reserve:B
>> +		 *    matching
>> +		 * RMB from desc_push_tail:B to desc_push_tail:E
>> +		 */
>> +		if (atomic_long_read(&desc_ring->tail_id) ==
>> +					tail_id) { /* LMM(desc_push_tail:E) */
>> +			return false;
>> +		}
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>> +static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
>> +{
>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>> +	unsigned long prev_state_val;
>> +	unsigned long id_prev_wrap;
>> +	struct prb_desc *desc;
>> +	unsigned long head_id;
>> +	unsigned long id;
>> +
>> +	head_id = atomic_long_read(&desc_ring->head_id);
>> +
>> +	do {
>> +		desc = to_desc(desc_ring, head_id);
>> +
>> +		id = DESC_ID(head_id + 1);
>> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>
> IMHO, we need smp_rmb() here to guarantee reading head_id before
> desc_ring->tail_id.
>
> It would pair with write barrier in desc_push_tail() after updating
> tail_id, see above.

Ack. Critical.

>> +
>> +		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
>> +			/*
>> +			 * Make space for the new descriptor by
>> +			 * advancing the tail.
>> +			 */
>> +			if (!desc_push_tail(rb, id_prev_wrap))
>> +				return false;
>> +		}
>> +	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
>> +						  &head_id, id));
>> +
>> +	/*
>> +	 * Guarantee any data ring tail changes are stored before recycling
>> +	 * the descriptor. A full memory barrier is needed since another
>> +	 * task may have pushed the data ring tails. This pairs with
>> +	 * data_push_tail:A.
>> +	 *
>> +	 * Guarantee a new tail ID is stored before recycling the descriptor.
>> +	 * A full memory barrier is needed since another task may have pushed
>> +	 * the tail ID. This pairs with desc_push_tail:D and prb_first_seq:C.
>> +	 */
>> +	smp_mb(); /* LMM(desc_reserve:A) */
>
> I am a bit confused by the full barrier here. The description is not
> clear. All the three tags (data_push_tail:A, desc_push_tail:D and
> prb_first_seq:C) refers read barriers. This would suggest that write
> barrier would be enough here.

The above comment section states twice why a full memory barrier is
needed: those writes may not have come from this task. We are not only
ordering the visible writes that this task performed, we are also
ordering the visible writes that this task has observed. Here is a
litmus test demonstrating this:

C full-mb-test

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int tmp_x;

	tmp_x = READ_ONCE(*x);
	if (tmp_x) {
		smp_mb();
		WRITE_ONCE(*y, 1);
	}
}

P2(int *x, int *y)
{
	int tmp_x;
	int tmp_y;

	tmp_y = READ_ONCE(*y);
	smp_rmb();
	tmp_x = READ_ONCE(*x);
}

exists (2:tmp_x=0 /\ 2:tmp_y=1)

Running it yields:

$ herd7 -conf linux-kernel.cfg full-mb-test.litmus 
Test full-mb-test Allowed
States 3
2:tmp_x=0; 2:tmp_y=0;
2:tmp_x=1; 2:tmp_y=0;
2:tmp_x=1; 2:tmp_y=1;
No
Witnesses
Positive: 0 Negative: 5
Condition exists (2:tmp_x=0 /\ 2:tmp_y=1)
Observation full-mb-test Never 0 5
Time full-mb-test 0.00
Hash=3a3ae98db0154d29a2854b01ed30ec81

> OK, this barrier is between writing desc_ring->head_id and
> reading/writing desc->state_var.
>
> A write barrier here would require a code that reads
> desc->state_var before reading head_id, tail_id of desc
> or data rings when they check if the descriptor was
> reused before. It seems that all the mentioned paring
> read barriers are correct. So the above description of
> the write barrier part looks correct.
>
> Now, the question is why the read barrier would be needed
> here.

What read barrier? This is a full barrier. A full barrier is _not_
equivalent to:

    smp_wmb();
    smp_rmb();

If the smp_mb() in the above litmus test is changed to smp_wmb(), the
test error-condition would exist. Adding an additional smp_rmb() would
still result in the error-condition. A full memory barrier is needed
here. (An acquire/release would be more efficient, but I am avoiding
those on purpose, sticking with the "better understood" memory
barriers.)
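
For completeness, the failing variant described above can be written as
its own litmus test, i.e. the same test with P1's smp_mb() weakened to
smp_wmb(); as stated, the error-condition (2:tmp_x=0 /\ 2:tmp_y=1) is
then observable:

C full-mb-test-wmb

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int tmp_x;

	tmp_x = READ_ONCE(*x);
	if (tmp_x) {
		smp_wmb();
		WRITE_ONCE(*y, 1);
	}
}

P2(int *x, int *y)
{
	int tmp_x;
	int tmp_y;

	tmp_y = READ_ONCE(*y);
	smp_rmb();
	tmp_x = READ_ONCE(*x);
}

exists (2:tmp_x=0 /\ 2:tmp_y=1)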

> The only reason might be the check of the desc->state_var.
> The pairing write barrier should allow reusing of the descriptor.
> For this, we might need to add a write barrier either into
> prb_commit() or desc_make_reusable() after updating
> the state variable.
>
> We check here if the descriptor is really reusable. So it should
> be enough to add write barrier into desc_make_reusable().

As mentioned above, I would put the smp_mb() before updating the
@tail_id. That would pair with this smp_mb() and avoid the false
positive on the @state_var check.

>> +
>> +	desc = to_desc(desc_ring, id);
>> +
>> +	/* If the descriptor has been recycled, verify the old state val. */
>> +	prev_state_val = atomic_long_read(&desc->state_var);
>> +	if (prev_state_val && prev_state_val != (id_prev_wrap |
>> +						 DESC_COMMITTED_MASK |
>> +						 DESC_REUSE_MASK)) {
>> +		WARN_ON_ONCE(1);
>> +		return false;
>> +	}
>> +
>> +	/* Assign the descriptor a new ID and set its state to reserved. */
>> +	if (!atomic_long_try_cmpxchg_relaxed(&desc->state_var,
>> +			&prev_state_val, id | 0)) { /* LMM(desc_reserve:B) */
>> +		WARN_ON_ONCE(1);
>> +		return false;
>> +	}
>> +
>> +	/*
>> +	 * Guarantee the new descriptor ID and state is stored before making
>> +	 * any other changes. This pairs with desc_read:D.
>> +	 */
>> +	smp_wmb(); /* LMM(desc_reserve:C) */
>> +
>> +	/* Now data in @desc can be modified: LMM(desc_reserve:D) */
>> +
>> +	*id_out = id;
>> +	return true;
>> +}
>> +
>> +/*
>> + * Allocate a new data block, invalidating the oldest data block(s)
>> + * if necessary. This function also associates the data block with
>> + * a specified descriptor.
>> + */
>> +static char *data_alloc(struct printk_ringbuffer *rb,
>> +			struct prb_data_ring *data_ring, unsigned long size,
>> +			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
>> +{
>> +	struct prb_data_block *blk;
>> +	unsigned long begin_lpos;
>> +	unsigned long next_lpos;
>> +
>> +	if (!data_ring->data || size == 0) {
>> +		/* Specify a data-less block. */
>> +		blk_lpos->begin = INVALID_LPOS;
>> +		blk_lpos->next = INVALID_LPOS;
>> +		return NULL;
>> +	}
>> +
>> +	size = to_blk_size(size);
>> +
>> +	begin_lpos = atomic_long_read(&data_ring->head_lpos);
>> +
>> +	do {
>> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
>> +
>
> IMHO, we need smp_rmb() here to read begin_lpos before we read
> tail_lpos in data_push_tail()
>
> It would pair with a write barrier in data_push_tail() after
> updating data_ring->tail_lpos.

Please explain why this pair is necessary. What is the scenario that
needs to be avoided?

>> +		if (!data_push_tail(rb, data_ring,
>> +				    next_lpos - DATA_SIZE(data_ring))) {
>> +			/* Failed to allocate, specify a data-less block. */
>> +			blk_lpos->begin = INVALID_LPOS;
>> +			blk_lpos->next = INVALID_LPOS;
>> +			return NULL;
>> +		}
>> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
>> +						  &begin_lpos, next_lpos));
>> +
>
> IMHO, we need smp_wmb() here to guarantee that others see the updated
> data_ring->head_lpos before we write anything into the data buffer.
>
> It would pair with a read barrier in data_make_reusable
> between reading tail_lpos and blk->id in data_make_reusable().

Please explain why this pair is necessary. What is the scenario that
needs to be avoided?

>> +	blk = to_block(data_ring, begin_lpos);
>> +	blk->id = id;
>> +
>> +	if (DATA_WRAPS(data_ring, begin_lpos) !=
>> +	    DATA_WRAPS(data_ring, next_lpos)) {
>> +		/* Wrapping data blocks store their data at the beginning. */
>> +		blk = to_block(data_ring, 0);
>> +		blk->id = id;
>> +	}
>> +
>> +	blk_lpos->begin = begin_lpos;
>> +	blk_lpos->next = next_lpos;
>> +
>> +	return &blk->data[0];
>> +}

John Ogness

[0] https://lkml.kernel.org/r/20191221142235.GA7824@andrea


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-02-21 12:05   ` misc nits " Petr Mladek
@ 2020-03-02 10:38     ` John Ogness
  2020-03-02 12:17       ` Joe Perches
  2020-03-02 12:32       ` Petr Mladek
  0 siblings, 2 replies; 58+ messages in thread
From: John Ogness @ 2020-03-02 10:38 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-02-21, Petr Mladek <pmladek@suse.com> wrote:
>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>> new file mode 100644
>> index 000000000000..796257f226ee
>> --- /dev/null
>> +++ b/kernel/printk/printk_ringbuffer.c
>> +static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
>> +				       unsigned long begin_lpos)
>> +{
>> +	char *data = &data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
>> +
>> +	return (struct prb_data_block *)data;
>
> Nit: Please, use "blk" instead of "data". I was slightly confused
> because "data" is also one member of struct prb_data_block.

OK.

>> +/* The possible responses of a descriptor state-query. */
>> +enum desc_state {
>> +	desc_miss,	/* ID mismatch */
>> +	desc_reserved,	/* reserved, but still in use by writer */
>> +	desc_committed, /* committed, writer is done */
>> +	desc_reusable,	/* free, not used by any writer */
>
> s/not used/not yet used/

OK.

>> +EXPORT_SYMBOL(prb_reserve);
>
> Please, do not export symbols if there are no plans to actually
> use them from modules. It will be easier to rework the code
> in the future. Nobody would need to worry about external
> users.
>
> Please, do so everywhere in the patchset.

You are correct.

The reason I exported them is that I could run my test module. But since
the test module will not be part of the kernel source, I'll just hack
the exports in when doing my testing.

>> +static char *get_data(struct prb_data_ring *data_ring,
>> +		      struct prb_data_blk_lpos *blk_lpos,
>> +		      unsigned long *data_size)
>> +{
>> +	struct prb_data_block *db;
>> +
>> +	/* Data-less data block description. */
>> +	if (blk_lpos->begin == INVALID_LPOS &&
>> +	    blk_lpos->next == INVALID_LPOS) {
>> +		return NULL;
>
> Nit: There is no need for "else" after return. checkpatch.pl usually
> complains about it ;-)

OK.

>> +/*
>> + * Read the record @id and verify that it is committed and has the sequence
>> + * number @seq. On success, 0 is returned.
>> + *
>> + * Error return values:
>> + * -EINVAL: A committed record @seq does not exist.
>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
>> + *          valid record, so readers should continue with the next seq.
>> + */
>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
>> +			       unsigned long id, u64 seq,
>> +			       struct prb_desc *desc)
>> +{
>
> I was a few times confused about whether this function reads the
> descriptor in a safe way or not.
>
> Please, rename it to make it clear that it does only a check.
> For example, check_state_committed().

This function _does_ read. It is a helper function of prb_read() to
_read_ the descriptor. It is an extended version of desc_read() that
also performs various checks that the descriptor is committed.

I will update the function description to be more similar to desc_read()
so that it is obvious that it is "getting a copy of a specified
descriptor".

John Ogness


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-02 10:38     ` John Ogness
@ 2020-03-02 12:17       ` Joe Perches
  2020-03-02 12:32       ` Petr Mladek
  1 sibling, 0 replies; 58+ messages in thread
From: Joe Perches @ 2020-03-02 12:17 UTC (permalink / raw)
  To: John Ogness, Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Mon, 2020-03-02 at 11:38 +0100, John Ogness wrote:
> On 2020-02-21, Petr Mladek <pmladek@suse.com> wrote:
> > > diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
[]
> > > +static struct prb_data_block *to_block(struct prb_data_ring *data_ring,
> > > +				       unsigned long begin_lpos)
> > > +{
> > > +	char *data = &data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
> > > +
> > > +	return (struct prb_data_block *)data;
> > 
> > Nit: Please, use "blk" instead of "data". I was slightly confused
> > because "data" is also one member of struct prb_data_block.
> 
> OK.

trivia:

Perhaps use void * instead of char * and a direct return
and avoid the naming altogether.

static struct prb_data_block *to_block(struct prb_data_ring *data_ring, 
				       unsigned long begin_lpos)
{
	return (void *)&data_ring->data[DATA_INDEX(data_ring, begin_lpos)];
}



* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-02 10:38     ` John Ogness
  2020-03-02 12:17       ` Joe Perches
@ 2020-03-02 12:32       ` Petr Mladek
  2020-03-02 13:43         ` John Ogness
  1 sibling, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-03-02 12:32 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Mon 2020-03-02 11:38:42, John Ogness wrote:
> On 2020-02-21, Petr Mladek <pmladek@suse.com> wrote:
> >> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> >> new file mode 100644
> >> index 000000000000..796257f226ee
> >> --- /dev/null
> >> +++ b/kernel/printk/printk_ringbuffer.c
> >> +/*
> >> + * Read the record @id and verify that it is committed and has the sequence
> >> + * number @seq. On success, 0 is returned.
> >> + *
> >> + * Error return values:
> >> + * -EINVAL: A committed record @seq does not exist.
> >> + * -ENOENT: The record @seq exists, but its data is not available. This is a
> >> + *          valid record, so readers should continue with the next seq.
> >> + */
> >> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
> >> +			       unsigned long id, u64 seq,
> >> +			       struct prb_desc *desc)
> >> +{
> >
> > I was confused a few times about whether this function reads the
> > descriptor in a safe way or not.
> >
> > Please, rename it to make it clear that it only does a check.
> > For example, check_state_committed().
> 
> This function _does_ read. It is a helper function of prb_read() to
> _read_ the descriptor. It is an extended version of desc_read() that
> also performs various checks that the descriptor is committed.

I see.

> I will update the function description to be more similar to desc_read()
> so that it is obvious that it is "getting a copy of a specified
> descriptor".

OK, what about having desc_read_by_seq() instead?

Also there is a bug in the current desc_read_committed():
desc->info.seq might contain garbage when d_state is desc_miss
or desc_reserved.

I would change it to:

static int
desc_read_by_seq(struct prb_desc_ring *desc_ring,
		 u64 seq, struct prb_desc *desc)
{
	struct prb_desc *rdesc = to_desc(desc_ring, seq);
	atomic_long_t *state_var = &rdesc->state_var;
	unsigned long id = DESC_ID(atomic_long_read(state_var));
	enum desc_state d_state;

	d_state = desc_read(desc_ring, id, desc);
	if (d_state == desc_miss ||
	    d_state == desc_reserved ||
	    desc->info.seq != seq)
		return -EINVAL;

	if (d_state == desc_reusable)
		return -ENOENT;

	if (d_state != desc_committed)
		return -EINVAL;

	return 0;
}

Best Regards,
Petr

PS: I am going to dive into the barriers again to answer the last
letter about them.


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-02 12:32       ` Petr Mladek
@ 2020-03-02 13:43         ` John Ogness
  2020-03-03  9:47           ` Petr Mladek
  2020-03-04  9:40           ` Petr Mladek
  0 siblings, 2 replies; 58+ messages in thread
From: John Ogness @ 2020-03-02 13:43 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-03-02, Petr Mladek <pmladek@suse.com> wrote:
>>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>>>> new file mode 100644
>>>> index 000000000000..796257f226ee
>>>> --- /dev/null
>>>> +++ b/kernel/printk/printk_ringbuffer.c
>>>> +/*
>>>> + * Read the record @id and verify that it is committed and has the sequence
>>>> + * number @seq. On success, 0 is returned.
>>>> + *
>>>> + * Error return values:
>>>> + * -EINVAL: A committed record @seq does not exist.
>>>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
>>>> + *          valid record, so readers should continue with the next seq.
>>>> + */
>>>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
>>>> +			       unsigned long id, u64 seq,
>>>> +			       struct prb_desc *desc)
>>>> +{
>
> OK, what about having desc_read_by_seq() instead?

Well, it isn't actually "reading by seq". @seq is there for additional
verification. Yes, prb_read() is deriving @id from @seq. But it only
does this once and uses that value for both calls.

> Also there is a bug in the current desc_read_committed():
> desc->info.seq might contain garbage when d_state is desc_miss
> or desc_reserved.

It is not a bug. In both of those cases, -EINVAL is the correct return
value.

> I would change it to:
>
> static int
> desc_read_by_seq(struct prb_desc_ring *desc_ring,
> 		 u64 seq, struct prb_desc *desc)
> {
> 	struct prb_desc *rdesc = to_desc(desc_ring, seq);
> 	atomic_long_t *state_var = &rdesc->state_var;
> 	unsigned long id = DESC_ID(atomic_long_read(state_var));

I think it is error-prone to re-read @state_var here. It is lockless
shared data. desc_read_committed() is called twice in prb_read() and it
is expected that both calls are using the same @id.

> 	enum desc_state d_state;
>
> 	d_state = desc_read(desc_ring, id, desc);
> 	if (d_state == desc_miss ||
> 	    d_state == desc_reserved ||
> 	    desc->info.seq != seq)
> 		return -EINVAL;
>
> 	if (d_state == desc_reusable)
> 		return -ENOENT;

I can use this refactoring.

>
> 	if (d_state != desc_committed)
> 		return -EINVAL;

I suppose you meant to remove this check and leave in the @blk_lpos
check instead. If we're trying to minimize lines of code, the @blk_lpos
check could be combined with the "== desc_reusable" check as well.

>
> 	return 0;
> }

Thanks.

John Ogness


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-02 13:43         ` John Ogness
@ 2020-03-03  9:47           ` Petr Mladek
  2020-03-03 15:42             ` John Ogness
  2020-03-04  9:40           ` Petr Mladek
  1 sibling, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-03-03  9:47 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Mon 2020-03-02 14:43:41, John Ogness wrote:
> On 2020-03-02, Petr Mladek <pmladek@suse.com> wrote:
> >>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> >>>> new file mode 100644
> >>>> index 000000000000..796257f226ee
> >>>> --- /dev/null
> >>>> +++ b/kernel/printk/printk_ringbuffer.c
> >>>> +/*
> >>>> + * Read the record @id and verify that it is committed and has the sequence
> >>>> + * number @seq. On success, 0 is returned.
> >>>> + *
> >>>> + * Error return values:
> >>>> + * -EINVAL: A committed record @seq does not exist.
> >>>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
> >>>> + *          valid record, so readers should continue with the next seq.
> >>>> + */
> >>>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
> >>>> +			       unsigned long id, u64 seq,
> >>>> +			       struct prb_desc *desc)
> >>>> +{
> >
> > OK, what about having desc_read_by_seq() instead?
> 
> Well, it isn't actually "reading by seq". @seq is there for additional
> verification. Yes, prb_read() is deriving @id from @seq. But it only
> does this once and uses that value for both calls.

I do not want to nitpick about words. If I understand it correctly,
the "id" is not important here. Any "id" is fine as long as
"seq" matches. Reading "id" once is just an optimization.

I do not insist on the change. It was just an idea for how to
avoid confusion. I was confused more than once. But I might
be the only one. More straightforward code looked more
important to me than the optimization.


> > Also there is a bug in the current desc_read_committed():
> > desc->info.seq might contain garbage when d_state is desc_miss
> > or desc_reserved.
> 
> It is not a bug. In both of those cases, -EINVAL is the correct return
> value.

No, it is a bug. If info is not read and contains garbage then the
following check may pass by chance:

	if (desc->info.seq != seq)
		return -EINVAL;

Then the function would return 0 even when desc_read() returned
desc_miss or desc_reserved.


> > I would change it to:
> >
> > static int
> > desc_read_by_seq(struct prb_desc_ring *desc_ring,
> > 		 u64 seq, struct prb_desc *desc)
> > {
> > 	struct prb_desc *rdesc = to_desc(desc_ring, seq);
> > 	atomic_long_t *state_var = &rdesc->state_var;
> > 	unsigned long id = DESC_ID(atomic_long_read(state_var));
> 
> I think it is error-prone to re-read @state_var here. It is lockless
> shared data. desc_read_committed() is called twice in prb_read() and it
> is expected that both calls are using the same @id.

It is not error prone. If "id" changes then "seq" will not match.

> > 	enum desc_state d_state;
> >
> > 	d_state = desc_read(desc_ring, id, desc);
> > 	if (d_state == desc_miss ||
> > 	    d_state == desc_reserved ||
> > 	    desc->info.seq != seq)
> > 		return -EINVAL;
> >
> > 	if (d_state == desc_reusable)
> > 		return -ENOENT;
> 
> I can use this refactoring.

Yes please, "else" is not needed.

> >
> > 	if (d_state != desc_committed)
> > 		return -EINVAL;
> 
> I suppose you meant to remove this check and leave in the @blk_lpos
> check instead.

Good catch, this check is superfluous.

> If we're trying to minimize lines of code, the @blk_lpos
> check could be combined with the "== desc_reusable" check as well.

Minimizing the lines of code was not my primary goal. I was just
confused by the function name. Also the fact that "seq" was the
important thing was well hidden.

Best Regards,
Petr

PS: I dived into the barriers and got lost. I hope that I will
be able to send something sensible in the end ;-)


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-03  9:47           ` Petr Mladek
@ 2020-03-03 15:42             ` John Ogness
  2020-03-04 10:09               ` Petr Mladek
  0 siblings, 1 reply; 58+ messages in thread
From: John Ogness @ 2020-03-03 15:42 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On 2020-03-03, Petr Mladek <pmladek@suse.com> wrote:
>>>>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..796257f226ee
>>>>>> --- /dev/null
>>>>>> +++ b/kernel/printk/printk_ringbuffer.c
>>>>>> +/*
>>>>>> + * Read the record @id and verify that it is committed and has the sequence
>>>>>> + * number @seq. On success, 0 is returned.
>>>>>> + *
>>>>>> + * Error return values:
>>>>>> + * -EINVAL: A committed record @seq does not exist.
>>>>>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
>>>>>> + *          valid record, so readers should continue with the next seq.
>>>>>> + */
>>>>>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
>>>>>> +			       unsigned long id, u64 seq,
>>>>>> +			       struct prb_desc *desc)
>>>>>> +{
>>>
>>> OK, what about having desc_read_by_seq() instead?
>> 
>> Well, it isn't actually "reading by seq". @seq is there for
>> additional verification. Yes, prb_read() is deriving @id from
>> @seq. But it only does this once and uses that value for both calls.
>
> I do not want to nitpick about words. If I understand it correctly,
> the "id" is not important here. Any "id" is fine as long as
> "seq" matches. Reading "id" once is just an optimization.

Your statement is incorrect. We are not nitpicking about words. I am
trying to clarify what you are misunderstanding.

@id _is_ very important because that is how descriptors are
read. desc_read() takes @id as an argument and it is @id that identifies
the descriptor. @seq is only meta-data within a descriptor. The only
reason @seq is even checked is because of possible ABA issues with @id
on 32-bit systems.

> I do not insist on the change. It was just an idea for how to
> avoid confusion. I was confused more than once. But I might
> be the only one. More straightforward code looked more
> important to me than the optimization.

I am sorry for the confusion. In preparation for v2 I have changed the
function description to:

/*
 * Get a copy of a specified descriptor and verify that the record is
 * committed and has the sequence number @seq. @seq is checked because
 * of possible ABA issues with @id on 32-bit systems. On success, 0 is
 * returned.
 *
 * Error return values:
 * -EINVAL: A committed record @seq does not exist.
 * -ENOENT: The record @seq exists, but its data is not available. This is a
 *          valid record, so readers should continue with the next seq.
 */

This is using the same language as the description of desc_read() so
that it is hopefully clear that desc_read_committed() is an extended
version of desc_read().

>>> Also there is a bug in the current desc_read_committed():
>>> desc->info.seq might contain garbage when d_state is desc_miss
>>> or desc_reserved.
>> 
>> It is not a bug. In both of those cases, -EINVAL is the correct return
>> value.
>
> No, it is a bug. If info is not read and contains garbage then the
> following check may pass by chance:
>
> 	if (desc->info.seq != seq)
> 		return -EINVAL;
>
> Then the function would return 0 even when desc_read() returned
> desc_miss or desc_reserved.

0 cannot be returned. The state is checked. Please let us stop this
bug/non-bug discussion. It is distracting us from clarifying this
function and refactoring it to simplify understanding.

>>> I would change it to:
>>>
>>> static int
>>> desc_read_by_seq(struct prb_desc_ring *desc_ring,
>>> 		 u64 seq, struct prb_desc *desc)
>>> {
>>> 	struct prb_desc *rdesc = to_desc(desc_ring, seq);
>>> 	atomic_long_t *state_var = &rdesc->state_var;
>>> 	unsigned long id = DESC_ID(atomic_long_read(state_var));
>> 
>> I think it is error-prone to re-read @state_var here. It is lockless
>> shared data. desc_read_committed() is called twice in prb_read() and
>> it is expected that both calls are using the same @id.
>
> It is not error prone. If "id" changes then "seq" will not match.

@id is set during prb_reserve(). @seq (being mere meta-data) is set
_afterwards_. Your proposed multiple-deriving of @id from @seq would
work because the _state checks_ would catch it, not because @seq would
necessarily change.

But that logic is backwards. @seq is not what is important here. It is
only meta-data. On 64-bit systems the @seq checks could be safely
removed.

You may want to refer back to your private email [0] from last November
where you asked me to move this code out of prb_read() and into a helper
function. That may clarify what we are talking about (although I hope
the new function description is clear enough).

John Ogness

[0] private: 20191122122724.n6wlummg3ap56mn3@pathway.suse.cz


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-02 13:43         ` John Ogness
  2020-03-03  9:47           ` Petr Mladek
@ 2020-03-04  9:40           ` Petr Mladek
  1 sibling, 0 replies; 58+ messages in thread
From: Petr Mladek @ 2020-03-04  9:40 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Mon 2020-03-02 14:43:41, John Ogness wrote:
> On 2020-03-02, Petr Mladek <pmladek@suse.com> wrote:
> >>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> >>>> new file mode 100644
> >>>> index 000000000000..796257f226ee
> >>>> --- /dev/null
> >>>> +++ b/kernel/printk/printk_ringbuffer.c
> >>>> +/*
> >>>> + * Read the record @id and verify that it is committed and has the sequence
> >>>> + * number @seq. On success, 0 is returned.
> >>>> + *
> >>>> + * Error return values:
> >>>> + * -EINVAL: A committed record @seq does not exist.
> >>>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
> >>>> + *          valid record, so readers should continue with the next seq.
> >>>> + */
> >>>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
> >>>> +			       unsigned long id, u64 seq,
> >>>> +			       struct prb_desc *desc)
> >>>> +{
> >
> > static int
> > desc_read_by_seq(struct prb_desc_ring *desc_ring,
> > 		 u64 seq, struct prb_desc *desc)
> > {
> > 	struct prb_desc *rdesc = to_desc(desc_ring, seq);
> > 	atomic_long_t *state_var = &rdesc->state_var;
> > 	unsigned long id = DESC_ID(atomic_long_read(state_var));
> 
> I think it is error-prone to re-read @state_var here. It is lockless
> shared data. desc_read_committed() is called twice in prb_read() and it
> is expected that both calls are using the same @id.
> 
> > 	enum desc_state d_state;
> >
> > 	d_state = desc_read(desc_ring, id, desc);
> > 	if (d_state == desc_miss ||
> > 	    d_state == desc_reserved ||
> > 	    desc->info.seq != seq)
> > 		return -EINVAL;
> >
> > 	if (d_state == desc_reusable)
> > 		return -ENOENT;
> 
> I can use this refactoring.
> 
> >
> > 	if (d_state != desc_committed)
> > 		return -EINVAL;
> 
> I suppose you meant to remove this check and leave in the @blk_lpos
> check instead. If we're trying to minimize lines of code, the @blk_lpos
> check could be combined with the "== desc_reusable" check as well.

I am an idiot. I missed that the check "d_state != desc_committed"
will also return -EINVAL for desc_miss or desc_reserved.

I was too concentrated on the fact that desc->info.seq was checked
first even though it might contain garbage.

Also, the note about blk_lpos did not help me much. I did not
see how it was related to this code.

To sum up: the original code worked fine. But I would prefer my
variant, which has more lines but is somewhat cleaner.

Best Regards,
Petr


* Re: misc nits Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-03 15:42             ` John Ogness
@ 2020-03-04 10:09               ` Petr Mladek
  0 siblings, 0 replies; 58+ messages in thread
From: Petr Mladek @ 2020-03-04 10:09 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Tue 2020-03-03 16:42:07, John Ogness wrote:
> On 2020-03-03, Petr Mladek <pmladek@suse.com> wrote:
> >>>>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> >>>>>> new file mode 100644
> >>>>>> index 000000000000..796257f226ee
> >>>>>> --- /dev/null
> >>>>>> +++ b/kernel/printk/printk_ringbuffer.c
> >>>>>> +/*
> >>>>>> + * Read the record @id and verify that it is committed and has the sequence
> >>>>>> + * number @seq. On success, 0 is returned.
> >>>>>> + *
> >>>>>> + * Error return values:
> >>>>>> + * -EINVAL: A committed record @seq does not exist.
> >>>>>> + * -ENOENT: The record @seq exists, but its data is not available. This is a
> >>>>>> + *          valid record, so readers should continue with the next seq.
> >>>>>> + */
> >>>>>> +static int desc_read_committed(struct prb_desc_ring *desc_ring,
> >>>>>> +			       unsigned long id, u64 seq,
> >>>>>> +			       struct prb_desc *desc)
> >>>>>> +{
> >>>
> @id _is_ very important because that is how descriptors are
> read. desc_read() takes @id as an argument and it is @id that identifies
> the descriptor. @seq is only meta-data within a descriptor. The only
> reason @seq is even checked is because of possible ABA issues with @id
> on 32-bit systems.

I think that the different view is because I look at this API
from the reader API side. It is called the following way:

prb_read_valid(, seq, )
  _prb_read_valid( , &seq, )
    prb_read( , *seq, )
        # id is read from address defined by seq
	rdesc = dr->descs[seq & MASK];
	id = rdesc->state_var & MASK_ID;

        desc_read_committed( , id, seq, )
	  desc_read( , id, )
	    # desc is the same as rdesc above because
	    # seq & MASK == id & MASK
	    desc = dr->descs[id & MASK];

Note that prb_read_valid() and prb_read() are addressed by seq.

It would be perfectly fine to pass only seq to desc_read_committed()
and read id from inside.

The name desc_read_committed() suggests that the important condition
is that the descriptor is in the committed state. It is not obvious
that seq is important as well.

From my POV, it would be clearer to pass only seq and rename the
function to desc_read_by_seq() or so:

  + seq is enough for addressing
  + function returns true only when the stored seq matches
  + the stored seq is valid only when the state is committed
    or reusable


Please, do not reply to this mail. Either take the idea or keep
the code as is. I could live with it. And it is not important
enough to spend more time on it. I just wanted to explain my view.
But it is obviously just a personal preference.

Best Regards,
Petr


* Re: more barriers: Re: [PATCH 1/2] printk: add lockless buffer
  2020-02-27 12:04     ` John Ogness
@ 2020-03-04 15:08       ` Petr Mladek
  2020-03-13 10:13         ` John Ogness
  0 siblings, 1 reply; 58+ messages in thread
From: Petr Mladek @ 2020-03-04 15:08 UTC (permalink / raw)
  To: John Ogness
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

On Thu 2020-02-27 13:04:09, John Ogness wrote:
> On 2020-02-21, Petr Mladek <pmladek@suse.com> wrote:
> > If I get it correctly, the used cmpxchg_relaxed() variants does not
> > provide full barriers. They are just able to prevent parallel
> > manipulation of the modified variable.
> 
> Correct.
> 
> I purposely avoided the full barriers of a successful cmpxchg() so that
> we could clearly specify what we needed and why. As Andrea pointed out
> [0], we need to understand if/when we require those memory barriers.
> 
> Once we've identified these, we may want to fold some of those barriers
> back in, going from cmpxchg_relaxed() back to cmpxchg(). In particular
> when we see patterns like:
> 
>     do {
>         ....
>     } while (!try_cmpxchg_relaxed());
>     smp_mb();
> 
> or possibly:
> 
>     smp_mb();
>     cmpxchg_relaxed(); /* no return value check */

It seems that we need more barriers than I expected. If we are able to
get rid of them by using cmpxchg() instead of cmpxchg_relaxed() then
it might be quite a simplification.

I have to admit that my understanding of barriers is more incomplete
than I had hoped. I am less and less convinced that my ack is
enough to merge this patch. It would be great if PeterZ or another
expert on barriers could give it a cycle (or maybe wait for the next
version of this patch?).

An alternative solution is to do quite some testing and push it into
linux-next to give it even more testing. It seems that the main
danger is that some messages might get lost. But it should
not crash. Well, I would feel much more comfortable if I weren't
the only reviewer.

> > On Tue 2020-01-28 17:25:47, John Ogness wrote:
> >> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> >> new file mode 100644
> >> index 000000000000..796257f226ee
> >> --- /dev/null
> >> +++ b/kernel/printk/printk_ringbuffer.c
> >> +/*
> >> + * Take a given descriptor out of the committed state by attempting
> >> + * the transition from committed to reusable. Either this task or some
> >> + * other task will have been successful.
> >> + */
> >> +static void desc_make_reusable(struct prb_desc_ring *desc_ring,
> >> +			       unsigned long id)
> >> +{
> >> +	struct prb_desc *desc = to_desc(desc_ring, id);
> >> +	atomic_long_t *state_var = &desc->state_var;
> >> +	unsigned long val_committed = id | DESC_COMMITTED_MASK;
> >> +	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
> >> +
> >> +	atomic_long_cmpxchg_relaxed(state_var, val_committed,
> >> val_reusable);
> >
> > IMHO, we should add smp_wmb() here to make sure that the reusable
> > state is written before we shuffle the desc_ring->tail_id/head_id.
> >
> > It would pair with the read part of smp_mb() in desc_reserve()
> > before the extra check if the descriptor is really in reusable state.
> 
> Yes. Now that we added the extra state checking in desc_reserve(), this
> ordering has become important.
> 
> However, for this case I would prefer to instead place a full memory
> barrier immediately before @tail_id is incremented (in
> desc_push_tail()). The tail-incrementing-task must have seen the
> reusable state (even if it is not the one that set it) and an
> incremented @tail_id must be visible to the task recycling a descriptor.

Ah, the below-mentioned litmus tests for the full barrier in
desc_reserve() opened my eyes to why a full barrier is sometimes
needed instead of a write barrier.

I agree that this is exactly the place where the full barrier is
needed. This write can happen on any CPU, and the write that
depends on this value might be done on another CPU.

Also I agree that desc_push_tail() looks like the right place
for the full barrier because some actions there are done only
when the descriptor is invalidated.

I just wonder whether it should come even before the data_push_tail()
calls. That would make sure that everyone sees the reusable state
before we move the data_ring borders.

Also I wonder whether we need even more full barriers in the code.
There are many more dependent actions that can be done on different
CPUs in parallel.

> >> +}
> >> +
> >> +/*
> >> + * For a given data ring (text or dict) and its current tail lpos:
> >> + * for each data block up until @lpos, make the associated descriptor
> >> + * reusable.
> >> + *
> >> + * If there is any problem making the associated descriptor reusable,
> >> + * either the descriptor has not yet been committed or another writer
> >> + * task has already pushed the tail lpos past the problematic data
> >> + * block. Regardless, on error the caller can re-load the tail lpos
> >> + * to determine the situation.
> >> + */
> >> +static bool data_make_reusable(struct printk_ringbuffer *rb,
> >> +			       struct prb_data_ring *data_ring,
> >> +			       unsigned long tail_lpos, unsigned long lpos,
> >> +			       unsigned long *lpos_out)
> >> +{
> >> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> +	struct prb_data_blk_lpos *blk_lpos;
> >> +	struct prb_data_block *blk;
> >> +	enum desc_state d_state;
> >> +	struct prb_desc desc;
> >> +	unsigned long id;
> >> +
> >> +	/*
> >> +	 * Using the provided @data_ring, point @blk_lpos to the correct
> >> +	 * blk_lpos within the local copy of the descriptor.
> >> +	 */
> >> +	if (data_ring == &rb->text_data_ring)
> >> +		blk_lpos = &desc.text_blk_lpos;
> >> +	else
> >> +		blk_lpos = &desc.dict_blk_lpos;
> >> +
> >> +	/* Loop until @tail_lpos has advanced to or beyond @lpos. */
> >> +	while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
> >> +		blk = to_block(data_ring, tail_lpos);
> >
> > IMHO, we need smp_rmb() here to make sure that we read blk->id
> > that we written after pushing the tail_lpos.
> >
> > It would pair with the write barrier in data_alloc() before
> > before writing blk->id. It is there after updating head_lpos.
> > But head_lpos could be updated only after updating tail_lpos.
> > See the comment in data_alloc() below.
> 
> I do not understand. @blk->id has a data dependency on the provided
> @tail_lpos. A random @tail_lpos value could be passed to this function
> and it will only make a descriptor state change if the associated
> descriptor is in the committed state and points back to that @tail_lpos
> value. That is always legal.
> 
> If the old @blk->id value is read (just before data_alloc() writes it),
> then the following desc_read() will return with desc_miss. That is
> correct. If the new @blk->id value is read (just after data_alloc()
> writes it), desc_read() will return with desc_reserved. This is also
> correct. Why would this code care about @head_lpos or @tail_lpos
> ordering to @blk->id? Please explain.

OK, my proposal does not make much sense. You know, I felt that there
should be smp_wmb() after each atomic_long_try_cmpxchg_relaxed() to
synchronize changes in the other variables.

I added smp_wmb() after the cmpxchg() in data_alloc() and then looked
for where the related smp_rmb() might be. This looked promising because
it was the only location where we read blk->id ;-)

But it does not make sense. The smp_wmb() in data_alloc() was before
writing blk->id. So the corresponding smp_rmb() should be after
reading blk->id.

> >> +		id = READ_ONCE(blk->id);

This raises the question of whether an smp_rmb() would make sense here
to make sure that desc_read() sees the descriptor state that allowed
allocating this space.

It would pair with the smp_wmb() in desc_reserve() right after
setting desc->state_var to the newly reserved descriptor id.
It is the barrier that allows modifying the reserved space,
including writing the reserved id into the later-reserved blk->id.

Note that desc_read() does not have smp_rmb() before the first
read of the state_var. It might theoretically see an outdated
value.

> >> +
> >> +		d_state = desc_read(desc_ring, id,
> >> +				    &desc); /* LMM(data_make_reusable:A) */
> >> +
> >> +		switch (d_state) {
> >> +		case desc_miss:
> >> +			return false;
> >> +		case desc_reserved:
> >> +			return false;
> >> +		case desc_committed:
> >> +			/*
> >> +			 * This data block is invalid if the descriptor
> >> +			 * does not point back to it.
> >> +			 */
> >> +			if (blk_lpos->begin != tail_lpos)
> >> +				return false;
> >> +			desc_make_reusable(desc_ring, id);
> >> +			break;
> >> +		case desc_reusable:
> >> +			/*
> >> +			 * This data block is invalid if the descriptor
> >> +			 * does not point back to it.
> >> +			 */
> >> +			if (blk_lpos->begin != tail_lpos)
> >> +				return false;
> >> +			break;
> >> +		}
> >> +
> >> +		/* Advance @tail_lpos to the next data block. */
> >> +		tail_lpos = blk_lpos->next;
> >> +	}
> >> +
> >> +	*lpos_out = tail_lpos;
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +/*
> >> + * Advance the data ring tail to at least @lpos. This function puts all
> >> + * descriptors into the reusable state if the tail will be pushed beyond
> >> + * their associated data block.
> >> + */
> >> +static bool data_push_tail(struct printk_ringbuffer *rb,
> >> +			   struct prb_data_ring *data_ring,
> >> +			   unsigned long lpos)
> >> +{
> >> +	unsigned long tail_lpos;
> >> +	unsigned long next_lpos;
> >> +
> >> +	/* If @lpos is not valid, there is nothing to do. */
> >> +	if (lpos == INVALID_LPOS)
> >> +		return true;
> >> +
> >> +	tail_lpos = atomic_long_read(&data_ring->tail_lpos);
> >> +
> >> +	do {
> >> +		/* If @lpos is no longer valid, there is nothing to do. */
> >> +		if (lpos - tail_lpos >= DATA_SIZE(data_ring))
> >> +			break;
> >> +
> >> +		/*
> >> +		 * Make all descriptors reusable that are associated with
> >> +		 * data blocks before @lpos.
> >> +		 */
> >> +		if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
> >> +					&next_lpos)) {
> >> +			/*
> >> +			 * data_make_reusable() performed state loads. Make
> >> +			 * sure they are loaded before reloading the tail lpos
> >> +			 * in order to see a new tail in the case that the
> >> +			 * descriptor has been recycled. This pairs with
> >> +			 * desc_reserve:A.
> >> +			 */
> >> +			smp_rmb(); /* LMM(data_push_tail:A) */
> >> +
> >> +			/*
> >> +			 * Reload the tail lpos.
> >> +			 *
> >> +			 * Memory barrier involvement:
> >> +			 *
> >> +			 * No possibility of missing a recycled descriptor.
> >> +			 * If data_make_reusable:A reads from desc_reserve:B,
> >> +			 * then data_push_tail:B reads from desc_push_tail:A.
> >> +			 *
> >> +			 * Relies on:
> >> +			 *
> >> +			 * MB from desc_push_tail:A to desc_reserve:B
> >> +			 *    matching
> >> +			 * RMB from data_make_reusable:A to data_push_tail:B
> >> +			 */
> >> +			next_lpos = atomic_long_read(&data_ring->tail_lpos
> >> +						); /* LMM(data_push_tail:B) */
> >> +			if (next_lpos == tail_lpos)
> >> +				return false;
> >> +
> >> +			/* Another task pushed the tail. Try again. */
> >> +			tail_lpos = next_lpos;
> >> +		}
> >> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->tail_lpos,
> >> +			&tail_lpos, next_lpos)); /* can be relaxed? */
> >
> > IMHO, we need smp_wmb() here so that others see the updated
> > data_ring->tail_lpos before this thread allocates the space
> > by pushing head_pos.
> >
> > It would be paired with a read barrier in data_alloc() between
> > reading head_lpos and tail_lpos, see below.
> 
> data_push_tail() is the only function that concerns itself with
> @tail_lpos. Its cmpxchg-loop will prevent any unintended consequences.
> And it uses the memory barrier pair data_push_tail:A/desc_reserve:A to
> make sure that @tail_lpos reloads will successfully identify a changed
> @tail_lpos due to descriptor recycling (which is the only reason that
> @tail_lpos changes).
> 
> Why is it a problem if the movement of @head_lpos is seen before the
> movement of @tail_lpos? Please explain.

This was again motivated by the idea that cmpxchg_relaxed() is weak
and that it would be safer to also synchronize the other variables.

OK, "tail_lpos" and "head_lpos" are closely related. The question is
how they are synchronized.

Hmm, there is the read barrier in LMM(data_push_tail:A) that probably
solves many problems. But what about the following scenario:

CPU0				  CPU1

data_alloc()
  begin_lpos = dr->head_lpos
				  data_alloc()
				    begin_lpos = dr->head_lpos
				    data_push_tail()
				      lpos = dr->tail_lpos
				      id = blk->id
				      data_make_reusable()
				      next_lpos = ...
				      cmpxchg(dr->tail_lpos, next_lpos)
				    cmpxchg(dr->head_lpos)

				    blk->id = id;

  data_push_tail()
    lpos = dr->tail_lpos
    # read old tail_lpos because of missing smp_rmb() and smp_wmb()
    id = blk->id
    # read new id because the CPU sees its new state
    data_make_reusable()
    # fail because id points to the newly allocated block that
    # is still in reserved state [*]
    smp_rmb()
    next_lpos = dr->tail_lpos
    # reading still outdated tail_lpos because there is no smp_wmb()
    # between updating tail_lpos and head_lpos

BANG:

    data_push_tail() would wrongly return false
    => data_alloc() would fail

This won't happen if there was the proposed smp_wmb() at this
location.
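The pairing I am proposing can be written as a litmus test in the same
style John used earlier in the thread. This is only a sketch of the
idea: @tail and @head are simplified stand-ins for dr->tail_lpos and
dr->head_lpos, and the barrier placements are the proposal, not the
current code.

```
C tail-head-publication

{}

P0(int *tail, int *head)
{
	WRITE_ONCE(*tail, 1);	/* cmpxchg(dr->tail_lpos, ...) */
	smp_wmb();		/* the proposed write barrier */
	WRITE_ONCE(*head, 1);	/* cmpxchg(dr->head_lpos, ...) */
}

P1(int *tail, int *head)
{
	int tmp_head;
	int tmp_tail;

	tmp_head = READ_ONCE(*head);
	smp_rmb();		/* the proposed pairing read barrier */
	tmp_tail = READ_ONCE(*tail);
}

exists (1:tmp_head=1 /\ 1:tmp_tail=0)
```

With the smp_wmb()/smp_rmb() pair in place, the "new head but old
tail" outcome in the exists clause is forbidden; without the
smp_wmb() it is reachable.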


[*] Another problem would be when data_make_reusable() sees the new
    data already in the committed state. It would make the fresh new
    data reusable.

    I mean the following:

CPU0				CPU1

data_alloc()
  begin_lpos = dr->head_lpos
  data_push_tail()
    lpos = dr->tail_lpos
				prb_reserve()
				  # reserve the location of current
				  # dr->tail_lpos
				prb_commit()

    id = blk->id
    # read id for the freshly written data on CPU1
    # and happily make them reusable
    data_make_reusable()


=> We should add a check into data_make_reusable() that
   we are invalidating really the descriptor pointing to
   the given lpos and not a freshly reused one!


> >> +
> >> +	return true;
> >> +}
> >> +
> >> +/*
> >> + * Advance the desc ring tail. This function advances the tail by one
> >> + * descriptor, thus invalidating the oldest descriptor. Before advancing
> >> + * the tail, the tail descriptor is made reusable and all data blocks up to
> >> + * and including the descriptor's data block are invalidated (i.e. the data
> >> + * ring tail is pushed past the data block of the descriptor being made
> >> + * reusable).
> >> + */
> >> +static bool desc_push_tail(struct printk_ringbuffer *rb,
> >> +			   unsigned long tail_id)
> >> +{
> >> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> +	enum desc_state d_state;
> >> +	struct prb_desc desc;
> >> +
> >> +	d_state = desc_read(desc_ring, tail_id, &desc);
> >> +
> >> +	switch (d_state) {
> >> +	case desc_miss:
> >> +		/*
> >> +		 * If the ID is exactly 1 wrap behind the expected, it is
> >> +		 * in the process of being reserved by another writer and
> >> +		 * must be considered reserved.
> >> +		 */
> >> +		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
> >> +		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
> >> +			return false;
> >> +		}
> >> +		return true;
> >> +	case desc_reserved:
> >> +		return false;
> >> +	case desc_committed:
> >> +		desc_make_reusable(desc_ring, tail_id);
> >> +		break;
> >> +	case desc_reusable:
> >> +		break;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Data blocks must be invalidated before their associated
> >> +	 * descriptor can be made available for recycling. Invalidating
> >> +	 * them later is not possible because there is no way to trust
> >> +	 * data blocks once their associated descriptor is gone.
> >> +	 */
> >> +
> >> +	if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
> >> +		return false;
> >> +	if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
> >> +		return false;
> >> +
> >> +	/* The data ring tail(s) were pushed: LMM(desc_push_tail:A) */
> >> +
> >> +	/*
> >> +	 * Check the next descriptor after @tail_id before pushing the tail to
> >> +	 * it because the tail must always be in a committed or reusable
> >> +	 * state. The implementation of prb_first_seq() relies on this.
> >> +	 *
> >> +	 * A successful read implies that the next descriptor is less than or
> >> +	 * equal to @head_id so there is no risk of pushing the tail past the
> >> +	 * head.
> >> +	 */
> >> +	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1),
> >> +			    &desc); /* LMM(desc_push_tail:B) */
> >> +	if (d_state == desc_committed || d_state == desc_reusable) {
> >> +		atomic_long_cmpxchg_relaxed(&desc_ring->tail_id, tail_id,
> >> +			DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:C) */
> >
> > IMHO, we need smp_wmb() here so that everyone see updated
> > desc_ring->tail_id before we push the head as well.
> >
> > It would pair with read barrier in desc_reserve() between reading
> > tail_id and head_id.
> 
> Good catch! This secures probably the most critical point in your
> design: when desc_reserve() recognizes that it needs to push the
> descriptor tail.

Sigh, I moved into another mode. I wonder whether we need more
full smp_mb() barriers.

The tail might be pushed by one CPU and the head moved on another CPU.
Do we need smp_mb() before moving head instead?

> >> +	} else {
> >> +		/*
> >> +		 * Guarantee the last state load from desc_read() is before
> >> +		 * reloading @tail_id in order to see a new tail in the case
> >> +		 * that the descriptor has been recycled. This pairs with
> >> +		 * desc_reserve:A.
> >> +		 */
> >> +		smp_rmb(); /* LMM(desc_push_tail:D) */
> >> +
> >> +		/*
> >> +		 * Re-check the tail ID. The descriptor following @tail_id is
> >> +		 * not in an allowed tail state. But if the tail has since
> >> +		 * been moved by another task, then it does not matter.
> >> +		 *
> >> +		 * Memory barrier involvement:
> >> +		 *
> >> +		 * No possibility of missing a pushed tail.
> >> +		 * If desc_push_tail:B reads from desc_reserve:B, then
> >> +		 * desc_push_tail:E reads from desc_push_tail:C.
> >> +		 *
> >> +		 * Relies on:
> >> +		 *
> >> +		 * MB from desc_push_tail:C to desc_reserve:B
> >> +		 *    matching
> >> +		 * RMB from desc_push_tail:B to desc_push_tail:E
> >> +		 */
> >> +		if (atomic_long_read(&desc_ring->tail_id) ==
> >> +					tail_id) { /* LMM(desc_push_tail:E) */
> >> +			return false;
> >> +		}
> >> +	}
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
> >> +static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
> >> +{
> >> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
> >> +	unsigned long prev_state_val;
> >> +	unsigned long id_prev_wrap;
> >> +	struct prb_desc *desc;
> >> +	unsigned long head_id;
> >> +	unsigned long id;
> >> +
> >> +	head_id = atomic_long_read(&desc_ring->head_id);
> >> +
> >> +	do {
> >> +		desc = to_desc(desc_ring, head_id);
> >> +
> >> +		id = DESC_ID(head_id + 1);
> >> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
> >
> > IMHO, we need smp_rmb() here to to guarantee reading head_id before
> > desc_ring->tail_id.
> >
> > It would pair with write barrier in desc_push_tail() after updating
> > tail_id, see above.
> 
> Ack. Critical.
> 
> >> +
> >> +		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
> >> +			/*
> >> +			 * Make space for the new descriptor by
> >> +			 * advancing the tail.
> >> +			 */
> >> +			if (!desc_push_tail(rb, id_prev_wrap))
> >> +				return false;
> >> +		}

So, I wonder whether we actually need smp_mb() already here.
It would make sure that all CPUs see the updated tail_id before
head_id is updated. They both might be updated on different CPUs.

> >> +	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
> >> +						  &head_id, id));
> >> +
> >> +	/*
> >> +	 * Guarantee any data ring tail changes are stored before recycling
> >> +	 * the descriptor. A full memory barrier is needed since another
> >> +	 * task may have pushed the data ring tails. This pairs with
> >> +	 * data_push_tail:A.
> >> +	 *
> >> +	 * Guarantee a new tail ID is stored before recycling the descriptor.
> >> +	 * A full memory barrier is needed since another task may have pushed
> >> +	 * the tail ID. This pairs with desc_push_tail:D and prb_first_seq:C.
> >> +	 */
> >> +	smp_mb(); /* LMM(desc_reserve:A) */
> >
> > I am a bit confused by the full barrier here. The description is not
> > clear. All three tags (data_push_tail:A, desc_push_tail:D and
> > prb_first_seq:C) refer to read barriers. This would suggest that a
> > write barrier would be enough here.
> 
> The above comment section states twice why a full memory barrier is
> needed: those writes may not have come from this task. We are not only
> ordering the visible writes that this task performed, we are also
> ordering the visible writes that this task has observed. Here is a
> litmus test demonstrating this:
> 
> C full-mb-test
> 
> {}
> 
> P0(int *x, int *y)
> {
> 	WRITE_ONCE(*x, 1);
> }
> 
> P1(int *x, int *y)
> {
> 	int tmp_x;
> 
> 	tmp_x = READ_ONCE(*x);
> 	if (tmp_x) {
> 		smp_mb();
> 		WRITE_ONCE(*y, 1);
> 	}
> }
> 
> P2(int *x, int *y)
> {
> 	int tmp_x;
> 	int tmp_y;
> 
> 	tmp_y = READ_ONCE(*y);
> 	smp_rmb();
> 	tmp_x = READ_ONCE(*x);
> }
> 
> exists (2:tmp_x=0 /\ 2:tmp_y=1)

Thanks a lot for this Litmus test. I have read several articles about
barriers and the memory model. But I forgot everything that I did not
use in practice.

I still have to shake my head about it.
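For reference, litmus tests in this format can be checked with the
herd7 tool together with the kernel memory model from
tools/memory-model in the kernel tree (assuming herd7 is installed):

```
# Save the quoted test as full-mb-test.litmus, then from the
# kernel's tools/memory-model/ directory:
herd7 -conf linux-kernel.cfg full-mb-test.litmus
# herd7 reports whether the "exists" clause is reachable;
# "Never" means the bad outcome is forbidden by the model.
```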

Now, I came up with the idea that the full smp_mb() barrier should be
earlier (before head_id update). Then smp_wmb() might be enough here.
It would synchronize the writes to desc_ring->head_id and
desc->state_var. They both happen on the same CPU
by design.

Well, the full barrier smp_mb() might actually still be needed here
because of the paranoid prev_state_val check. It is a read and checks
against potential races with other CPUs.
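In other words, the idea is something like this (only a sketch of the
barrier placement in desc_reserve(), not a tested change):

```c
	do {
		/* ... */
		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
			if (!desc_push_tail(rb, id_prev_wrap))
				return false;
		}

		/*
		 * Proposed: full barrier before publishing the new
		 * head_id, so that everyone who sees the new head also
		 * sees the pushed tail_id. It would pair with a read
		 * barrier between the head_id and tail_id reads above.
		 */
		smp_mb();
	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
						  &head_id, id));

	/* With the full barrier above, smp_wmb() might suffice here. */
	smp_wmb();
```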


> As mentioned above, I would put the smp_mb() before updating the
> @tail_id. That would pair with this smp_mb() and avoid the false
> positive on the @state_var check.

I am not completely sure what you mean. Feel free to use your best
judgement in the next version of the patch. It seems that a few more
barriers are needed and it is getting complicated to discuss
changes based on other changes without seeing the code ;-)

> >> +
> >> +	desc = to_desc(desc_ring, id);
> >> +
> >> +	/* If the descriptor has been recycled, verify the old state val. */
> >> +	prev_state_val = atomic_long_read(&desc->state_var);
> >> +	if (prev_state_val && prev_state_val != (id_prev_wrap |
> >> +						 DESC_COMMITTED_MASK |
> >> +						 DESC_REUSE_MASK)) {
> >> +		WARN_ON_ONCE(1);
> >> +		return false;
> >> +	}
> >> +
> >> +	/* Assign the descriptor a new ID and set its state to reserved. */
> >> +	if (!atomic_long_try_cmpxchg_relaxed(&desc->state_var,
> >> +			&prev_state_val, id | 0)) { /* LMM(desc_reserve:B) */
> >> +		WARN_ON_ONCE(1);
> >> +		return false;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Guarantee the new descriptor ID and state is stored before making
> >> +	 * any other changes. This pairs with desc_read:D.
> >> +	 */
> >> +	smp_wmb(); /* LMM(desc_reserve:C) */
> >> +
> >> +	/* Now data in @desc can be modified: LMM(desc_reserve:D) */
> >> +
> >> +	*id_out = id;
> >> +	return true;
> >> +}
> >> +
> >> +/*
> >> + * Allocate a new data block, invalidating the oldest data block(s)
> >> + * if necessary. This function also associates the data block with
> >> + * a specified descriptor.
> >> + */
> >> +static char *data_alloc(struct printk_ringbuffer *rb,
> >> +			struct prb_data_ring *data_ring, unsigned long size,
> >> +			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> >> +{
> >> +	struct prb_data_block *blk;
> >> +	unsigned long begin_lpos;
> >> +	unsigned long next_lpos;
> >> +
> >> +	if (!data_ring->data || size == 0) {
> >> +		/* Specify a data-less block. */
> >> +		blk_lpos->begin = INVALID_LPOS;
> >> +		blk_lpos->next = INVALID_LPOS;
> >> +		return NULL;
> >> +	}
> >> +
> >> +	size = to_blk_size(size);
> >> +
> >> +	begin_lpos = atomic_long_read(&data_ring->head_lpos);
> >> +
> >> +	do {
> >> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
> >> +
> >
> > IMHO, we need smp_rmb() here to read begin_lpos before we read
> > tail_lpos in data_push_tail()
> >
> > It would pair with a write barrier in data_push_tail() after
> > updating data_ring->tail_lpos.
> 
> Please explain why this pair is necessary. What is the scenario that
> needs to be avoided?

What about this:

CPU0				  CPU1

data_alloc()

  begin_lpos = dr->head_lpos
				  data_alloc() (long message)
				    begin_lpos = dr->head_lpos
				    data_push_tail()
				      lpos = dr->tail_lpos
				      id = blk->id
				      data_make_reusable()
				      next_lpos = ...
				      cmpxchg(dr->tail_lpos, next_lpos)
				    cmpxchg(dr->head_lpos)

  begin_lpos = dr->head_lpos
    # reading new head
    data_push_tail()
      lpos = dr->tail_lpos
      # read old tail_lpos because of missing smp_rmb() and smp_wmb()
      data_make_reusable()
      # success because already done;
      cmpxchg(dr->tail_lpos, next_lpos)
      # fail because it sees the updated tail_lpos

OK, we repeat the cycle with the right tail_lpos. So the only problem
is the extra cycle that might be prevented by the barrier.

Well, I still feel that the code will be much cleaner and more robust
when we do not rely on these things. In the current state, we rely on
the fact that data_make_reusable() is robust enough not to touch an
outdated/reused descriptor.

Anyway, there is a well-defined order in which the tail/head positions
are read and written. And it is just asking for problems when we do not
synchronize the reads and writes with barriers.


> >> +		if (!data_push_tail(rb, data_ring,
> >> +				    next_lpos - DATA_SIZE(data_ring))) {
> >> +			/* Failed to allocate, specify a data-less block. */
> >> +			blk_lpos->begin = INVALID_LPOS;
> >> +			blk_lpos->next = INVALID_LPOS;
> >> +			return NULL;
> >> +		}
> >> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
> >> +						  &begin_lpos, next_lpos));
> >> +
> >
> > IMHO, we need smp_wmb() here to guarantee that others see the updated
> > data_ring->head_lpos before we write anything into the data buffer.
> >
> > It would pair with a read barrier in data_make_reusable
> > between reading tail_lpos and blk->id in data_make_reusable().
> 
> Please explain why this pair is necessary. What is the scenario that
> needs to be avoided?

Uff, I would need to take a day off before I think about this.
But I want to send this mail today, see below.

So I will just write a question. This code looks very similar to
desc_reserve(). We are pushing tail/head and writing into the
allocated space. Why do we need fewer barriers here?

> >> +	blk = to_block(data_ring, begin_lpos);
> >> +	blk->id = id;
> >> +
> >> +	if (DATA_WRAPS(data_ring, begin_lpos) !=
> >> +	    DATA_WRAPS(data_ring, next_lpos)) {
> >> +		/* Wrapping data blocks store their data at the beginning. */
> >> +		blk = to_block(data_ring, 0);
> >> +		blk->id = id;
> >> +	}
> >> +
> >> +	blk_lpos->begin = begin_lpos;
> >> +	blk_lpos->next = next_lpos;
> >> +
> >> +	return &blk->data[0];
> >> +}

I hope that the mail makes some sense. I feel that I still do not
understand it enough. I am not sure if it would be better to
discuss the things more, or see an updated version, or get
opinion from another person.

Anyway, I am not sure how responsible I would be during
the following days. My both hands are aching (Carpal tunnel
syndrome or so) and it is getting worse. I have to visit
a doctor. I hope that I will be able to work with some
bandage but...

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: more barriers: Re: [PATCH 1/2] printk: add lockless buffer
  2020-03-04 15:08       ` Petr Mladek
@ 2020-03-13 10:13         ` John Ogness
  0 siblings, 0 replies; 58+ messages in thread
From: John Ogness @ 2020-03-13 10:13 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Peter Zijlstra, Sergey Senozhatsky, Sergey Senozhatsky,
	Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
	Thomas Gleixner, kexec, linux-kernel

Hi,

This is quite a long response. I can summarize here:

- Several new memory barrier pairs were identified.

- The placement of a memory barrier was incorrect.

There are now quite a few changes queued up for v2. I will try to get
this posted soon. Also, I believe we've now identified the cmpxchg's
that really need the full memory barriers. So I will be folding all the
memory barriers into cmpxchg() calls where applicable and include the
appropriate memory barrier documentation.

And now my response...

On 2020-03-04, Petr Mladek <pmladek@suse.com> wrote:
>>>> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
>>>> new file mode 100644
>>>> index 000000000000..796257f226ee
>>>> --- /dev/null
>>>> +++ b/kernel/printk/printk_ringbuffer.c
>>>> +/*
>>>> + * Take a given descriptor out of the committed state by attempting
>>>> + * the transition from committed to reusable. Either this task or some
>>>> + * other task will have been successful.
>>>> + */
>>>> +static void desc_make_reusable(struct prb_desc_ring *desc_ring,
>>>> +			       unsigned long id)
>>>> +{
>>>> +	struct prb_desc *desc = to_desc(desc_ring, id);
>>>> +	atomic_long_t *state_var = &desc->state_var;
>>>> +	unsigned long val_committed = id | DESC_COMMITTED_MASK;
>>>> +	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
>>>> +
>>>> +	atomic_long_cmpxchg_relaxed(state_var, val_committed,
>>>> val_reusable);
>>>
>>> IMHO, we should add smp_wmb() here to make sure that the reusable
>>> state is written before we shuffle the desc_ring->tail_id/head_id.
>>>
>>> It would pair with the read part of smp_mb() in desc_reserve()
>>> before the extra check if the descriptor is really in reusable state.
>> 
>> Yes. Now that we added the extra state checking in desc_reserve(),
>> this ordering has become important.
>> 
>> However, for this case I would prefer to instead place a full memory
>> barrier immediately before @tail_id is incremented (in
>> desc_push_tail()). The tail-incrementing-task must have seen the
>> reusable state (even if it is not the one that set it) and an
>> incremented @tail_id must be visible to the task recycling a
>> descriptor.
>
> I agree that this is exactly the place where the full barrier will be
> needed. This write can happen on any CPU and the write depending on
> this value might be done on another CPU.
>
> Also I agree that desc_push_tail() looks like the right place
> for the full barrier because some actions there are done only
> when the descriptor is invalidated.
>
> I just wonder if it should be even before data_push_tail()
> calls. It will make sure that everyone sees the reusable state
> before we move the data_ring borders.

You are correct. The reader is only ordering data reads against the
state of the _descriptor that is being read_. The reader may not yet see
that its descriptor has transitioned to reusable while a writer may have
already recycled the data block (associated with a _different_
descriptor) and started writing something new.

The problem is a missing ordering between setting the descriptor to
reusable and any possibility of data block recycling (e.g. the data tail
is pushed). Inserting a full memory barrier after setting the state to
reusable and before pushing the data tail will fix that. Then if the
reader reads newer data, it must see that its descriptor state is no
longer committed.

Changing the cmpxchg_relaxed() in data_push_tail() to cmpxchg() will add
the needed full memory barrier. I felt uneasy about making that
cmpxchg() relaxed but couldn't prove why. Thanks for seeing it!
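For reference (as I read Documentation/atomic_t.txt), the difference
between the two variants looks like this; note that the conditional
ops are only guaranteed to be fully ordered when they succeed:

```c
	/* No ordering guarantees beyond the atomic RMW itself. */
	atomic_long_cmpxchg_relaxed(&data_ring->tail_lpos, old, new);

	/*
	 * Fully ordered: acts like an smp_mb() before and after the
	 * RMW, but only when the exchange succeeds. A failed
	 * cmpxchg() provides no ordering guarantees.
	 */
	atomic_long_cmpxchg(&data_ring->tail_lpos, old, new);
```

That failure caveat may be why the explicit smp_rmb() on the
tail-reload failure path in data_push_tail() still has to stay.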

>>>> +}
>>>> +
>>>> +/*
>>>> + * For a given data ring (text or dict) and its current tail lpos:
>>>> + * for each data block up until @lpos, make the associated descriptor
>>>> + * reusable.
>>>> + *
>>>> + * If there is any problem making the associated descriptor reusable,
>>>> + * either the descriptor has not yet been committed or another writer
>>>> + * task has already pushed the tail lpos past the problematic data
>>>> + * block. Regardless, on error the caller can re-load the tail lpos
>>>> + * to determine the situation.
>>>> + */
>>>> +static bool data_make_reusable(struct printk_ringbuffer *rb,
>>>> +			       struct prb_data_ring *data_ring,
>>>> +			       unsigned long tail_lpos, unsigned long lpos,
>>>> +			       unsigned long *lpos_out)
>>>> +{
>>>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>>>> +	struct prb_data_blk_lpos *blk_lpos;
>>>> +	struct prb_data_block *blk;
>>>> +	enum desc_state d_state;
>>>> +	struct prb_desc desc;
>>>> +	unsigned long id;
>>>> +
>>>> +	/*
>>>> +	 * Using the provided @data_ring, point @blk_lpos to the correct
>>>> +	 * blk_lpos within the local copy of the descriptor.
>>>> +	 */
>>>> +	if (data_ring == &rb->text_data_ring)
>>>> +		blk_lpos = &desc.text_blk_lpos;
>>>> +	else
>>>> +		blk_lpos = &desc.dict_blk_lpos;
>>>> +
>>>> +	/* Loop until @tail_lpos has advanced to or beyond @lpos. */
>>>> +	while ((lpos - tail_lpos) - 1 < DATA_SIZE(data_ring)) {
>>>> +		blk = to_block(data_ring, tail_lpos);
>>>> +		id = READ_ONCE(blk->id);
>
> This brings the question whether the smp_rmb() would make sense here
> to make sure that desc_read() sees the descriptor state that allowed
> allocating this space.
>
> It would pair with smp_wmb() in desc_reserve() right after
> setting desc->state_var to the newly reserved descriptor id.
> It is the barrier that allows modifying the reserved space,
> including writing the reserved id into the later reserved blk->id.
>
> Note that desc_read() does not have smp_rmb() before the first
> read of the state_var. It might theoretically see an outdated
> value.

The descriptor read in desc_read() has an address dependency on the @id
argument, and thus an address dependency on the READ_ONCE(blk->id)
above. With that we have an implicit smp_rmb(). (The CPU cannot load the
descriptor without first loading the index for that descriptor.)
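The implicit ordering described above looks roughly like this
(paraphrased from data_make_reusable()/desc_read(), not verbatim
code):

```c
	id = READ_ONCE(blk->id);	/* load the descriptor index */

	/*
	 * The address of @desc is computed from @id, so the CPU cannot
	 * load desc->state_var before the blk->id load has completed:
	 * an address dependency, i.e. an implicit read ordering
	 * equivalent to having an smp_rmb() between the two loads.
	 */
	desc = to_desc(desc_ring, id);
	state_val = atomic_long_read(&desc->state_var);
```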

However, this deserves some comments describing the ordering being relied upon. (An
explicit smp_rmb() will probably be added here anyway for other
reasons. See below.)

>>>> +
>>>> +		d_state = desc_read(desc_ring, id,
>>>> +				    &desc); /* LMM(data_make_reusable:A) */
>>>> +
>>>> +		switch (d_state) {
>>>> +		case desc_miss:
>>>> +			return false;
>>>> +		case desc_reserved:
>>>> +			return false;
>>>> +		case desc_committed:
>>>> +			/*
>>>> +			 * This data block is invalid if the descriptor
>>>> +			 * does not point back to it.
>>>> +			 */
>>>> +			if (blk_lpos->begin != tail_lpos)
>>>> +				return false;
>>>> +			desc_make_reusable(desc_ring, id);
>>>> +			break;
>>>> +		case desc_reusable:
>>>> +			/*
>>>> +			 * This data block is invalid if the descriptor
>>>> +			 * does not point back to it.
>>>> +			 */
>>>> +			if (blk_lpos->begin != tail_lpos)
>>>> +				return false;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		/* Advance @tail_lpos to the next data block. */
>>>> +		tail_lpos = blk_lpos->next;
>>>> +	}
>>>> +
>>>> +	*lpos_out = tail_lpos;
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Advance the data ring tail to at least @lpos. This function puts all
>>>> + * descriptors into the reusable state if the tail will be pushed beyond
>>>> + * their associated data block.
>>>> + */
>>>> +static bool data_push_tail(struct printk_ringbuffer *rb,
>>>> +			   struct prb_data_ring *data_ring,
>>>> +			   unsigned long lpos)
>>>> +{
>>>> +	unsigned long tail_lpos;
>>>> +	unsigned long next_lpos;
>>>> +
>>>> +	/* If @lpos is not valid, there is nothing to do. */
>>>> +	if (lpos == INVALID_LPOS)
>>>> +		return true;
>>>> +
>>>> +	tail_lpos = atomic_long_read(&data_ring->tail_lpos);
>>>> +
>>>> +	do {
>>>> +		/* If @lpos is no longer valid, there is nothing to do. */
>>>> +		if (lpos - tail_lpos >= DATA_SIZE(data_ring))
>>>> +			break;
>>>> +
>>>> +		/*
>>>> +		 * Make all descriptors reusable that are associated with
>>>> +		 * data blocks before @lpos.
>>>> +		 */
>>>> +		if (!data_make_reusable(rb, data_ring, tail_lpos, lpos,
>>>> +					&next_lpos)) {
>>>> +			/*
>>>> +			 * data_make_reusable() performed state loads. Make
>>>> +			 * sure they are loaded before reloading the tail lpos
>>>> +			 * in order to see a new tail in the case that the
>>>> +			 * descriptor has been recycled. This pairs with
>>>> +			 * desc_reserve:A.
>>>> +			 */
>>>> +			smp_rmb(); /* LMM(data_push_tail:A) */
>>>> +
>>>> +			/*
>>>> +			 * Reload the tail lpos.
>>>> +			 *
>>>> +			 * Memory barrier involvement:
>>>> +			 *
>>>> +			 * No possibility of missing a recycled descriptor.
>>>> +			 * If data_make_reusable:A reads from desc_reserve:B,
>>>> +			 * then data_push_tail:B reads from desc_push_tail:A.
>>>> +			 *
>>>> +			 * Relies on:
>>>> +			 *
>>>> +			 * MB from desc_push_tail:A to desc_reserve:B
>>>> +			 *    matching
>>>> +			 * RMB from data_make_reusable:A to data_push_tail:B
>>>> +			 */
>>>> +			next_lpos = atomic_long_read(&data_ring->tail_lpos
>>>> +						); /* LMM(data_push_tail:B) */
>>>> +			if (next_lpos == tail_lpos)
>>>> +				return false;
>>>> +
>>>> +			/* Another task pushed the tail. Try again. */
>>>> +			tail_lpos = next_lpos;
>>>> +		}
>>>> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->tail_lpos,
>>>> +			&tail_lpos, next_lpos)); /* can be relaxed? */
>>>
>>> IMHO, we need smp_wmb() here so that others see the updated
>>> data_ring->tail_lpos before this thread allocates the space
>>> by pushing head_pos.
>>>
>>> It would be paired with a read barrier in data_alloc() between
>>> reading head_lpos and tail_lpos, see below.
>> 
>> data_push_tail() is the only function that concerns itself with
>> @tail_lpos. Its cmpxchg-loop will prevent any unintended consequences.
>> And it uses the memory barrier pair data_push_tail:A/desc_reserve:A to
>> make sure that @tail_lpos reloads will successfully identify a changed
>> @tail_lpos due to descriptor recycling (which is the only reason that
>> @tail_lpos changes).
>> 
>> Why is it a problem if the movement of @head_lpos is seen before the
>> movement of @tail_lpos? Please explain.
>
> This was again motivated by the idea that cmpxchg_relaxed() is weak
> and that it would be safer to also synchronize the other variables.
>
> OK, "tail_lpos" and "head_lpos" are closely related. The question is
> how they are synchronized.
>
> Hmm, there is the read barrier in LMM(data_push_tail:A) that probably
> solves many problems. But what about the following scenario:
>
> CPU0				  CPU1
>
> data_alloc()
>   begin_lpos = dr->head_lpos
> 				  data_alloc()
> 				    begin_lpos = dr->head_lpos
> 				    data_push_tail()
> 				      lpos = dr->tail_lpos
> 				      id = blk->id
> 				      data_make_reusable()
> 				      next_lpos = ...
> 				      cmpxchg(dr->tail_lpos, next_lpos)
> 				    cmpxchg(dr->head_lpos)
>
> 				    blk->id = id;
>
>   data_push_tail()
>     lpos = dr->tail_lpos
>     # read old tail_lpos because of missing smp_rmb() and smp_wmb()
>     id = blk->id
>     # read new id because the CPU sees its new state
>     data_make_reusable()
>     # fail because id points to the newly allocated block that
>     # is still in reserved state [*]
>     smp_rmb()
>     next_lpos = dr->tail_lpos
>     # reading still outdated tail_lpos because there is no smp_wmb()
>     # between updating tail_lpos and head_lpos
>
> BANG:
>
>     data_push_tail() would wrongly return false
>     => data_alloc() would fail
>
> This won't happen if there was the proposed smp_wmb() at this
> location.

Changing the @tail_lpos update to a full barrier cmpxchg() (as mentioned
above) will solve this problem.

> [*] Another problem would be when data_make_reusable() sees the new
>     data already in the committed state. It would make the fresh new
>     data reusable.
>
>     I mean the following:
>
> CPU0				CPU1
>
> data_alloc()
>   begin_lpos = dr->head_lpos
>   data_push_tail()
>     lpos = dr->tail_lpos
> 				prb_reserve()
> 				  # reserve the location of current
> 				  # dr->tail_lpos
> 				prb_commit()
>
>     id = blk->id
>     # read id for the freshly written data on CPU1
>     # and happily make them reusable
>     data_make_reusable()

Ouch.

> => We should add a check into data_make_reusable() that
>    we are invalidating really the descriptor pointing to
>    the given lpos and not a freshly reused one!

The issue is that data_make_reusable() is not seeing that the tail has
moved.

What about if data_make_reusable() does something like:

    id = READ_ONCE(blk->id);
    smp_rmb();
    ... code to check if tail has moved beyond @tail_lpos ...
    d_state = desc_read()
    
The smp_rmb() would pair with the full barrier cmpxchg() of pushing the
data tail (to be added, as mentioned already). So if a new ID in the
block is seen then a new tail must also be visible.
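A sketch of what that could look like inside the data_make_reusable()
loop. The tail re-check here is hypothetical (the "..." above is left
to the real patch); it only illustrates the intended pairing:

```c
	blk = to_block(data_ring, tail_lpos);
	id = READ_ONCE(blk->id);

	/* Pairs with the full barrier of the data tail push. */
	smp_rmb();

	/*
	 * Hypothetical re-check: if the tail has moved past
	 * @tail_lpos, @id belongs to a freshly reserved data block,
	 * not to the block this iteration meant to invalidate.
	 */
	if (atomic_long_read(&data_ring->tail_lpos) != tail_lpos)
		return false;

	d_state = desc_read(desc_ring, id, &desc);
```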

>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Advance the desc ring tail. This function advances the tail by one
>>>> + * descriptor, thus invalidating the oldest descriptor. Before advancing
>>>> + * the tail, the tail descriptor is made reusable and all data blocks up to
>>>> + * and including the descriptor's data block are invalidated (i.e. the data
>>>> + * ring tail is pushed past the data block of the descriptor being made
>>>> + * reusable).
>>>> + */
>>>> +static bool desc_push_tail(struct printk_ringbuffer *rb,
>>>> +			   unsigned long tail_id)
>>>> +{
>>>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>>>> +	enum desc_state d_state;
>>>> +	struct prb_desc desc;
>>>> +
>>>> +	d_state = desc_read(desc_ring, tail_id, &desc);
>>>> +
>>>> +	switch (d_state) {
>>>> +	case desc_miss:
>>>> +		/*
>>>> +		 * If the ID is exactly 1 wrap behind the expected, it is
>>>> +		 * in the process of being reserved by another writer and
>>>> +		 * must be considered reserved.
>>>> +		 */
>>>> +		if (DESC_ID(atomic_long_read(&desc.state_var)) ==
>>>> +		    DESC_ID_PREV_WRAP(desc_ring, tail_id)) {
>>>> +			return false;
>>>> +		}
>>>> +		return true;
>>>> +	case desc_reserved:
>>>> +		return false;
>>>> +	case desc_committed:
>>>> +		desc_make_reusable(desc_ring, tail_id);
>>>> +		break;
>>>> +	case desc_reusable:
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Data blocks must be invalidated before their associated
>>>> +	 * descriptor can be made available for recycling. Invalidating
>>>> +	 * them later is not possible because there is no way to trust
>>>> +	 * data blocks once their associated descriptor is gone.
>>>> +	 */
>>>> +
>>>> +	if (!data_push_tail(rb, &rb->text_data_ring, desc.text_blk_lpos.next))
>>>> +		return false;
>>>> +	if (!data_push_tail(rb, &rb->dict_data_ring, desc.dict_blk_lpos.next))
>>>> +		return false;
>>>> +
>>>> +	/* The data ring tail(s) were pushed: LMM(desc_push_tail:A) */
>>>> +
>>>> +	/*
>>>> +	 * Check the next descriptor after @tail_id before pushing the tail to
>>>> +	 * it because the tail must always be in a committed or reusable
>>>> +	 * state. The implementation of prb_first_seq() relies on this.
>>>> +	 *
>>>> +	 * A successful read implies that the next descriptor is less than or
>>>> +	 * equal to @head_id so there is no risk of pushing the tail past the
>>>> +	 * head.
>>>> +	 */
>>>> +	d_state = desc_read(desc_ring, DESC_ID(tail_id + 1),
>>>> +			    &desc); /* LMM(desc_push_tail:B) */
>>>> +	if (d_state == desc_committed || d_state == desc_reusable) {
>>>> +		atomic_long_cmpxchg_relaxed(&desc_ring->tail_id, tail_id,
>>>> +			DESC_ID(tail_id + 1)); /* LMM(desc_push_tail:C) */
>>>
>>> IMHO, we need smp_wmb() here so that everyone see updated
>>> desc_ring->tail_id before we push the head as well.
>>>
>>> It would pair with read barrier in desc_reserve() between reading
>>> tail_id and head_id.
>> 
>> Good catch! This secures probably the most critical point in your
>> design: when desc_reserve() recognizes that it needs to push the
>> descriptor tail.
>
> Sigh, I moved into another mode. I wonder whether we need more
> full smp_mb() barriers.
>
> The tail might be pushed by one CPU and the head moved on another CPU.

Correct. You just made me realize that I wasn't using enough tasks in my
litmus tests here. :-/

> Do we need smp_mb() before moving head instead?

Yes. I wrote a litmus test to verify it (below). It includes _3_
identical tasks that are doing the critical tail/head checking and
pushing from desc_reserve(). 3 tasks are needed in order to establish
the scenario that one CPU "relax pushed" the head and another CPU "relax
pushed" the tail. The third CPU then has the danger that it sees the
head pushed, but not the tail (i.e. the head has wrapped over the
tail). And in that case it will skip the tail push and successfully push
the head.

In the litmus test I used variable names similar to the actual code. I
think it makes the litmus test harder to read, but probably easier to
verify that it is representing the code.

(There is probably a way of specifying functions, or of specifying that
a task should run in parallel. But I don't know it. So you will have to
excuse the copy/pasting in the litmus test. Sorry.)

>>>> +	} else {
>>>> +		/*
>>>> +		 * Guarantee the last state load from desc_read() is before
>>>> +		 * reloading @tail_id in order to see a new tail in the case
>>>> +		 * that the descriptor has been recycled. This pairs with
>>>> +		 * desc_reserve:A.
>>>> +		 */
>>>> +		smp_rmb(); /* LMM(desc_push_tail:D) */
>>>> +
>>>> +		/*
>>>> +		 * Re-check the tail ID. The descriptor following @tail_id is
>>>> +		 * not in an allowed tail state. But if the tail has since
>>>> +		 * been moved by another task, then it does not matter.
>>>> +		 *
>>>> +		 * Memory barrier involvement:
>>>> +		 *
>>>> +		 * No possibility of missing a pushed tail.
>>>> +		 * If desc_push_tail:B reads from desc_reserve:B, then
>>>> +		 * desc_push_tail:E reads from desc_push_tail:C.
>>>> +		 *
>>>> +		 * Relies on:
>>>> +		 *
>>>> +		 * MB from desc_push_tail:C to desc_reserve:B
>>>> +		 *    matching
>>>> +		 * RMB from desc_push_tail:B to desc_push_tail:E
>>>> +		 */
>>>> +		if (atomic_long_read(&desc_ring->tail_id) ==
>>>> +					tail_id) { /* LMM(desc_push_tail:E) */
>>>> +			return false;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +/* Reserve a new descriptor, invalidating the oldest if necessary. */
>>>> +static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
>>>> +{
>>>> +	struct prb_desc_ring *desc_ring = &rb->desc_ring;
>>>> +	unsigned long prev_state_val;
>>>> +	unsigned long id_prev_wrap;
>>>> +	struct prb_desc *desc;
>>>> +	unsigned long head_id;
>>>> +	unsigned long id;
>>>> +
>>>> +	head_id = atomic_long_read(&desc_ring->head_id);
>>>> +
>>>> +	do {
>>>> +		desc = to_desc(desc_ring, head_id);
>>>> +
>>>> +		id = DESC_ID(head_id + 1);
>>>> +		id_prev_wrap = DESC_ID_PREV_WRAP(desc_ring, id);
>>>
>>> IMHO, we need smp_rmb() here to to guarantee reading head_id before
>>> desc_ring->tail_id.
>>>
>>> It would pair with write barrier in desc_push_tail() after updating
>>> tail_id, see above.
>> 
>> Ack. Critical.
>> 
>>>> +
>>>> +		if (id_prev_wrap == atomic_long_read(&desc_ring->tail_id)) {
>>>> +			/*
>>>> +			 * Make space for the new descriptor by
>>>> +			 * advancing the tail.
>>>> +			 */
>>>> +			if (!desc_push_tail(rb, id_prev_wrap))
>>>> +				return false;
>>>> +		}
>
> So, I wonder whether we actually need smp_mb() already here.
> It would make sure that all CPUs see the updated tail_id before
> head_id is updated. They both might be updated on different CPUs.

As written above, yes.

>>>> +	} while (!atomic_long_try_cmpxchg_relaxed(&desc_ring->head_id,
>>>> +						  &head_id, id));
>>>> +
>>>> +	/*
>>>> +	 * Guarantee any data ring tail changes are stored before recycling
>>>> +	 * the descriptor. A full memory barrier is needed since another
>>>> +	 * task may have pushed the data ring tails. This pairs with
>>>> +	 * data_push_tail:A.
>>>> +	 *
>>>> +	 * Guarantee a new tail ID is stored before recycling the descriptor.
>>>> +	 * A full memory barrier is needed since another task may have pushed
>>>> +	 * the tail ID. This pairs with desc_push_tail:D and prb_first_seq:C.
>>>> +	 */
>>>> +	smp_mb(); /* LMM(desc_reserve:A) */

> Now, I came up with the idea that the full smp_mb() barrier should be
> earlier (before head_id update).

As written above, yes.

> Then smp_wmb() might be enough here. It would synchronize write to
> desc_ring->head_id and desc->state_var. They both happen on the same
> CPU by design.

The smp_mb() here is not responsible for any ordering between storing a
new head ID and setting the new descriptor state. When we move the
smp_mb() up before the head ID update, it will suffice.

> Well, the full barrier smp_mb() might actually still be needed here
> because of the paranoid prev_state_val check. It is a read and checks
> against potential races with other CPUs.

The same applies here. The smp_mb() here is not responsible for any
ordering between reading a new head ID and reading the new descriptor
state. When we move the smp_mb() up before the head ID update, it will
suffice.

>>>> +
>>>> +	desc = to_desc(desc_ring, id);
>>>> +
>>>> +	/* If the descriptor has been recycled, verify the old state val. */
>>>> +	prev_state_val = atomic_long_read(&desc->state_var);
>>>> +	if (prev_state_val && prev_state_val != (id_prev_wrap |
>>>> +						 DESC_COMMITTED_MASK |
>>>> +						 DESC_REUSE_MASK)) {
>>>> +		WARN_ON_ONCE(1);
>>>> +		return false;
>>>> +	}
>>>> +
>>>> +	/* Assign the descriptor a new ID and set its state to reserved. */
>>>> +	if (!atomic_long_try_cmpxchg_relaxed(&desc->state_var,
>>>> +			&prev_state_val, id | 0)) { /* LMM(desc_reserve:B) */
>>>> +		WARN_ON_ONCE(1);
>>>> +		return false;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Guarantee the new descriptor ID and state is stored before making
>>>> +	 * any other changes. This pairs with desc_read:D.
>>>> +	 */
>>>> +	smp_wmb(); /* LMM(desc_reserve:C) */
>>>> +
>>>> +	/* Now data in @desc can be modified: LMM(desc_reserve:D) */
>>>> +
>>>> +	*id_out = id;
>>>> +	return true;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Allocate a new data block, invalidating the oldest data block(s)
>>>> + * if necessary. This function also associates the data block with
>>>> + * a specified descriptor.
>>>> + */
>>>> +static char *data_alloc(struct printk_ringbuffer *rb,
>>>> +			struct prb_data_ring *data_ring, unsigned long size,
>>>> +			struct prb_data_blk_lpos *blk_lpos, unsigned long id)
>>>> +{
>>>> +	struct prb_data_block *blk;
>>>> +	unsigned long begin_lpos;
>>>> +	unsigned long next_lpos;
>>>> +
>>>> +	if (!data_ring->data || size == 0) {
>>>> +		/* Specify a data-less block. */
>>>> +		blk_lpos->begin = INVALID_LPOS;
>>>> +		blk_lpos->next = INVALID_LPOS;
>>>> +		return NULL;
>>>> +	}
>>>> +
>>>> +	size = to_blk_size(size);
>>>> +
>>>> +	begin_lpos = atomic_long_read(&data_ring->head_lpos);
>>>> +
>>>> +	do {
>>>> +		next_lpos = get_next_lpos(data_ring, begin_lpos, size);
>>>> +
>>>
>>> IMHO, we need smp_rmb() here to read begin_lpos before we read
>>> tail_lpos in data_push_tail()
>>>
>>> It would pair with a write barrier in data_push_tail() after
>>> updating data_ring->tail_lpos.
>> 
>> Please explain why this pair is necessary. What is the scenario that
>> needs to be avoided?
>
> What about this:
>
> CPU0				  CPU1
>
> data_alloc()
>
>   begin_lpos = dr->head_lpos
> 				  data_alloc() (long message)
> 				    begin_lpos = dr->head_lpos
> 				    data_push_tail()
> 				      lpos = dr->tail_lpos
> 				      id = blk->id
> 				      data_make_reusable()
> 				      next_lpos = ...
> 				      cmpxchg(dr->tail_lpos, next_lpos)
> 				    cmpxchg(dr->head_lpos)
>
>   begin_lpos = dr->head_lpos
>     # reading new head
>     data_push_tail()
>       lpos = dr->tail_lpos
>       # read old tail_lpos because of missing smp_rmb() and wmb()
>       data_make_reusable()
>       # success because already done;
>       cmpxchg(dr->tail_lpos, next_lpos)
>       # fail because it sees the updated tail_lpos
>
> OK, we repeat the cycle with the right tail_lpos. So the only problem
> is the extra cycle that might be prevented by the barrier.
>
> Well, I still feel that the code will be much cleaner and more robust
> when we do not rely on these things. In the current state, we rely on
> the fact that data_make_reusable() is robust enough not to touch an
> outdated/reused descriptor.
>
> Anyway, there is a well-defined order in which the tail/head positions
> are read and written. And it is just asking for problems when we do
> not synchronize the reads and writes with barriers.

It would be a barrier that only optimizes a very particular case,
penalizing the main case. There are other reasons that the cmpxchg()
could fail and the loop repeat, even with an smp_rmb() here. And in most
cases, the cmpxchg() will not fail anyway.

It adds complexity by declaring yet another barrier pair. IMHO it is not
a "call for problems" to rely on cmpxchg() failing if the expected value
changed. I think it is more important to keep the barrier pairs to a
minimal set of _necessary_ barriers.

>>>> +		if (!data_push_tail(rb, data_ring,
>>>> +				    next_lpos - DATA_SIZE(data_ring))) {
>>>> +			/* Failed to allocate, specify a data-less block. */
>>>> +			blk_lpos->begin = INVALID_LPOS;
>>>> +			blk_lpos->next = INVALID_LPOS;
>>>> +			return NULL;
>>>> +		}
>>>> +	} while (!atomic_long_try_cmpxchg_relaxed(&data_ring->head_lpos,
>>>> +						  &begin_lpos, next_lpos));
>>>> +
>>>
>>> IMHO, we need smp_wmb() here to guarantee that others see the updated
>>> data_ring->head_lpos before we write anything into the data buffer.
>>>
>>> It would pair with a read barrier in data_make_reusable
>>> between reading tail_lpos and blk->id in data_make_reusable().
>> 
>> Please explain why this pair is necessary. What is the scenario that
>> needs to be avoided?
>
> This code looks very similar to desc_reserve(). We are pushing
> tail/head and writing into the allocated space. Why do we need fewer
> barriers here?

The memory barrier pairing in desc_reserve() is necessary to order
descriptor reading with descriptor tail changes. For data we do not need
such a synchronization because data validity is guaranteed by the
descriptor states, not the data tail.

Note that above I talked about changing the cmpxchg_relaxed() in
data_push_tail() to cmpxchg() to deal with a data validity issue that
you discovered. That probably covers your gut feeling that we need
something here.

>>>> +	blk = to_block(data_ring, begin_lpos);
>>>> +	blk->id = id;
>>>> +
>>>> +	if (DATA_WRAPS(data_ring, begin_lpos) !=
>>>> +	    DATA_WRAPS(data_ring, next_lpos)) {
>>>> +		/* Wrapping data blocks store their data at the beginning. */
>>>> +		blk = to_block(data_ring, 0);
>>>> +		blk->id = id;
>>>> +	}
>>>> +
>>>> +	blk_lpos->begin = begin_lpos;
>>>> +	blk_lpos->next = next_lpos;
>>>> +
>>>> +	return &blk->data[0];
>>>> +}
>
> Anyway, I am not sure how responsive I will be during
> the following days. Both my hands are aching (carpal tunnel
> syndrome or so) and it is getting worse. I have to visit
> a doctor. I hope that I will be able to work with some
> bandage but...

Please take care of yourself!

Here is the litmus test I talked about above, showing that the
smp_rmb() together with the smp_mb() before the head ID update does
indeed avoid the fail case.

------ begin desc-reserve.litmus ------
C desc-reserve

(*
 * Result: Never
 *
 * Make sure the head ID can never be pushed past the tail ID.
 *)

{
	dr_head_id = 0;
	dr_tail_id = 1;
}

P0(int *dr_head_id, int *dr_tail_id)
{
	int tail_id_next;
	int id_prev_wrap;
	int head_id;
	int tail_id;
	int r0;
	int r1;

	head_id = READ_ONCE(*dr_head_id);
	id_prev_wrap = head_id + 1;

	// Guarantee the head ID is read before reading the tail ID.
	smp_rmb();

	tail_id = READ_ONCE(*dr_tail_id);

	if (id_prev_wrap == tail_id) {
		// Make space for the new descriptor by advancing the tail.
		tail_id_next = tail_id + 1;
		r0 = cmpxchg_relaxed(dr_tail_id, tail_id, tail_id_next);
	}

	// Guarantee a new tail ID is stored before recycling the descriptor.
	smp_mb();

	r1 = cmpxchg_relaxed(dr_head_id, head_id, id_prev_wrap);
}

// identical to P0
P1(int *dr_head_id, int *dr_tail_id)
{
	int tail_id_next;
	int id_prev_wrap;
	int head_id;
	int tail_id;
	int r0;
	int r1;

	head_id = READ_ONCE(*dr_head_id);
	id_prev_wrap = head_id + 1;

	// Guarantee the head ID is read before reading the tail ID.
	smp_rmb();

	tail_id = READ_ONCE(*dr_tail_id);

	if (id_prev_wrap == tail_id) {
		// Make space for the new descriptor by advancing the tail.
		tail_id_next = tail_id + 1;
		r0 = cmpxchg_relaxed(dr_tail_id, tail_id, tail_id_next);
	}

	// Guarantee a new tail ID is stored before recycling the descriptor.
	smp_mb();

	r1 = cmpxchg_relaxed(dr_head_id, head_id, id_prev_wrap);
}

// identical to P0
P2(int *dr_head_id, int *dr_tail_id)
{
	int tail_id_next;
	int id_prev_wrap;
	int head_id;
	int tail_id;
	int r0;
	int r1;

	head_id = READ_ONCE(*dr_head_id);
	id_prev_wrap = head_id + 1;

	// Guarantee the head ID is read before reading the tail ID.
	smp_rmb();

	tail_id = READ_ONCE(*dr_tail_id);

	if (id_prev_wrap == tail_id) {
		// Make space for the new descriptor by advancing the tail.
		tail_id_next = tail_id + 1;
		r0 = cmpxchg_relaxed(dr_tail_id, tail_id, tail_id_next);
	}

	// Guarantee a new tail ID is stored before recycling the descriptor.
	smp_mb();

	r1 = cmpxchg_relaxed(dr_head_id, head_id, id_prev_wrap);
}

exists (dr_head_id=2 /\ dr_tail_id=2)
------ end desc-reserve.litmus ------

$ herd7 -conf linux-kernel.cfg desc-reserve.litmus 
Test desc-reserve Allowed
States 3
dr_head_id=1; dr_tail_id=2;
dr_head_id=2; dr_tail_id=3;
dr_head_id=3; dr_tail_id=4;
No
Witnesses
Positive: 0 Negative: 138
Condition exists (dr_head_id=2 /\ dr_tail_id=2)
Observation desc-reserve Never 0 138
Time desc-reserve 490.62
Hash=4198247b011ab3db1ac8ff48152bbb18


Note that if the smp_mb() is _not_ moved up (even with an added
smp_rmb() in desc_reserve()), the fail case will happen:


$ herd7 -conf linux-kernel.cfg desc-reserve-bad.litmus 
Test desc-reserve Allowed
States 4
dr_head_id=1; dr_tail_id=2;
dr_head_id=2; dr_tail_id=2;  <---- head overtakes tail!
dr_head_id=2; dr_tail_id=3;
dr_head_id=3; dr_tail_id=4;
Ok
Witnesses
Positive: 24 Negative: 162
Condition exists (dr_head_id=2 /\ dr_tail_id=2)
Observation desc-reserve Sometimes 24 162
Time desc-reserve 515.87
Hash=1e80a5d56c53a87355d8a34a850cb7f5

The smp_mb() happens too late. Moving it before pushing the head ID
fixes the problem.

John Ogness
