linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] LTTng relay buffer allocation, read, write
@ 2008-09-27 13:40 Mathieu Desnoyers
  2008-09-27 17:10 ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-27 13:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Martin Bligh, Peter Zijlstra, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder

As I told Martin, I was thinking about taking an axe and moving stuff around in
relay. Which I just did.

This patch reimplements relay with a linked list of pages. Provides read/write
wrappers which should be used to read or write from the buffers. It's the core
of a layered approach to the design requirements expressed by Martin and
discussed earlier.

It does not provide _any_ sort of locking on buffer data. Locking should be done
by the caller. Given that we might think of very lightweight locking schemes, it
makes sense to me that the underlying buffering infrastructure supports event
records larger than 1 page.

A cache saving 3 pointers is used to keep track of current page used for the
buffer for write, read and current subbuffer header pointer lookup. The offset
of each page within the buffer is saved in the page frame structure to permit
cache lookup without extra locking.

TODO : Currently, no splice file operations are implemented. Should come soon.
The idea is to splice the buffers directly into files or to the network.

This patch is released as early RFC. It builds, that's about it. Testing will
come when I implement the splice ops.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Martin Bligh <mbligh@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: prasad@linux.vnet.ibm.com
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: od@suse.com
CC: "Frank Ch. Eigler" <fche@redhat.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: hch@lst.de
CC: David Wilder <dwilder@us.ibm.com>
---
 include/linux/ltt-relay.h |  291 +++++++++++++++++++++++++++++
 ltt/ltt-relay-alloc.c     |  450 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 741 insertions(+)

Index: linux-2.6-lttng/ltt/ltt-relay-alloc.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/ltt/ltt-relay-alloc.c	2008-09-27 09:18:47.000000000 -0400
@@ -0,0 +1,450 @@
+/*
+ * Public API and common code for kernel->userspace relay file support.
+ *
+ * Copyright (C) 2002-2005 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp
+ * Copyright (C) 1999-2005 - Karim Yaghmour (karim@opersys.com)
+ * Copyright (C) 2008 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * Moved to kernel/relay.c by Paul Mundt, 2006.
+ * November 2006 - CPU hotplug support by Mathieu Desnoyers
+ * 	(mathieu.desnoyers@polymtl.ca)
+ *
+ * This file is released under the GPL.
+ */
+#include <linux/errno.h>
+#include <linux/stddef.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/ltt-relay.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/cpu.h>
+#include <linux/splice.h>
+
+/* list of open channels, for cpu hotplug */
+static DEFINE_MUTEX(relay_channels_mutex);
+static LIST_HEAD(relay_channels);
+
+/**
+ *	relay_alloc_buf - allocate a channel buffer
+ *	@buf: the buffer struct
+ *	@size: total size of the buffer
+ */
+static int relay_alloc_buf(struct rchan_buf *buf, size_t *size)
+{
+	unsigned int i, n_pages;
+	struct buf_page *buf_page, *n;
+
+	*size = PAGE_ALIGN(*size);
+	n_pages = *size >> PAGE_SHIFT;
+
+	for (i = 0; i < n_pages; i++) {
+		buf_page = (struct buf_page *)alloc_page(GFP_KERNEL);
+		if (unlikely(!buf_page))
+			goto depopulate;
+		list_add(&buf_page->list, &buf->pages);
+		buf_page->offset = i << PAGE_SHIFT;
+		buf_page->buf = buf;
+		memset(page_address(&buf_page->page), 0, PAGE_SIZE);
+		if (i == 0) {
+			buf->wpage = buf_page;
+			buf->hpage = buf_page;
+			buf->rpage = buf_page;
+		}
+	}
+	buf->page_count = n_pages;
+	return 0;
+
+depopulate:
+	list_for_each_entry_safe(buf_page, n, &buf->pages, list)
+		__free_page(&buf_page->page);
+	return -ENOMEM;
+}
+
+/**
+ *	relay_create_buf - allocate and initialize a channel buffer
+ *	@chan: the relay channel
+ *
+ *	Returns channel buffer if successful, %NULL otherwise.
+ */
+static struct rchan_buf *relay_create_buf(struct rchan *chan)
+{
+	int ret;
+	struct rchan_buf *buf = kzalloc(sizeof(struct rchan_buf), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	ret = relay_alloc_buf(buf, &chan->alloc_size);
+	if (ret)
+		goto free_buf;
+
+	buf->chan = chan;
+	kref_get(&buf->chan->kref);
+	return buf;
+
+free_buf:
+	kfree(buf);
+	return NULL;
+}
+
+/**
+ *	relay_destroy_channel - free the channel struct
+ *	@kref: target kernel reference that contains the relay channel
+ *
+ *	Should only be called from kref_put().
+ */
+static void relay_destroy_channel(struct kref *kref)
+{
+	struct rchan *chan = container_of(kref, struct rchan, kref);
+	kfree(chan);
+}
+
+/**
+ *	relay_destroy_buf - destroy an rchan_buf struct and associated buffer
+ *	@buf: the buffer struct
+ */
+static void relay_destroy_buf(struct rchan_buf *buf)
+{
+	struct rchan *chan = buf->chan;
+	struct buf_page *buf_page, *n;
+
+	list_for_each_entry_safe(buf_page, n, &buf->pages, list)
+		__free_page(&buf_page->page);
+	chan->buf[buf->cpu] = NULL;
+	kfree(buf);
+	kref_put(&chan->kref, relay_destroy_channel);
+}
+
+/**
+ *	relay_remove_buf - remove a channel buffer
+ *	@kref: target kernel reference that contains the relay buffer
+ *
+ *	Removes the file from the fileystem, which also frees the
+ *	rchan_buf_struct and the channel buffer.  Should only be called from
+ *	kref_put().
+ */
+static void relay_remove_buf(struct kref *kref)
+{
+	struct rchan_buf *buf = container_of(kref, struct rchan_buf, kref);
+	buf->chan->cb->remove_buf_file(buf->dentry);
+	relay_destroy_buf(buf);
+}
+
+/*
+ * High-level relay kernel API and associated functions.
+ */
+
+/*
+ * rchan_callback implementations defining default channel behavior.  Used
+ * in place of corresponding NULL values in client callback struct.
+ */
+
+/*
+ * create_buf_file_create() default callback.  Does nothing.
+ */
+static struct dentry *create_buf_file_default_callback(const char *filename,
+						       struct dentry *parent,
+						       int mode,
+						       struct rchan_buf *buf)
+{
+	return NULL;
+}
+
+/*
+ * remove_buf_file() default callback.  Does nothing.
+ */
+static int remove_buf_file_default_callback(struct dentry *dentry)
+{
+	return -EINVAL;
+}
+
+/* relay channel default callbacks */
+static struct rchan_callbacks default_channel_callbacks = {
+	.create_buf_file = create_buf_file_default_callback,
+	.remove_buf_file = remove_buf_file_default_callback,
+};
+
+/**
+ *	wakeup_readers - wake up readers waiting on a channel
+ *	@data: contains the channel buffer
+ *
+ *	This is the timer function used to defer reader waking.
+ */
+static void wakeup_readers(unsigned long data)
+{
+	struct rchan_buf *buf = (struct rchan_buf *)data;
+	wake_up_interruptible(&buf->read_wait);
+}
+
+/**
+ *	__relay_reset - reset a channel buffer
+ *	@buf: the channel buffer
+ *	@init: 1 if this is a first-time initialization
+ *
+ *	See relay_reset() for description of effect.
+ */
+static void __relay_reset(struct rchan_buf *buf, unsigned int init)
+{
+	if (init) {
+		init_waitqueue_head(&buf->read_wait);
+		kref_init(&buf->kref);
+		setup_timer(&buf->timer, wakeup_readers, (unsigned long)buf);
+	} else
+		del_timer_sync(&buf->timer);
+
+	buf->finalized = 0;
+}
+
+/*
+ *	relay_open_buf - create a new relay channel buffer
+ *
+ *	used by relay_open() and CPU hotplug.
+ */
+static struct rchan_buf *relay_open_buf(struct rchan *chan, unsigned int cpu)
+{
+ 	struct rchan_buf *buf = NULL;
+	struct dentry *dentry;
+ 	char *tmpname;
+
+	tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
+ 	if (!tmpname)
+ 		goto end;
+ 	snprintf(tmpname, NAME_MAX, "%s%d", chan->base_filename, cpu);
+
+	buf = relay_create_buf(chan);
+	if (!buf)
+ 		goto free_name;
+
+ 	buf->cpu = cpu;
+ 	__relay_reset(buf, 1);
+
+	/* Create file in fs */
+ 	dentry = chan->cb->create_buf_file(tmpname, chan->parent, S_IRUSR,
+ 					   buf);
+ 	if (!dentry)
+ 		goto free_buf;
+
+	buf->dentry = dentry;
+
+ 	goto free_name;
+
+free_buf:
+ 	relay_destroy_buf(buf);
+ 	buf = NULL;
+free_name:
+ 	kfree(tmpname);
+end:
+	return buf;
+}
+
+/**
+ *	relay_close_buf - close a channel buffer
+ *	@buf: channel buffer
+ *
+ *	Marks the buffer finalized and restores the default callbacks.
+ *	The channel buffer and channel buffer data structure are then freed
+ *	automatically when the last reference is given up.
+ */
+static void relay_close_buf(struct rchan_buf *buf)
+{
+	del_timer_sync(&buf->timer);
+	kref_put(&buf->kref, relay_remove_buf);
+}
+
+static void setup_callbacks(struct rchan *chan,
+				   struct rchan_callbacks *cb)
+{
+	if (!cb) {
+		chan->cb = &default_channel_callbacks;
+		return;
+	}
+
+	if (!cb->create_buf_file)
+		cb->create_buf_file = create_buf_file_default_callback;
+	if (!cb->remove_buf_file)
+		cb->remove_buf_file = remove_buf_file_default_callback;
+	chan->cb = cb;
+}
+
+/**
+ * 	relay_hotcpu_callback - CPU hotplug callback
+ * 	@nb: notifier block
+ * 	@action: hotplug action to take
+ * 	@hcpu: CPU number
+ *
+ * 	Returns the success/failure of the operation. (%NOTIFY_OK, %NOTIFY_BAD)
+ */
+static int __cpuinit relay_hotcpu_callback(struct notifier_block *nb,
+				unsigned long action,
+				void *hcpu)
+{
+	unsigned int hotcpu = (unsigned long)hcpu;
+	struct rchan *chan;
+
+	switch(action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		mutex_lock(&relay_channels_mutex);
+		list_for_each_entry(chan, &relay_channels, list) {
+			if (chan->buf[hotcpu])
+				continue;
+			chan->buf[hotcpu] = relay_open_buf(chan, hotcpu);
+			if(!chan->buf[hotcpu]) {
+				printk(KERN_ERR
+					"relay_hotcpu_callback: cpu %d buffer "
+					"creation failed\n", hotcpu);
+				mutex_unlock(&relay_channels_mutex);
+				return NOTIFY_BAD;
+			}
+		}
+		mutex_unlock(&relay_channels_mutex);
+		break;
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		/* No need to flush the cpu : will be flushed upon
+		 * final relay_flush() call. */
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+/**
+ *	ltt_relay_open - create a new relay channel
+ *	@base_filename: base name of files to create
+ *	@parent: dentry of parent directory, %NULL for root directory
+ *	@subbuf_size: size of sub-buffers
+ *	@n_subbufs: number of sub-buffers
+ *	@cb: client callback functions
+ *	@private_data: user-defined data
+ *
+ *	Returns channel pointer if successful, %NULL otherwise.
+ *
+ *	Creates a channel buffer for each cpu using the sizes and
+ *	attributes specified.  The created channel buffer files
+ *	will be named base_filename0...base_filenameN-1.  File
+ *	permissions will be %S_IRUSR.
+ */
+struct rchan *ltt_relay_open(const char *base_filename,
+			 struct dentry *parent,
+			 size_t subbuf_size,
+			 size_t n_subbufs,
+			 struct rchan_callbacks *cb,
+			 void *private_data)
+{
+	unsigned int i;
+	struct rchan *chan;
+	if (!base_filename)
+		return NULL;
+
+	if (!(subbuf_size && n_subbufs))
+		return NULL;
+
+	chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
+	if (!chan)
+		return NULL;
+
+	chan->version = LTT_RELAY_CHANNEL_VERSION;
+	chan->n_subbufs = n_subbufs;
+	chan->subbuf_size = subbuf_size;
+	chan->alloc_size = FIX_SIZE(subbuf_size * n_subbufs);
+	chan->parent = parent;
+	chan->private_data = private_data;
+	strlcpy(chan->base_filename, base_filename, NAME_MAX);
+	setup_callbacks(chan, cb);
+	kref_init(&chan->kref);
+
+	mutex_lock(&relay_channels_mutex);
+	for_each_online_cpu(i) {
+		chan->buf[i] = relay_open_buf(chan, i);
+		if (!chan->buf[i])
+			goto free_bufs;
+	}
+	list_add(&chan->list, &relay_channels);
+	mutex_unlock(&relay_channels_mutex);
+
+	return chan;
+
+free_bufs:
+	for_each_online_cpu(i) {
+		if (!chan->buf[i])
+			break;
+		relay_close_buf(chan->buf[i]);
+	}
+
+	kref_put(&chan->kref, relay_destroy_channel);
+	mutex_unlock(&relay_channels_mutex);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(ltt_relay_open);
+
+/**
+ *	ltt_relay_close - close the channel
+ *	@chan: the channel
+ *
+ *	Closes all channel buffers and frees the channel.
+ */
+void ltt_relay_close(struct rchan *chan)
+{
+	unsigned int i;
+
+	if (!chan)
+		return;
+
+	mutex_lock(&relay_channels_mutex);
+	for_each_possible_cpu(i)
+		if (chan->buf[i])
+			relay_close_buf(chan->buf[i]);
+
+	list_del(&chan->list);
+	kref_put(&chan->kref, relay_destroy_channel);
+	mutex_unlock(&relay_channels_mutex);
+}
+EXPORT_SYMBOL_GPL(ltt_relay_close);
+
+/**
+ *	relay_file_open - open file op for relay files
+ *	@inode: the inode
+ *	@filp: the file
+ *
+ *	Increments the channel buffer refcount.
+ */
+static int relay_file_open(struct inode *inode, struct file *filp)
+{
+	struct rchan_buf *buf = inode->i_private;
+	kref_get(&buf->kref);
+	filp->private_data = buf;
+
+	return nonseekable_open(inode, filp);
+}
+
+/**
+ *	relay_file_release - release file op for relay files
+ *	@inode: the inode
+ *	@filp: the file
+ *
+ *	Decrements the channel refcount, as the filesystem is
+ *	no longer using it.
+ */
+static int relay_file_release(struct inode *inode, struct file *filp)
+{
+	struct rchan_buf *buf = filp->private_data;
+	kref_put(&buf->kref, relay_remove_buf);
+
+	return 0;
+}
+
+const struct file_operations ltt_relay_file_operations = {
+	.open		= relay_file_open,
+	.llseek		= no_llseek,
+	.release	= relay_file_release,
+};
+EXPORT_SYMBOL_GPL(ltt_relay_file_operations);
+
+static __init int relay_init(void)
+{
+	hotcpu_notifier(relay_hotcpu_callback, 5);
+	return 0;
+}
+
+module_init(relay_init);
Index: linux-2.6-lttng/include/linux/ltt-relay.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/linux/ltt-relay.h	2008-09-27 09:22:34.000000000 -0400
@@ -0,0 +1,291 @@
+/*
+ * linux/include/linux/ltt-relay.h
+ *
+ * Copyright (C) 2002, 2003 - Tom Zanussi (zanussi@us.ibm.com), IBM Corp
+ * Copyright (C) 1999, 2000, 2001, 2002 - Karim Yaghmour (karim@opersys.com)
+ * Copyright (C) 2008 - Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca)
+ *
+ * CONFIG_RELAY definitions and declarations
+ */
+
+#ifndef _LINUX_LTT_RELAY_H
+#define _LINUX_LTT_RELAY_H
+
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/wait.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/kref.h>
+#include <linux/mm.h>
+
+/* Needs a _much_ better name... */
+#define FIX_SIZE(x) ((((x) - 1) & PAGE_MASK) + PAGE_SIZE)
+
+/*
+ * Tracks changes to rchan/rchan_buf structs
+ */
+#define LTT_RELAY_CHANNEL_VERSION		8
+
+struct buf_page {
+	union {
+		struct {
+			unsigned long flags;    /* mandatory */
+			atomic_t _count;        /* mandatory */
+			struct list_head list;  /* linked list of buf pages */
+			size_t offset;		/* page offset in the buffer */
+		};
+		struct page page;
+	};
+};
+
+/*
+ * Per-cpu relay channel buffer
+ */
+struct rchan_buf
+{
+	struct rchan *chan;		/* associated channel */
+	wait_queue_head_t read_wait;	/* reader wait queue */
+	struct timer_list timer; 	/* reader wake-up timer */
+	struct dentry *dentry;		/* channel file dentry */
+	struct kref kref;		/* channel buffer refcount */
+	struct list_head pages;		/* list of buffer pages */
+	struct buf_page *wpage;		/* current write page (cache) */
+	struct buf_page *hpage;		/* current subbuf header page (cache) */
+	struct buf_page *rpage;		/* current subbuf read page (cache) */
+	unsigned int page_count;	/* number of current buffer pages */
+	unsigned int finalized;		/* buffer has been finalized */
+	unsigned int cpu;		/* this buf's cpu */
+} ____cacheline_aligned;
+
+/*
+ * Relay channel data structure
+ */
+struct rchan
+{
+	u32 version;			/* the version of this struct */
+	size_t subbuf_size;		/* sub-buffer size */
+	size_t n_subbufs;		/* number of sub-buffers per buffer */
+	size_t alloc_size;		/* total buffer size allocated */
+	struct rchan_callbacks *cb;	/* client callbacks */
+	struct kref kref;		/* channel refcount */
+	void *private_data;		/* for user-defined data */
+	struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */
+	struct list_head list;		/* for channel list */
+	struct dentry *parent;		/* parent dentry passed to open */
+	char base_filename[NAME_MAX];	/* saved base filename */
+};
+
+/*
+ * Relay channel client callbacks
+ */
+struct rchan_callbacks
+{
+	/*
+	 * subbuf_start - called on buffer-switch to a new sub-buffer
+	 * @buf: the channel buffer containing the new sub-buffer
+	 * @subbuf: the start of the new sub-buffer
+	 * @prev_subbuf: the start of the previous sub-buffer
+	 * @prev_padding: unused space at the end of previous sub-buffer
+	 *
+	 * The client should return 1 to continue logging, 0 to stop
+	 * logging.
+	 *
+	 * NOTE: subbuf_start will also be invoked when the buffer is
+	 *       created, so that the first sub-buffer can be initialized
+	 *       if necessary.  In this case, prev_subbuf will be NULL.
+	 *
+	 * NOTE: the client can reserve bytes at the beginning of the new
+	 *       sub-buffer by calling subbuf_start_reserve() in this callback.
+	 */
+	int (*subbuf_start) (struct rchan_buf *buf,
+			     void *subbuf,
+			     void *prev_subbuf,
+			     size_t prev_padding);
+
+	/*
+	 * create_buf_file - create file to represent a relay channel buffer
+	 * @filename: the name of the file to create
+	 * @parent: the parent of the file to create
+	 * @mode: the mode of the file to create
+	 * @buf: the channel buffer
+	 *
+	 * Called during relay_open(), once for each per-cpu buffer,
+	 * to allow the client to create a file to be used to
+	 * represent the corresponding channel buffer.  If the file is
+	 * created outside of relay, the parent must also exist in
+	 * that filesystem.
+	 *
+	 * The callback should return the dentry of the file created
+	 * to represent the relay buffer.
+	 *
+	 * Setting the is_global outparam to a non-zero value will
+	 * cause relay_open() to create a single global buffer rather
+	 * than the default set of per-cpu buffers.
+	 *
+	 * See Documentation/filesystems/relayfs.txt for more info.
+	 */
+	struct dentry *(*create_buf_file)(const char *filename,
+					  struct dentry *parent,
+					  int mode,
+					  struct rchan_buf *buf);
+
+	/*
+	 * remove_buf_file - remove file representing a relay channel buffer
+	 * @dentry: the dentry of the file to remove
+	 *
+	 * Called during relay_close(), once for each per-cpu buffer,
+	 * to allow the client to remove a file used to represent a
+	 * channel buffer.
+	 *
+	 * The callback should return 0 if successful, negative if not.
+	 */
+	int (*remove_buf_file)(struct dentry *dentry);
+};
+
+static inline struct buf_page *ltt_relay_find_prev_page(struct buf_page *page,
+	ssize_t offset)
+{
+	list_for_each_entry_reverse(page, &page->list, list)
+		if (offset >= page->offset &&
+				offset < page->offset + PAGE_SIZE)
+			return page;
+	return NULL;
+}
+
+static inline struct buf_page *ltt_relay_find_next_page(struct buf_page *page,
+	ssize_t offset)
+{
+	list_for_each_entry(page, &page->list, list)
+		if (offset >= page->offset &&
+				offset < page->offset + PAGE_SIZE)
+			return page;
+	return NULL;
+}
+
+/*
+ * Find the page containing "offset". Cache it if it is after the currently
+ * cached page.
+ */
+static inline
+struct buf_page *ltt_relay_cache_page(struct buf_page **page_cache,
+		struct buf_page *page, size_t offset)
+{
+	ssize_t diff_offset;
+
+	/*
+	 * Make sure this is the page we want to write into. The current
+	 * page is changed concurrently by other writers. [wrh]page are
+	 * used as a cache remembering the last page written
+	 * to/read/looked up for header address. No synchronization;
+	 * could have to find the previous page is a nested write
+	 * occured. Finding the right page is done by comparing the
+	 * dest_offset with the buf_page offsets.
+	 */
+	diff_offset = offset - page->offset;
+	if (unlikely(diff_offset >= PAGE_SIZE)) {
+		page = ltt_relay_find_next_page(page, offset);
+		WARN_ON(!page);
+		*page_cache = page;
+	} else if (unlikely(diff_offset < 0)) {
+		page = ltt_relay_find_prev_page(page, offset);
+		WARN_ON(!page);
+	}
+	return page;
+}
+
+static inline int ltt_relay_write(struct rchan_buf *buf, size_t offset,
+	const void *src, size_t len)
+{
+	struct buf_page *page;
+	ssize_t pagecpy, orig_len;
+
+	orig_len = len;
+	page = buf->wpage;
+	if (unlikely(!len))
+		return 0;
+	for (;;) {
+		page = ltt_relay_cache_page(&buf->wpage, page, offset);
+		pagecpy = min(len, PAGE_SIZE - (offset & ~PAGE_MASK));
+		memcpy(page_address(&page->page)
+			+ (offset & ~PAGE_MASK), src, pagecpy);
+		len -= pagecpy;
+		if (likely(!len))
+			break;
+		src += pagecpy;
+		offset += pagecpy;
+		/*
+		 * Underlying layer should never ask for writes across
+		 * subbuffers.
+		 */
+		WARN_ON(offset >= buf->chan->alloc_size);
+	}
+	return orig_len;
+}
+
+static inline int ltt_relay_read(struct rchan_buf *buf, size_t offset,
+	void *dest, size_t len)
+{
+	struct buf_page *page;
+	ssize_t pagecpy, orig_len;
+
+	orig_len = len;
+	page = buf->rpage;
+	if (unlikely(!len))
+		return 0;
+	for (;;) {
+		page = ltt_relay_cache_page(&buf->rpage, page, offset);
+		pagecpy = min(len, PAGE_SIZE - (offset & ~PAGE_MASK));
+		memcpy(dest, page_address(&page->page) + (offset & ~PAGE_MASK),
+			pagecpy);
+		len -= pagecpy;
+		if (likely(!len))
+			break;
+		dest += pagecpy;
+		offset += pagecpy;
+		/*
+		 * Underlying layer should never ask for reads across
+		 * subbuffers.
+		 */
+		WARN_ON(offset >= buf->chan->alloc_size);
+	}
+	return orig_len;
+}
+
+/*
+ * Return the address where a given offset is located.
+ * Should be used to get the current subbuffer header pointer. Given we know
+ * it's never on a page boundary, it's safe to write directly to this address,
+ * as long as the write is never bigger than a page size.
+ */
+static inline void *ltt_relay_offset_address(struct rchan_buf *buf,
+	size_t offset)
+{
+	struct buf_page *page;
+
+	page = buf->hpage;
+	page = ltt_relay_cache_page(&buf->hpage, page, offset);
+	return page_address(&page->page) + (offset & ~PAGE_MASK);
+}
+
+/*
+ * CONFIG_LTT_RELAY kernel API, ltt/ltt-relay-alloc.c
+ */
+
+struct rchan *ltt_relay_open(const char *base_filename,
+			 struct dentry *parent,
+			 size_t subbuf_size,
+			 size_t n_subbufs,
+			 struct rchan_callbacks *cb,
+			 void *private_data);
+extern void ltt_relay_close(struct rchan *chan);
+
+/*
+ * exported ltt_relay file operations, ltt/ltt-relay-alloc.c
+ */
+extern const struct file_operations ltt_relay_file_operations;
+
+#endif /* _LINUX_LTT_RELAY_H */
+
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-27 13:40 [RFC PATCH] LTTng relay buffer allocation, read, write Mathieu Desnoyers
@ 2008-09-27 17:10 ` Peter Zijlstra
  2008-09-28  8:59   ` Peter Zijlstra
  2008-09-29 15:50   ` Mathieu Desnoyers
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2008-09-27 17:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder

On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
> As I told Martin, I was thinking about taking an axe and moving stuff around in
> relay. Which I just did.
> 
> This patch reimplements relay with a linked list of pages. Provides read/write
> wrappers which should be used to read or write from the buffers. It's the core
> of a layered approach to the design requirements expressed by Martin and
> discussed earlier.
> 
> It does not provide _any_ sort of locking on buffer data. Locking should be done
> by the caller. Given that we might think of very lightweight locking schemes, it
> makes sense to me that the underlying buffering infrastructure supports event
> records larger than 1 page.
> 
> A cache saving 3 pointers is used to keep track of current page used for the
> buffer for write, read and current subbuffer header pointer lookup. The offset
> of each page within the buffer is saved in the page frame structure to permit
> cache lookup without extra locking.
> 
> TODO : Currently, no splice file operations are implemented. Should come soon.
> The idea is to splice the buffers directly into files or to the network.
> 
> This patch is released as early RFC. It builds, that's about it. Testing will
> come when I implement the splice ops.

Why? What aspects of Steve's ring-buffer interface will hinder us
optimizing the implementation to be as light-weight as you like?

The thing is, I'd like to see it that light as well ;-)

As for the page-spanning entries, I think we can do that with Steve's
system just fine, its just that Linus said its a dumb idea and Steve
dropped it, but there is nothing fundamental stopping us from recording
a length that is > PAGE_SIZE and copying data into the pages one at a
time.

Nor do I see it impossible to implement splice on top of Steve's
ring-buffer..

So again, why?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-27 17:10 ` Peter Zijlstra
@ 2008-09-28  8:59   ` Peter Zijlstra
  2008-09-29 16:06     ` Mathieu Desnoyers
  2008-09-29 15:50   ` Mathieu Desnoyers
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-09-28  8:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder

On Sat, 2008-09-27 at 19:10 +0200, Peter Zijlstra wrote:
> On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:

> > It does not provide _any_ sort of locking on buffer data. Locking should be done
> > by the caller. Given that we might think of very lightweight locking schemes,

Which defeats the whole purpose of the exercise, we want to provide a
single mechanism - including locking - that is usable to all. Otherwise
everybody gets to do the hard part themselves, which will undoubtedly
result in many broken/suboptimal locking schemes. 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-27 17:10 ` Peter Zijlstra
  2008-09-28  8:59   ` Peter Zijlstra
@ 2008-09-29 15:50   ` Mathieu Desnoyers
  2008-09-29 16:38     ` Steven Rostedt
  2008-09-29 17:30     ` Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-29 15:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
> > As I told Martin, I was thinking about taking an axe and moving stuff around in
> > relay. Which I just did.
> > 
> > This patch reimplements relay with a linked list of pages. Provides read/write
> > wrappers which should be used to read or write from the buffers. It's the core
> > of a layered approach to the design requirements expressed by Martin and
> > discussed earlier.
> > 
> > It does not provide _any_ sort of locking on buffer data. Locking should be done
> > by the caller. Given that we might think of very lightweight locking schemes, it
> > makes sense to me that the underlying buffering infrastructure supports event
> > records larger than 1 page.
> > 
> > A cache saving 3 pointers is used to keep track of current page used for the
> > buffer for write, read and current subbuffer header pointer lookup. The offset
> > of each page within the buffer is saved in the page frame structure to permit
> > cache lookup without extra locking.
> > 
> > TODO : Currently, no splice file operations are implemented. Should come soon.
> > The idea is to splice the buffers directly into files or to the network.
> > 
> > This patch is released as early RFC. It builds, that's about it. Testing will
> > come when I implement the splice ops.
> 
> Why? What aspects of Steve's ring-buffer interface will hinder us
> optimizing the implementation to be as light-weight as you like?
> 
> The thing is, I'd like to see it that light as well ;-)
> 

Ok, I'll try to explain my point of view. The thing is : I want those
binary buffers to be exported to userspace, and I fear that the approach
taken by Steven (let's write "simple" C structure directly into the
buffers) will in fact be much more _complex_ (due to subtle compiler
dependencies) that doing our own event payload (event data) format.

Also, things such as 
ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.
will probably become a problem when trying to go for a more efficient
locking scheme.

ring_buffer_peek: Look at a next item in the cpu buffer.
This kind of feature is useless for data extraction to userspace and
poses serious synchronization concerns if you have other writers in the
same buffer.

Structure for event records :

struct ring_buffer_event {
        u32 type:2, len:3, time_delta:27;
        u32 array[];
};

The only good thing about reserving 2 bits for event IDs in there is to
put the most frequent events in those IDs, which is clearly not the
case:
RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
header telling the length of data written into the subbuffer (what you
guys call a "page", but what I still think might be worthy to be of
variable size, especially with a light locking infrastructure and
considering we might want to export this data to userspace).

RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.

Also, if size _really_ matters, we should just export the event ID and
look up the event payload size through a separate table. If the payload
consists of a binary blob, then the data structure should start by a
payload size and then have a the actual binary blob.

struct ring_buffer_event {
        u32 time_ext:1, evid:4, time_lsb:27;
        union {
          u32 array[];
          struct {
            u32 ext_id;
            u32 array[];
          };
          struct {
            u32 ext_time;
            u32 array[];
          };
          struct {
            u32 ext_time;
            u32 ext_id;
            u32 array[];
          };
};

Therefore we can encode up to 15 event IDs into this compact
representation (we reserve ID 0xF for extended id). If we assign those
IDs per subbuffer, it leaves plenty of room before we have to write a
supplementary field for more IDs.

OTOH, if we really want to have an event size in there (which is more
solid), we could have :

struct ring_buffer_event {
        u32 time_ext:1, time_lsb:31;
        u16 evid;
        u16 evsize;
        union {
          u32 array[];
          struct {
            u32 ext_time;
            u32 array[];
          };
};

That's a bit bigger, but keeps the event size in the data stream.

Also, nobody has explained successfully why we have to encode a time
_delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
either I fail to see the light here (I doubt it), or I am not clear
enough when I say that we can just put the raw LSBs and compute the
delta afterward when reading the buffers, manage to keep the same
overflow detection power, and also keep the absolute value of the tsc
lsb, which makes it much easier to cross-check than "deltas".

Now for the buffer pages implementation :


+struct buffer_page {
+       union {
+               struct {
+                       unsigned long flags;    /* mandatory */
+                       atomic_t _count;        /* mandatory */
+                       u64     time_stamp;     /* page time stamp */

Why should we ever associate a time stamp with a page ??

I see that we could save the timestamp at which a subbuffer switch
happens (which in this patchset semantics happens to be a page), but why
would we every want to save that in the page frame ? Especially if we
simply write the LSBs instead of a time delta... Also, I would write
this timestamp in a subbuffer _header_ which is exported to userspace,
but I clealry don't see why we keep it here. In addition, it's
completely wrong in a layered approach : if we want to switch from pages
to video memory or to a statically allocated buffer at boot time, such
"page-related" timestamp hacks won't do.

+                       unsigned size;          /* size of page data */

This size should be exported to userspace, and therefore belongs to a
subbuffer header, not to the page frame struct.

+                       struct list_head list;  /* linked list of free pages */

"free pages" ? Are those "free" ? What does free mean here ? If Steven
actually moves head and tail pointers to keep track of which pages data
is written into, it will become an utter mess when we'll try to
implement a lockless algorithm, since those are all non-atomic
operations.

+               };
+               struct page page;
+       };
+};

> As for the page-spanning entries, I think we can do that with Steve's
> system just fine, its just that Linus said its a dumb idea and Steve
> dropped it, but there is nothing fundamental stopping us from recording
> a length that is > PAGE_SIZE and copying data into the pages one at a
> time.
> 
> Nor do I see it impossible to implement splice on top of Steve's
> ring-buffer..
> 
> So again, why?
> 

I'd like to separate the layer which deals with data read/write from the
layer which deals with synchronization of data write/read to/from the
buffers so we can eventually switch to a locking mechanism which
provides a sane performance level, and given Steven's linked list
implementation, it will just add unneeded locking requirements.

Compared to this, I deal with buffer coherency with two 32 bits counters
in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
commit count). I'd like to keep this semantic and yet support writing to
non-vmap'd memory (which I do in the patch I propose). I'd just have to
implement the splice operation in ltt-relay.c (the layer that sits on
top of this patch).

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-28  8:59   ` Peter Zijlstra
@ 2008-09-29 16:06     ` Mathieu Desnoyers
  0 siblings, 0 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-29 16:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Sat, 2008-09-27 at 19:10 +0200, Peter Zijlstra wrote:
> > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
> 
> > > It does not provide _any_ sort of locking on buffer data. Locking should be done
> > > by the caller. Given that we might think of very lightweight locking schemes,
> 
> Which defeats the whole purpose of the exercise, we want to provide a
> single mechanism - including locking - that is usable to all. Otherwise
> everybody gets to do the hard part themselves, which will undoubtedly
> result in many broken/suboptimal locking schemes. 
> 

Well, this is my answer to Steven's "this is too complex" comments,
which I suspect is really a "this is too implement to implement". Sorry
Steven, but you do not actually propose anything to address my concerns,
which are : I want to export this data to userspace without tricky
dependencies on the compiler ABI. I also don't want to be limited in
locking infrastructure implementation.

Those are the kind of concerns that are much easier to address in a
layered and modular implementation. If we try to do everything in the
same C file, we end up having typing/memory management/time management
all closely tied.

So I am all for providing a common infrastructure which implements all
this, but I think this infrastructure should itself be layered and
modular.

Also, I have something really really near to the requirements expressed
in LTTng, which is :

Linux Kernel Markers : Event data typing exportable to userspace without
                       tricky compiler ABI dependency.
                       TODO : Export marker list to debugfs.
                              Allow individual marker enable/disable
                              through debugfs file.
                              Use per client buffer marker IDs rather
                              than a global ID table.
                              Export the markers IDs/format/name through
                              one small buffer for each client buffer.
ltt-relay :            Buffer coherency management. TODO : splice.
ltt-relay-alloc :      Buffer allocation and read/write, without vmap.
ltt-tracer :           In-kernel API to manage trace allocation,
                       start/stop.
                       TODO : Currently has a statically limited set of
                       buffers. Should be extended so that clients could
                       register new buffers.
ltt-control :          Netlink control which calls the in-kernel
                       ltt-tracer API.
                       TODO : switch from netlink to debugfs.
ltt-timestamp :        Timestamping infrastructure (tsc, global
                       counter). Currently supports about 6
                       architectures. Has an asm-generic fallback.
ltt-heartbeat :        Deal with 32 TSC overflow by periodically writing
                       an event in every buffers.
                       TODO : switch to "extended time" field by keeping
                       track of the previously written timestamp.

If you think it's worthwhile, I could post a selected set of my patches
to LKML to see the reactions. However, note that there are a few TODOs,
so it does not address all the requirements.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 15:50   ` Mathieu Desnoyers
@ 2008-09-29 16:38     ` Steven Rostedt
  2008-09-29 18:38       ` Mathieu Desnoyers
  2008-09-29 17:30     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-09-29 16:38 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Martin Bligh, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi


On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> Ok, I'll try to explain my point of view. The thing is : I want those
> binary buffers to be exported to userspace, and I fear that the approach
> taken by Steven (let's write "simple" C structure directly into the
> buffers) will in fact be much more _complex_ (due to subtle compiler
> dependencies) that doing our own event payload (event data) format.
> 
> Also, things such as 
> ring_buffer_lock: A way to lock the entire buffer.
> ring_buffer_unlock: unlock the buffer.
> will probably become a problem when trying to go for a more efficient
> locking scheme.

I plan on nuking the above for something better in v2.

> 
> ring_buffer_peek: Look at a next item in the cpu buffer.
> This kind of feature is useless for data extraction to userspace and
> poses serious synchronization concerns if you have other writers in the
> same buffer.

It absolutely important for ftrace.

> 
> Structure for event records :
> 
> struct ring_buffer_event {
>         u32 type:2, len:3, time_delta:27;
>         u32 array[];
> };
> 
> The only good thing about reserving 2 bits for event IDs in there is to
> put the most frequent events in those IDs, which is clearly not the
> case:
> RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
> header telling the length of data written into the subbuffer (what you
> guys call a "page", but what I still think might be worthy to be of
> variable size, especially with a light locking infrastructure and
> considering we might want to export this data to userspace).

I now have both. But I think userspace can now easily see when there
is a place in the buffer that is empty.

> 
> RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
> 
> Also, if size _really_ matters, we should just export the event ID and
> look up the event payload size through a separate table. If the payload
> consists of a binary blob, then the data structure should start by a
> payload size and then have a the actual binary blob.
> 
> struct ring_buffer_event {
>         u32 time_ext:1, evid:4, time_lsb:27;
>         union {
>           u32 array[];
>           struct {
>             u32 ext_id;
>             u32 array[];
>           };
>           struct {
>             u32 ext_time;
>             u32 array[];
>           };
>           struct {
>             u32 ext_time;
>             u32 ext_id;
>             u32 array[];
>           };
> };
> 
> Therefore we can encode up to 15 event IDs into this compact
> representation (we reserve ID 0xF for extended id). If we assign those
> IDs per subbuffer, it leaves plenty of room before we have to write a
> supplementary field for more IDs.

I wanted to push the event ids out of the ring buffer logic. Only a few 
internal ones are used.  Otherwise, we'll have one hell of a bit enum 
table with every freaking tracing type in it. That's what I want to avoid.


> 
> OTOH, if we really want to have an event size in there (which is more
> solid), we could have :
> 
> struct ring_buffer_event {
>         u32 time_ext:1, time_lsb:31;
>         u16 evid;
>         u16 evsize;
>         union {
>           u32 array[];
>           struct {
>             u32 ext_time;
>             u32 array[];
>           };
> };

My smallest record is 8 bytes. Yours now is 12.

> 
> That's a bit bigger, but keeps the event size in the data stream.

So does mine.

> 
> Also, nobody has explained successfully why we have to encode a time
> _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> either I fail to see the light here (I doubt it), or I am not clear
> enough when I say that we can just put the raw LSBs and compute the
> delta afterward when reading the buffers, manage to keep the same
> overflow detection power, and also keep the absolute value of the tsc
> lsb, which makes it much easier to cross-check than "deltas".

Well, you need to record wraps. Probably more often then you record delta
wraps.

> 
> Now for the buffer pages implementation :
> 
> 
> +struct buffer_page {
> +       union {
> +               struct {
> +                       unsigned long flags;    /* mandatory */
> +                       atomic_t _count;        /* mandatory */
> +                       u64     time_stamp;     /* page time stamp */
> 
> Why should we ever associate a time stamp with a page ??

Because we save it on overwrite, which is the default mode for ftrace.

> 
> I see that we could save the timestamp at which a subbuffer switch
> happens (which in this patchset semantics happens to be a page), but why
> would we every want to save that in the page frame ? Especially if we
> simply write the LSBs instead of a time delta... Also, I would write

Where do you get the MSBs from?

> this timestamp in a subbuffer _header_ which is exported to userspace,

Well, our subbuffer header is the page frame.

> but I clealry don't see why we keep it here. In addition, it's
> completely wrong in a layered approach : if we want to switch from pages
> to video memory or to a statically allocated buffer at boot time, such
> "page-related" timestamp hacks won't do.

As Linus said, anything bigger than a page should be outside this buffer.
All the buffer would then need is a pointer to the data. Then the tracer
can figure out what to do with the rest of that.

> 
> +                       unsigned size;          /* size of page data */
> 
> This size should be exported to userspace, and therefore belongs to a
> subbuffer header, not to the page frame struct.

Again, page frame == sub buffer, period!

> 
> +                       struct list_head list;  /* linked list of free pages */
> 
> "free pages" ? Are those "free" ? What does free mean here ? If Steven

Bah, thanks. "free" is a misnomer.  It should just be list of pages.

> actually moves head and tail pointers to keep track of which pages data
> is written into, it will become an utter mess when we'll try to
> implement a lockless algorithm, since those are all non-atomic
> operations.

cmpxchg(head, old_head, head->next) ?

> 
> +               };
> +               struct page page;
> +       };
> +};
> 
> > As for the page-spanning entries, I think we can do that with Steve's
> > system just fine, its just that Linus said its a dumb idea and Steve
> > dropped it, but there is nothing fundamental stopping us from recording
> > a length that is > PAGE_SIZE and copying data into the pages one at a
> > time.
> > 
> > Nor do I see it impossible to implement splice on top of Steve's
> > ring-buffer..
> > 
> > So again, why?
> > 
> 
> I'd like to separate the layer which deals with data read/write from the
> layer which deals with synchronization of data write/read to/from the
> buffers so we can eventually switch to a locking mechanism which
> provides a sane performance level, and given Steven's linked list
> implementation, it will just add unneeded locking requirements.
> 
> Compared to this, I deal with buffer coherency with two 32 bits counters
> in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
> commit count). I'd like to keep this semantic and yet support writing to
> non-vmap'd memory (which I do in the patch I propose). I'd just have to
> implement the splice operation in ltt-relay.c (the layer that sits on
> top of this patch).
> 

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 15:50   ` Mathieu Desnoyers
  2008-09-29 16:38     ` Steven Rostedt
@ 2008-09-29 17:30     ` Peter Zijlstra
  2008-09-29 20:31       ` Mathieu Desnoyers
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2008-09-29 17:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

On Mon, 2008-09-29 at 11:50 -0400, Mathieu Desnoyers wrote:
> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
> > > As I told Martin, I was thinking about taking an axe and moving stuff around in
> > > relay. Which I just did.
> > > 
> > > This patch reimplements relay with a linked list of pages. Provides read/write
> > > wrappers which should be used to read or write from the buffers. It's the core
> > > of a layered approach to the design requirements expressed by Martin and
> > > discussed earlier.
> > > 
> > > It does not provide _any_ sort of locking on buffer data. Locking should be done
> > > by the caller. Given that we might think of very lightweight locking schemes, it
> > > makes sense to me that the underlying buffering infrastructure supports event
> > > records larger than 1 page.
> > > 
> > > A cache saving 3 pointers is used to keep track of current page used for the
> > > buffer for write, read and current subbuffer header pointer lookup. The offset
> > > of each page within the buffer is saved in the page frame structure to permit
> > > cache lookup without extra locking.
> > > 
> > > TODO : Currently, no splice file operations are implemented. Should come soon.
> > > The idea is to splice the buffers directly into files or to the network.
> > > 
> > > This patch is released as early RFC. It builds, that's about it. Testing will
> > > come when I implement the splice ops.
> > 
> > Why? What aspects of Steve's ring-buffer interface will hinder us
> > optimizing the implementation to be as light-weight as you like?
> > 
> > The thing is, I'd like to see it that light as well ;-)
> > 
> 
> Ok, I'll try to explain my point of view. The thing is : I want those
> binary buffers to be exported to userspace, and I fear that the approach
> taken by Steven (let's write "simple" C structure directly into the
> buffers) will in fact be much more _complex_ (due to subtle compiler
> dependencies) that doing our own event payload (event data) format.

The only compiler dependant thing in there is the bitfield, which could
be recoded using regular bitops.

I'm not seeing anything particularly worrysome about that.

> Also, things such as 
> ring_buffer_lock: A way to lock the entire buffer.
> ring_buffer_unlock: unlock the buffer.
> will probably become a problem when trying to go for a more efficient
> locking scheme.

You can do

stop:
cpu_buffer->stop=1;
smp_wmb();
sched_sync();

start:
smp_wmb();
cpu_buffer->stop=0;


write:
if (unlikely(smp_rmb(), cpu_buffer->stop))) {
  return -EBUSY;
    or
  while (cpu_buffer->stop)
    cpu_relax();
}

The read in the fast path is just a read, nothing fancy...

> ring_buffer_peek: Look at a next item in the cpu buffer.
> This kind of feature is useless for data extraction to userspace and
> poses serious synchronization concerns if you have other writers in the
> same buffer.

Sure, its probably possible to rework the merge-iterator to use consume,
but that would require it having storage for 1 event, which might be a
bit messy.

How would your locking deal with this? - it really is a requirement to
be able to merge-sort-iterate the output from kernel space..

> Structure for event records :
> 
> struct ring_buffer_event {
>         u32 type:2, len:3, time_delta:27;
>         u32 array[];
> };
> 
> The only good thing about reserving 2 bits for event IDs in there is to
> put the most frequent events in those IDs

Not so, these types are buffer package types, not actual event types, I
think thats a useful distinction.

> , which is clearly not the
> case:
> RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
> header telling the length of data written into the subbuffer (what you
> guys call a "page", but what I still think might be worthy to be of
> variable size, especially with a light locking infrastructure and
> considering we might want to export this data to userspace).

But then you'd need a sub-buffer header, which in itself takes space,
this padding solution seems like a fine middle-ground, it only takes
space when you need it and it free otherwise.

The sub-buffer headers would always take space.

> RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
> 
> Also, if size _really_ matters,

You and Martin have been telling it does ;-)

>  we should just export the event ID and
> look up the event payload size through a separate table. If the payload
> consists of a binary blob, then the data structure should start by a
> payload size and then have a the actual binary blob.
> 
> struct ring_buffer_event {
>         u32 time_ext:1, evid:4, time_lsb:27;
>         union {
>           u32 array[];
>           struct {
>             u32 ext_id;
>             u32 array[];
>           };
>           struct {
>             u32 ext_time;
>             u32 array[];
>           };
>           struct {
>             u32 ext_time;
>             u32 ext_id;
>             u32 array[];
>           };
> };
> 
> Therefore we can encode up to 15 event IDs into this compact
> representation (we reserve ID 0xF for extended id). If we assign those
> IDs per subbuffer, it leaves plenty of room before we have to write a
> supplementary field for more IDs.
> 
> OTOH, if we really want to have an event size in there (which is more
> solid), we could have :
> 
> struct ring_buffer_event {
>         u32 time_ext:1, time_lsb:31;
>         u16 evid;
>         u16 evsize;
>         union {
>           u32 array[];
>           struct {
>             u32 ext_time;
>             u32 array[];
>           };
> };
> 
> That's a bit bigger, but keeps the event size in the data stream.

I'm really not seeing what any of these proposals have on top of what
Steve currently has. We have the ability to encode up to 28 bytes of
payload in a 4 bytes header, which should suffice for most entries,
right?

> Also, nobody has explained successfully why we have to encode a time
> _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> either I fail to see the light here (I doubt it), or I am not clear
> enough when I say that we can just put the raw LSBs and compute the
> delta afterward when reading the buffers, manage to keep the same
> overflow detection power, and also keep the absolute value of the tsc
> lsb, which makes it much easier to cross-check than "deltas".

Still not quite understanding where you get the MSBs from, how do you
tell if two LSBs are from the same period?

> Now for the buffer pages implementation :
> 
> 
> +struct buffer_page {
> +       union {
> +               struct {
> +                       unsigned long flags;    /* mandatory */
> +                       atomic_t _count;        /* mandatory */
> +                       u64     time_stamp;     /* page time stamp */
> 
> Why should we ever associate a time stamp with a page ??

Discarting the whole sub-buffer idea, it could be used to validate
whichever time-stamp logic.

Personally I don't particulary like the sub-buffer concept, and I don't
think we need it.

> I see that we could save the timestamp at which a subbuffer switch
> happens (which in this patchset semantics happens to be a page), but why
> would we every want to save that in the page frame ? Especially if we
> simply write the LSBs instead of a time delta... Also, I would write
> this timestamp in a subbuffer _header_ which is exported to userspace,

Why have sub-buffers at all?

> but I clealry don't see why we keep it here. In addition, it's
> completely wrong in a layered approach : if we want to switch from pages
> to video memory or to a statically allocated buffer at boot time, such
> "page-related" timestamp hacks won't do.

I don't think you _ever_ want to insert actual video memory in the
trace, what you _might_ possibly want to do, is insert a copy of a
frame, but that you can do with paged buffers like we have, just add an
entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
to copy everything to page-alinged chunks.

There is nothing techinically prohibiting this in Steve's scheme, except
for Linus telling you you're an idiot for doing it.

The NOP packets you dislike allow us to align regular entries with page
boundaries, which in turn allows this 0-copy stuff. If you use the
sub-write iterator stuff we taked about a while ago, you can leave out
the NOPs.

Having both makes sense (if you want the large packets stuff), use the
regular page aligned 0-copy stuff for regular small packets, and use the
sub-write iterator stuff for your huge packets.

> > As for the page-spanning entries, I think we can do that with Steve's
> > system just fine, its just that Linus said its a dumb idea and Steve
> > dropped it, but there is nothing fundamental stopping us from recording
> > a length that is > PAGE_SIZE and copying data into the pages one at a
> > time.
> > 
> > Nor do I see it impossible to implement splice on top of Steve's
> > ring-buffer..
> > 
> > So again, why?
> > 
> 
> I'd like to separate the layer which deals with data read/write from the
> layer which deals with synchronization of data write/read to/from the
> buffers so we can eventually switch to a locking mechanism which
> provides a sane performance level, and given Steven's linked list
> implementation, it will just add unneeded locking requirements.

How does it differ from your linked list implementation? Reaslistically,
we want but a single locking scheme for the trace buffer stuff. So
factoring it out doesn't really make sense.

> Compared to this, I deal with buffer coherency with two 32 bits counters
> in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
> commit count). 

I think that can work just fine on top of Steve's stuff too, it needs a
bit of trimming etc.. but afaict there isn't anything stopping us from
implementing his reserve function as light as you want:

int reserve(buffer, size, flags)
{
 preempt_disable()
 cpu = smp_processor_id();
 cpu_buf = buffer->buffers[cpu];

 if (cpu_buf->stop) {
   ret = -EBUSY;
   goto out;
 }

again:
 pos = cpu_buf->write_pos;
 if (flags & CONTIG) {
   new_pos = pos + size;
 } else {
   if (size > PAGE_SIZE) {
     ret = -ENOSPACE;
     goto out;
   }
   if ((pos + size) & PAGE_MASK != pos & PAGE_MASK) {
     insert_nops();
   }
   new_pos = pos + size;
 }
 if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)
   goto again;
 return pos;

out:
 preempt_enable();
 return ret;
}

If you put the offset&PAGE_MASK in each page-frame you can use that to
easily detect when you need to flip to the next page.

Which I imagine is similar to what you have... although I must admit to
not being sure how to deal with over-write here, I guess your buffer
must be large enough to ensure no nesting depth allows you to
wrap-around while having an even un-commited.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 16:38     ` Steven Rostedt
@ 2008-09-29 18:38       ` Mathieu Desnoyers
  2008-09-29 19:40         ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-29 18:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, Martin Bligh, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> > Ok, I'll try to explain my point of view. The thing is : I want those
> > binary buffers to be exported to userspace, and I fear that the approach
> > taken by Steven (let's write "simple" C structure directly into the
> > buffers) will in fact be much more _complex_ (due to subtle compiler
> > dependencies) that doing our own event payload (event data) format.
> > 
> > Also, things such as 
> > ring_buffer_lock: A way to lock the entire buffer.
> > ring_buffer_unlock: unlock the buffer.
> > will probably become a problem when trying to go for a more efficient
> > locking scheme.
> 
> I plan on nuking the above for something better in v2.
> 
> > 
> > ring_buffer_peek: Look at a next item in the cpu buffer.
> > This kind of feature is useless for data extraction to userspace and
> > poses serious synchronization concerns if you have other writers in the
> > same buffer.
> 
> It absolutely important for ftrace.
> 

As I already told you in person, if you have, say 16 pages for your
buffer, you could peek into all the pages which are not being currently
written into (15 other pages). This would permit to remove unnecessary
writer synchronization from a cmpxchg scheme : it would only have to
push the readers upon page switch rather than at every single even.
Pushing the readers involves, at best, a fully synchronized cmpxchg,
which I would prefer not to do at each and every event

> > 
> > Structure for event records :
> > 
> > struct ring_buffer_event {
> >         u32 type:2, len:3, time_delta:27;
> >         u32 array[];
> > };
> > 
> > The only good thing about reserving 2 bits for event IDs in there is to
> > put the most frequent events in those IDs, which is clearly not the
> > case:
> > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
> > header telling the length of data written into the subbuffer (what you
> > guys call a "page", but what I still think might be worthy to be of
> > variable size, especially with a light locking infrastructure and
> > considering we might want to export this data to userspace).
> 
> I now have both. But I think userspace can now easily see when there
> is a place in the buffer that is empty.
> 

(I notice the comment in your v10 says that minimum size for event is 8
bytes, isn't it rather 4 bytes ?)

(len field set to zero for events bigger than 28 bytes.. what do you use
for events with 0 byte payload ? I'd rather use the 28 bytes value to
identify all events >= 28 bytes and then use the first field to identify
the length).

What you propose here is to duplicate the information : have the number
of bytes used in the page header, exported to userspace, and also to
reserve an event ID (which becomes unavailable to encode anything else)
for "padding" event. There is clearly a space loss here.

Also, dealing with end of page padding will become a bit complex : you
have to check whether your padding event fits in the space left at the
end (4-bytes aligned, but minimum event size is 8 bytes, as stated in
your comment ? Is this true ?) Also, what happens if you plan to write
an event bigger than 28 bytes in your subbuffer (or page) and happen to
be at the end of the page ? You'd have to create padding which is larger
than 28 bytes. Your v10 comments seems to indicate the design does not
support it.

> > 
> > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
> > 
> > Also, if size _really_ matters, we should just export the event ID and
> > look up the event payload size through a separate table. If the payload
> > consists of a binary blob, then the data structure should start by a
> > payload size and then have a the actual binary blob.
> > 
> > struct ring_buffer_event {
> >         u32 time_ext:1, evid:4, time_lsb:27;
> >         union {
> >           u32 array[];
> >           struct {
> >             u32 ext_id;
> >             u32 array[];
> >           };
> >           struct {
> >             u32 ext_time;
> >             u32 array[];
> >           };
> >           struct {
> >             u32 ext_time;
> >             u32 ext_id;
> >             u32 array[];
> >           };
> > };
> > 
> > Therefore we can encode up to 15 event IDs into this compact
> > representation (we reserve ID 0xF for extended id). If we assign those
> > IDs per subbuffer, it leaves plenty of room before we have to write a
> > supplementary field for more IDs.
> 
> I wanted to push the event ids out of the ring buffer logic. Only a few 
> internal ones are used.  Otherwise, we'll have one hell of a bit enum 
> table with every freaking tracing type in it. That's what I want to avoid.
> 

This is why I propose to dynamically allocate event IDs through the
markers infrastructure. Other you'll have to declare and export such
large enumerations by hand.

> 
> > 
> > OTOH, if we really want to have an event size in there (which is more
> > solid), we could have :
> > 
> > struct ring_buffer_event {
> >         u32 time_ext:1, time_lsb:31;
> >         u16 evid;
> >         u16 evsize;
> >         union {
> >           u32 array[];
> >           struct {
> >             u32 ext_time;
> >             u32 array[];
> >           };
> > };
> 
> My smallest record is 8 bytes. Yours now is 12.
> 

Uh ?

The smallest record here is 8 bytes too. It has time_ext bit, time_lsb,
evid and evsize. It does not contain any event-specific payload.

With this given implementation :

struct ring_buffer_event {
  u32 time_ext:1, evid:4, time_lsb:27;
};

The smallest event is 4-bytes, which is twice smaller than yours.


> > That's a bit bigger, but keeps the event size in the data stream.
> 
> So does mine.
> 

I see the 4 first bytes of this smallest event (the ring_buffer_event).
What does the following 4 bytes contain exactly ?


> > 
> > Also, nobody has explained successfully why we have to encode a time
> > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> > either I fail to see the light here (I doubt it), or I am not clear
> > enough when I say that we can just put the raw LSBs and compute the
> > delta afterward when reading the buffers, manage to keep the same
> > overflow detection power, and also keep the absolute value of the tsc
> > lsb, which makes it much easier to cross-check than "deltas".
> 
> Well, you need to record wraps. Probably more often then you record delta
> wraps.
> 

Nope. You don't even need to record wraps. Say you have the kernel code
that checks for 27 bits overflows between two consecutive events and
write an extended timestamp if this is detected
(that is, if (evBtsc - evAtsc > 0x7FFFFFF)).
Therefore, you are sure that if you have two consecutive events with
only 27 bits TSC lsb in the trace (non-extended timestamp), those are at
most 2^27 cycles apart. Let's call them event A and event B.

Event A would have this TSC value (only 27 LSBs are saved here) :
0x100
Event B would have this TSC LSB value :
0x200

In this case, 0x200 - 0x100 > 0
  -> no overflow

However, if we have :
Event A : 0x2
Event B : 0x1

Then we know that there has been exactly one overflow between those two
events :
0x1 - 0x2 <= 0
  -> overflow

And the extreme case :
Event A : 0x10
Event B : 0x10
0x10 - 0x10 <= 0
  -> overflow

Notice that if event B, in the last case, would be just one cycle above,
the check initially done by the kernel would have switched to an
extended timestamp, so we would have detected the overflow anyway.

> > 
> > Now for the buffer pages implementation :
> > 
> > 
> > +struct buffer_page {
> > +       union {
> > +               struct {
> > +                       unsigned long flags;    /* mandatory */
> > +                       atomic_t _count;        /* mandatory */
> > +                       u64     time_stamp;     /* page time stamp */
> > 
> > Why should we ever associate a time stamp with a page ??
> 
> Because we save it on overwrite, which is the default mode for ftrace.
> 

Sorry, I don't understand this one. You "save it" (where ?) on overwrite
(when ? When you start overwrting a previously populated page ?) and
this serves what purpose exactly ?

> > 
> > I see that we could save the timestamp at which a subbuffer switch
> > happens (which in this patchset semantics happens to be a page), but why
> > would we every want to save that in the page frame ? Especially if we
> > simply write the LSBs instead of a time delta... Also, I would write
> 
> Where do you get the MSBs from?
> 

Just to try to grasp the page frame concept : are those actually bytes
physically located at the beginning of the page (and thus actually
representing a page header) or are they placed elsewhere in the kernel
memory and just serves as memory-management data for pages ?

I doubt this struct page is actually part of the page itself, and thus I
don't see how it exports the TSC MSBs to userspace. I would propose to
keep a few byte at the beginning of every subbuffer (actually pages in
your implementation) to save the full TSC.

> > this timestamp in a subbuffer _header_ which is exported to userspace,
> 
> Well, our subbuffer header is the page frame.
> 

Hrm, this is actually my question above.

> > but I clealry don't see why we keep it here. In addition, it's
> > completely wrong in a layered approach : if we want to switch from pages
> > to video memory or to a statically allocated buffer at boot time, such
> > "page-related" timestamp hacks won't do.
> 
> As Linus said, anything bigger than a page should be outside this buffer.
> All the buffer would then need is a pointer to the data. Then the tracer
> can figure out what to do with the rest of that.
> 

That will lead to memory leaks in flight recorder mode. What happens if
the data is not consumed ? We never free the referenced memory ?

Also, how to we present that to userspace ?

Also, if we have a corrupted buffer, lost events ? Those will call into
kfree in the tracing hot path ? This sounds _very_ instrusive and
dangerous. I am quite sure we'll want to instrumentat the memory
management parts of the kernel too (I actually have patches in LTTng for
that and kmemtrace already does it).

> > 
> > +                       unsigned size;          /* size of page data */
> > 
> > This size should be exported to userspace, and therefore belongs to a
> > subbuffer header, not to the page frame struct.
> 
> Again, page frame == sub buffer, period!
> 

The best argument I have heard yet is from Linus and is based on the
fact that we'll never want to write more than 4096 bytes within locking
such as irq disable and spinlock. This assumption does not hold with a
cmpxchg-based locking scheme.

Also, I kind of sensed that people were unwilling to write large events
in the same stream where small events are put. There is no need to do
that. We can have various buffers for the various tracers (one for
scheduler, one for ftrace, one for network packet sniffing, one for
usbmon...) and therefore manage to keep large events in separate
buffers.

> > 
> > +                       struct list_head list;  /* linked list of free pages */
> > 
> > "free pages" ? Are those "free" ? What does free mean here ? If Steven
> 
> Bah, thanks. "free" is a misnomer.  It should just be list of pages.
> 
> > actually moves head and tail pointers to keep track of which pages data
> > is written into, it will become an utter mess when we'll try to
> > implement a lockless algorithm, since those are all non-atomic
> > operations.
> 
> cmpxchg(head, old_head, head->next) ?
> 

Nope, that won't work because you'll have to cmpxchg two things :
- The offset within the page
- The current page pointer

And you cannot do both at the same time, and therefore any of those two
can fail. Dealing with page switch will therefore become racy by design.

Mathieu

> > 
> > +               };
> > +               struct page page;
> > +       };
> > +};
> > 
> > > As for the page-spanning entries, I think we can do that with Steve's
> > > system just fine, its just that Linus said its a dumb idea and Steve
> > > dropped it, but there is nothing fundamental stopping us from recording
> > > a length that is > PAGE_SIZE and copying data into the pages one at a
> > > time.
> > > 
> > > Nor do I see it impossible to implement splice on top of Steve's
> > > ring-buffer..
> > > 
> > > So again, why?
> > > 
> > 
> > I'd like to separate the layer which deals with data read/write from the
> > layer which deals with synchronization of data write/read to/from the
> > buffers so we can eventually switch to a locking mechanism which
> > provides a sane performance level, and given Steven's linked list
> > implementation, it will just add unneeded locking requirements.
> > 
> > Compared to this, I deal with buffer coherency with two 32 bits counters
> > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
> > commit count). I'd like to keep this semantic and yet support writing to
> > non-vmap'd memory (which I do in the patch I propose). I'd just have to
> > implement the splice operation in ltt-relay.c (the layer that sits on
> > top of this patch).
> > 
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 18:38       ` Mathieu Desnoyers
@ 2008-09-29 19:40         ` Steven Rostedt
  2008-09-29 19:54           ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-09-29 19:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Martin Bligh, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi


On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> > 
> > > 
> > > ring_buffer_peek: Look at a next item in the cpu buffer.
> > > This kind of feature is useless for data extraction to userspace and
> > > poses serious synchronization concerns if you have other writers in the
> > > same buffer.
> > 
> > It absolutely important for ftrace.
> > 
> 
> As I already told you in person, if you have, say 16 pages for your
> buffer, you could peek into all the pages which are not being currently
> written into (15 other pages). This would permit to remove unnecessary
> writer synchronization from a cmpxchg scheme : it would only have to
> push the readers upon page switch rather than at every single even.
> Pushing the readers involves, at best, a fully synchronized cmpxchg,
> which I would prefer not to do at each and every event

Actually, I also have been looking at doing things lockless. Note, the 
peek operation is for the iteration mode of read, which must disable 
writes (for now). The iteration mode does not consume the reader, in fact 
there is no consumer. It is used to give a trace after the fact. No writer 
should be present anyway.

the peek operation will not be used for a consumer read, which allows for 
both readers and writers to run at the same time.

> > 
> > I now have both. But I think userspace can now easily see when there
> > is a place in the buffer that is empty.
> > 
> 
> (I notice the comment in your v10 says that minimum size for event is 8
> bytes, isn't it rather 4 bytes ?)

Yes it is 8 bytes minimum, 4 bytes aligned.

> 
> (len field set to zero for events bigger than 28 bytes.. what do you use
> for events with 0 byte payload ? I'd rather use the 28 bytes value to
> identify all events >= 28 bytes and then use the first field to identify
> the length).

I force a zero type payload to have a len of 1. This keeps the minimum at
8 bytes, with 4 bytes unused.

> 
> What you propose here is to duplicate the information : have the number
> of bytes used in the page header, exported to userspace, and also to
> reserve an event ID (which becomes unavailable to encode anything else)
> for "padding" event. There is clearly a space loss here.

Because my focus is not on userspace. I don't care about user space apps!

My user space app is "cat".  I don't want to map ever single kind of event 
into the buffer. Keeping the length in the header keeps everything more 
robust. I don't have to worry about searching an event hash table, or 
remap what events are. Hell, I don't want to always register an event, I 
might just say, load this data and be done with it.

The user shouldn't need to always keep track of event sizes on reading.
Remember, the user could be something in the kernel space as well.

> 
> Also, dealing with end of page padding will become a bit complex : you
> have to check whether your padding event fits in the space left at the
> end (4-bytes aligned, but minimum event size is 8 bytes, as stated in
> your comment ? Is this true ?) Also, what happens if you plan to write
> an event bigger than 28 bytes in your subbuffer (or page) and happen to
> be at the end of the page ? You'd have to create padding which is larger
> than 28 bytes. Your v10 comments seems to indicate the design does not
> support it.

Yep, we add more than 28 bytes of padding. Linus said this was fine. If
you don't have enough space to fit your event, go to the next page.
If you only have 4 bytes left, yes we have an exception to the rule, the
padding event will be 4 bytes.

> > > Therefore we can encode up to 15 event IDs into this compact
> > > representation (we reserve ID 0xF for extended id). If we assign those
> > > IDs per subbuffer, it leaves plenty of room before we have to write a
> > > supplementary field for more IDs.
> > 
> > I wanted to push the event ids out of the ring buffer logic. Only a few 
> > internal ones are used.  Otherwise, we'll have one hell of a bit enum 
> > table with every freaking tracing type in it. That's what I want to avoid.
> > 
> 
> This is why I propose to dynamically allocate event IDs through the
> markers infrastructure. Other you'll have to declare and export such
> large enumerations by hand.

I don't want to be dependent on markers. No need to be, it's too much for
what I need.

> > 
> > > 
> > > OTOH, if we really want to have an event size in there (which is more
> > > solid), we could have :
> > > 
> > > struct ring_buffer_event {
> > >         u32 time_ext:1, time_lsb:31;
> > >         u16 evid;
> > >         u16 evsize;
> > >         union {
> > >           u32 array[];
> > >           struct {
> > >             u32 ext_time;
> > >             u32 array[];
> > >           };
> > > };
> > 
> > My smallest record is 8 bytes. Yours now is 12.
> > 
> 
> Uh ?
> 
> The smallest record here is 8 bytes too. It has time_ext bit, time_lsb,
> evid and evsize. It does not contain any event-specific payload.

OK, my smallest event with data is 8 bytes, yours is 12. What do I want a 
payload without data for?

I see:

 4 bytes time_ext, time
 2 byts evid
 2 bytes evsize

There's your 8bytes. You need anoter 4 bytes to hold data, which makes it 
12.


> 
> With this given implementation :
> 
> struct ring_buffer_event {
>   u32 time_ext:1, evid:4, time_lsb:27;
> };
> 
> The smallest event is 4-bytes, which is twice smaller than yours.

With no data. I'm not going to worry about saving 4 bytes for a zero 
payload.

> 
> 
> > > That's a bit bigger, but keeps the event size in the data stream.
> > 
> > So does mine.
> > 
> 
> I see the 4 first bytes of this smallest event (the ring_buffer_event).
> What does the following 4 bytes contain exactly ?

The payload.

> 
> 
> > > 
> > > Also, nobody has explained successfully why we have to encode a time
> > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> > > either I fail to see the light here (I doubt it), or I am not clear
> > > enough when I say that we can just put the raw LSBs and compute the
> > > delta afterward when reading the buffers, manage to keep the same
> > > overflow detection power, and also keep the absolute value of the tsc
> > > lsb, which makes it much easier to cross-check than "deltas".
> > 
> > Well, you need to record wraps. Probably more often then you record delta
> > wraps.
> > 
> 
> Nope. You don't even need to record wraps. Say you have the kernel code
> that checks for 27 bits overflows between two consecutive events and
> write an extended timestamp if this is detected
> (that is, if (evBtsc - evAtsc > 0x7FFFFFF)).
> Therefore, you are sure that if you have two consecutive events with
> only 27 bits TSC lsb in the trace (non-extended timestamp), those are at
> most 2^27 cycles apart. Let's call them event A and event B.
> 
> Event A would have this TSC value (only 27 LSBs are saved here) :
> 0x100
> Event B would have this TSC LSB value :
> 0x200
> 
> In this case, 0x200 - 0x100 > 0
>   -> no overflow
> 
> However, if we have :
> Event A : 0x2
> Event B : 0x1
> 
> Then we know that there has been exactly one overflow between those two
> events :
> 0x1 - 0x2 <= 0
>   -> overflow
> 
> And the extreme case :
> Event A : 0x10
> Event B : 0x10
> 0x10 - 0x10 <= 0
>   -> overflow
> 
> Notice that if event B, in the last case, would be just one cycle above,
> the check initially done by the kernel would have switched to an
> extended timestamp, so we would have detected the overflow anyway.

I actually don't care which method we use. This is something that we can
change later.

> 
> > > 
> > > Now for the buffer pages implementation :
> > > 
> > > 
> > > +struct buffer_page {
> > > +       union {
> > > +               struct {
> > > +                       unsigned long flags;    /* mandatory */
> > > +                       atomic_t _count;        /* mandatory */
> > > +                       u64     time_stamp;     /* page time stamp */
> > > 
> > > Why should we ever associate a time stamp with a page ??
> > 
> > Because we save it on overwrite, which is the default mode for ftrace.
> > 
> 
> Sorry, I don't understand this one. You "save it" (where ?) on overwrite
> (when ? When you start overwrting a previously populated page ?) and
> this serves what purpose exactly ?

Actually, when we write to a new page, we save the timestamp. We don't 
care about pages we overwrite, since the next read page still has the
timestamp value that was record when it was first written to.

I'm saying, every page needs a timestamp value that can be retrieved, 
otherwise, how do you retrieve the timestamp of a page that was 
overwritten several times since the trace was started and never was read.

> 
> > > 
> > > I see that we could save the timestamp at which a subbuffer switch
> > > happens (which in this patchset semantics happens to be a page), but why
> > > would we every want to save that in the page frame ? Especially if we
> > > simply write the LSBs instead of a time delta... Also, I would write
> > 
> > Where do you get the MSBs from?
> > 
> 
> Just to try to grasp the page frame concept : are those actually bytes
> physically located at the beginning of the page (and thus actually
> representing a page header) or are they placed elsewhere in the kernel
> memory and just serves as memory-management data for pages ?

The later.  Yes the page is the page frame management infrastructure.
When the page is allocated somewhere with get_free_page, most of that
structure is not being used. When the page is free or being used as
cache, then that structure matters.

> 
> I doubt this struct page is actually part of the page itself, and thus I
> don't see how it exports the TSC MSBs to userspace. I would propose to
> keep a few byte at the beginning of every subbuffer (actually pages in
> your implementation) to save the full TSC.

Actually, my first versions had that. But I moved it. I'll come up with
a way to get this info later.

> 
> > > this timestamp in a subbuffer _header_ which is exported to userspace,
> > 
> > Well, our subbuffer header is the page frame.
> > 
> 
> Hrm, this is actually my question above.

hehe, answered above.

> 
> > > but I clealry don't see why we keep it here. In addition, it's
> > > completely wrong in a layered approach : if we want to switch from pages
> > > to video memory or to a statically allocated buffer at boot time, such
> > > "page-related" timestamp hacks won't do.
> > 
> > As Linus said, anything bigger than a page should be outside this buffer.
> > All the buffer would then need is a pointer to the data. Then the tracer
> > can figure out what to do with the rest of that.
> > 
> 
> That will lead to memory leaks in flight recorder mode. What happens if
> the data is not consumed ? We never free the referenced memory ?

IIRC, Linus had for these records:

  array[0] = pointer to big data
  array[1] = pointer to free function.

When the data is consumed in overwrite mode, we can call a free function
to get rid of it.

> 
> Also, how to we present that to userspace ?

The tracer layer will make something available for userspace. The ring
buffer layer will not.

> 
> Also, if we have a corrupted buffer, lost events ? Those will call into
> kfree in the tracing hot path ? This sounds _very_ instrusive and
> dangerous. I am quite sure we'll want to instrumentat the memory
> management parts of the kernel too (I actually have patches in LTTng for
> that and kmemtrace already does it).

We could add it to a per_cpu list that can free these later.

> 
> > > 
> > > +                       unsigned size;          /* size of page data */
> > > 
> > > This size should be exported to userspace, and therefore belongs to a
> > > subbuffer header, not to the page frame struct.
> > 
> > Again, page frame == sub buffer, period!
> > 
> 
> The best argument I have heard yet is from Linus and is based on the
> fact that we'll never want to write more than 4096 bytes within locking
> such as irq disable and spinlock. This assumption does not hold with a
> cmpxchg-based locking scheme.
> 
> Also, I kind of sensed that people were unwilling to write large events
> in the same stream where small events are put. There is no need to do
> that. We can have various buffers for the various tracers (one for
> scheduler, one for ftrace, one for network packet sniffing, one for
> usbmon...) and therefore manage to keep large events in separate
> buffers.

Make your tracer layer have a "large event" buffer.  This current code 
will focus on little events.

> 
> > > 
> > > +                       struct list_head list;  /* linked list of free pages */
> > > 
> > > "free pages" ? Are those "free" ? What does free mean here ? If Steven
> > 
> > Bah, thanks. "free" is a misnomer.  It should just be list of pages.
> > 
> > > actually moves head and tail pointers to keep track of which pages data
> > > is written into, it will become an utter mess when we'll try to
> > > implement a lockless algorithm, since those are all non-atomic
> > > operations.
> > 
> > cmpxchg(head, old_head, head->next) ?
> > 
> 
> Nope, that won't work because you'll have to cmpxchg two things :
> - The offset within the page
> - The current page pointer

BTW, I remember telling you that I want the reader and writer to loop
on the same page. I've changed my mind. This code will push the reader
to the next page on overwrite. That is, a writer will not write on a page
that a reader is on (unless the reader is behind it).

Some assumptions that I plan on making.

1) A writer can only write to the CPU buffer it is on.
2) All readers must be synchronized with each other. Two readers can
  not read from the same buffer at the same time (at the API level).

Now, what this gives us is.

1) stack order of writes.

  All interrupts and NMIs will enter and leave in stack fashion.
  Writers will always disable preemption.
  We do not need to worry about SMP for writes.
  That is, if you reserve data and atomically push the head forward
   if the head goes past the end of page, One, you check if your
   start of head (head - length) is still on the page. If it is
   you add your padding (you already reserved it), then you atomically
   push the head forward. Then you start the process again.
  If the start of the head (head - length) is not on the page, that
   means that an IRQ or NMI came in and pushed it before you.


2) Only need to deal with one reader at a time.

 The one reader just helps make the reader side easier. It doesn't make
 much difference with the writer side.

 I've been thinking of various ways to handle the reader.

 1) only consumer mode handle synchronization with writers.
    The iterator mode (read a trace but do not consume) must disable
    writes. This just makes everything easier, and is usually what you
    want, otherwise the read is just corrupted. ftrace currently disables
    recording when the trace is being read (except for the consumer 
    trace_pipe).

 2) Have an extra sub buffer. This might be what you do. That is, have 
    an extra page, and when a reader starts, it will swap the page from
    the buffer with the extra page. The writer will then write on this
    new page.

    We could use atomic inc on the pages to detect if the page changed
    since we copied it, and if we fail try again.

This will also allow you to take these extra pages and push them to a 
file.

This is just the jest of things to come in v2.

-- Steve

> 
> And you cannot do both at the same time, and therefore any of those two
> can fail. Dealing with page switch will therefore become racy by design.
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 19:40         ` Steven Rostedt
@ 2008-09-29 19:54           ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2008-09-29 19:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Martin Bligh, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi


On Mon, 29 Sep 2008, Steven Rostedt wrote:
>   We do not need to worry about SMP for writes.
>   That is, if you reserve data and atomically push the head forward
>    if the head goes past the end of page, One, you check if your
>    start of head (head - length) is still on the page. If it is
>    you add your padding (you already reserved it), then you atomically
>    push the head forward. Then you start the process again.
>   If the start of the head (head - length) is not on the page, that
>    means that an IRQ or NMI came in and pushed it before you.

I forgot to mention one important detail. The "head" index will stay
on the page frame. That way we do not need to figure out which 
head_page we are on. We grab the head_page, we atomically 
(local_inc_return) the head pointer on that page. If the return value is 
still on the page, we succeeded. We can also increment a value on this 
page frame that will prevent recording if we somehow overflowed the buffer 
before relinquishing the stack.

That is

  reserve_event()

     IRQ->
            reserve_event();

      [...]

     IRQ->reserve_event() came back to original head!

      Here we do not have a big enough buffer, and this is just stupid ;-)
      We would drop packets in this case, and should drop the guy on his
      head who came up with the too small buffer.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 17:30     ` Peter Zijlstra
@ 2008-09-29 20:31       ` Mathieu Desnoyers
  2008-09-29 21:24         ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-29 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Martin Bligh, prasad, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Mon, 2008-09-29 at 11:50 -0400, Mathieu Desnoyers wrote:
> > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > > On Sat, 2008-09-27 at 09:40 -0400, Mathieu Desnoyers wrote:
> > > > As I told Martin, I was thinking about taking an axe and moving stuff around in
> > > > relay. Which I just did.
> > > > 
> > > > This patch reimplements relay with a linked list of pages. Provides read/write
> > > > wrappers which should be used to read or write from the buffers. It's the core
> > > > of a layered approach to the design requirements expressed by Martin and
> > > > discussed earlier.
> > > > 
> > > > It does not provide _any_ sort of locking on buffer data. Locking should be done
> > > > by the caller. Given that we might think of very lightweight locking schemes, it
> > > > makes sense to me that the underlying buffering infrastructure supports event
> > > > records larger than 1 page.
> > > > 
> > > > A cache saving 3 pointers is used to keep track of current page used for the
> > > > buffer for write, read and current subbuffer header pointer lookup. The offset
> > > > of each page within the buffer is saved in the page frame structure to permit
> > > > cache lookup without extra locking.
> > > > 
> > > > TODO : Currently, no splice file operations are implemented. Should come soon.
> > > > The idea is to splice the buffers directly into files or to the network.
> > > > 
> > > > This patch is released as early RFC. It builds, that's about it. Testing will
> > > > come when I implement the splice ops.
> > > 
> > > Why? What aspects of Steve's ring-buffer interface will hinder us
> > > optimizing the implementation to be as light-weight as you like?
> > > 
> > > The thing is, I'd like to see it that light as well ;-)
> > > 
> > 
> > Ok, I'll try to explain my point of view. The thing is : I want those
> > binary buffers to be exported to userspace, and I fear that the approach
> > taken by Steven (let's write "simple" C structure directly into the
> > buffers) will in fact be much more _complex_ (due to subtle compiler
> > dependencies) that doing our own event payload (event data) format.
> 
> The only compiler dependant thing in there is the bitfield, which could
> be recoded using regular bitops.
> 
> I'm not seeing anything particularly worrysome about that.
> 


+struct ring_buffer_event {
+       u32             type:2, len:3, time_delta:27;
+       u32             array[];
+};

I am mostly talking about what goes in the array[] part of the event.
This will be where the complex typing will occur.


> > Also, things such as 
> > ring_buffer_lock: A way to lock the entire buffer.
> > ring_buffer_unlock: unlock the buffer.
> > will probably become a problem when trying to go for a more efficient
> > locking scheme.
> 
> You can do
> 
> stop:
> cpu_buffer->stop=1;
> smp_wmb();
> sched_sync();
> 
> start:
> smp_wmb();
> cpu_buffer->stop=0;
> 
> 
> write:
> if (unlikely(smp_rmb(), cpu_buffer->stop))) {
>   return -EBUSY;
>     or
>   while (cpu_buffer->stop)
>     cpu_relax();
> }
> 
> The read in the fast path is just a read, nothing fancy...
> 

That will stop tracing if you want to read a trace at the same time you
want to write it, which does not permit continuous streaming.

I agree that such start/stop is needed to control tracing on/off, but it
should be separate from the actual writer/reader locking.

> > ring_buffer_peek: Look at a next item in the cpu buffer.
> > This kind of feature is useless for data extraction to userspace and
> > poses serious synchronization concerns if you have other writers in the
> > same buffer.
> 
> Sure, its probably possible to rework the merge-iterator to use consume,
> but that would require it having storage for 1 event, which might be a
> bit messy.
> 

Not sure why you'd need storage for 1 event ?

> How would your locking deal with this? - it really is a requirement to
> be able to merge-sort-iterate the output from kernel space..
> 

Sure, I plan to support this. This would be a subset of userspace data
extraction. The in-kernel consumer would be a consumer just like a
splice operation which would extract buffers one event at a time. Rather
than doing a splice to write the data to disk or to the network, this
special in-kernel consumer would merge-sort events from the various
buffers it is connected to and pretty-print one event record at a time
to a seq_file.

Instead of peeking in the "next event", which implies closely coupled
locking with the writer, it would therefore have to ability to consume
all the oldest subbuffers which are not being written to.

Being a standard consumer, it would therefore increment the consumer
offset, which is synchronized by the writer at each subbuffer (page)
switch only.

> > Structure for event records :
> > 
> > struct ring_buffer_event {
> >         u32 type:2, len:3, time_delta:27;
> >         u32 array[];
> > };
> > 
> > The only good thing about reserving 2 bits for event IDs in there is to
> > put the most frequent events in those IDs
> 
> Not so, these types are buffer package types, not actual event types, I
> think thats a useful distinction.
> 

I actually think we only need 1 bit to represent extended timestamp
headers. The rest of these event types are just unneeded and consume
precious ID space for nothing. I have not seen any justification telling
why reserving these events actually does something that cannot be done
by the extended timestamp header or by reserving a small header at the
beginning of the subbuffer (page).

> > , which is clearly not the
> > case:
> > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
> > header telling the length of data written into the subbuffer (what you
> > guys call a "page", but what I still think might be worthy to be of
> > variable size, especially with a light locking infrastructure and
> > considering we might want to export this data to userspace).
> 
> But then you'd need a sub-buffer header, which in itself takes space,
> this padding solution seems like a fine middle-ground, it only takes
> space when you need it and it free otherwise.
> 
> The sub-buffer headers would always take space.
> 

Note that you already need to save the timestamp in the subbuffer
header. The question is : should we also save the unused size.

If we don't, keeping a padding event will likely :
- Rarely save space, given the variable size nature of event records,
  the likeliness of needing padding is pretty high. A padding event is
  bigger than just writing the unused size anyway, given the unused size
  can be written in a few bits for a page, but the padding event
  requires at least 8 bytes.
- Add complexity to the buffer structure; we have to make sure there is
  enough room in the page to write the padding event, have to make sure
  we support large padding events (> 28 bytes)...
- It also reserves an event ID, which could be used for other purposes.
  That means if we have 8 event IDs we which to write in a specific
  buffer (let's suppose they have a 0 bytes payload), we therefore need
  to increase the event size from 4-bytes to 8-bytes just because we
  reserved those IDs for "internal" (useless) purpose. OTOH, if we keep
  those IDs free to encode tracer-specific IDs, we can pack a lot of
  data in 4-bytes events rather than 8-bytes, which actually doubles the
  amount of events we can record.


> > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
> > 
> > Also, if size _really_ matters,
> 
> You and Martin have been telling it does ;-)
> 
> >  we should just export the event ID and
> > look up the event payload size through a separate table. If the payload
> > consists of a binary blob, then the data structure should start by a
> > payload size and then have a the actual binary blob.
> > 
> > struct ring_buffer_event {
> >         u32 time_ext:1, evid:4, time_lsb:27;
> >         union {
> >           u32 array[];
> >           struct {
> >             u32 ext_id;
> >             u32 array[];
> >           };
> >           struct {
> >             u32 ext_time;
> >             u32 array[];
> >           };
> >           struct {
> >             u32 ext_time;
> >             u32 ext_id;
> >             u32 array[];
> >           };
> > };
> > 
> > Therefore we can encode up to 15 event IDs into this compact
> > representation (we reserve ID 0xF for extended id). If we assign those
> > IDs per subbuffer, it leaves plenty of room before we have to write a
> > supplementary field for more IDs.
> > 
> > OTOH, if we really want to have an event size in there (which is more
> > solid), we could have :
> > 
> > struct ring_buffer_event {
> >         u32 time_ext:1, time_lsb:31;
> >         u16 evid;
> >         u16 evsize;
> >         union {
> >           u32 array[];
> >           struct {
> >             u32 ext_time;
> >             u32 array[];
> >           };
> > };
> > 
> > That's a bit bigger, but keeps the event size in the data stream.
> 
> I'm really not seeing what any of these proposals have on top of what
> Steve currently has. We have the ability to encode up to 28 bytes of
> payload in a 4 bytes header, which should suffice for most entries,
> right?
> 

The common entries would be under 28 bytes, right. But the thing is that
Steven's proposal uses 8 bytes for the _smallest_ event when we can in
fact cut that down to 4 bytes.


> > Also, nobody has explained successfully why we have to encode a time
> > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> > either I fail to see the light here (I doubt it), or I am not clear
> > enough when I say that we can just put the raw LSBs and compute the
> > delta afterward when reading the buffers, manage to keep the same
> > overflow detection power, and also keep the absolute value of the tsc
> > lsb, which makes it much easier to cross-check than "deltas".
> 
> Still not quite understanding where you get the MSBs from, how do you
> tell if two LSBs are from the same period?
> 

See my answer to Steven for this.

> > Now for the buffer pages implementation :
> > 
> > 
> > +struct buffer_page {
> > +       union {
> > +               struct {
> > +                       unsigned long flags;    /* mandatory */
> > +                       atomic_t _count;        /* mandatory */
> > +                       u64     time_stamp;     /* page time stamp */
> > 
> > Why should we ever associate a time stamp with a page ??
> 
> Discarting the whole sub-buffer idea, it could be used to validate
> whichever time-stamp logic.
> 
> Personally I don't particulary like the sub-buffer concept, and I don't
> think we need it.
> 

Depends on the use-cases I guess, and on the locking used. My question
about this time-stamp is : shouldn't we, instead, put it in a page
header (exposed to userspace) rather than in this page frame, which I
believe is in a separate data structure (and not in the page per se) ?

> > I see that we could save the timestamp at which a subbuffer switch
> > happens (which in this patchset semantics happens to be a page), but why
> > would we every want to save that in the page frame ? Especially if we
> > simply write the LSBs instead of a time delta... Also, I would write
> > this timestamp in a subbuffer _header_ which is exported to userspace,
> 
> Why have sub-buffers at all?
> 

To support events larger than 4kB. Also, to support different backing
storage for the buffers without keeping a limitation specific a single
page-tied implementation of those.

> > but I clealry don't see why we keep it here. In addition, it's
> > completely wrong in a layered approach : if we want to switch from pages
> > to video memory or to a statically allocated buffer at boot time, such
> > "page-related" timestamp hacks won't do.
> 
> I don't think you _ever_ want to insert actual video memory in the
> trace, what you _might_ possibly want to do, is insert a copy of a
> frame, but that you can do with paged buffers like we have, just add an
> entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
> to copy everything to page-alinged chunks.
> 

No, no :) What I meant is : some people would like to _reserve_ part of
their video card memory so they can use it as backing storage for the
_buffers_. The benefit of doing that is that those buffers will survive
a hot reboot. Rather useful to debug kernel crashes. :)

Copying larger event payloads would actually require to reserve multiple
events in the trace, and would require a cookie or some locking to tell
that a given event is actually the follow-up of the previous one,
because with more evolved locking schemes, each event space reservation
is atomic and we can therefore not assume that a single payload splitted
in two events will be put contiguously in the trace buffers (can be
interrupted, preempted...).

> There is nothing techinically prohibiting this in Steve's scheme, except
> for Linus telling you you're an idiot for doing it.
> 
> The NOP packets you dislike allow us to align regular entries with page
> boundaries, which in turn allows this 0-copy stuff. If you use the
> sub-write iterator stuff we taked about a while ago, you can leave out
> the NOPs.
> 
> Having both makes sense (if you want the large packets stuff), use the
> regular page aligned 0-copy stuff for regular small packets, and use the
> sub-write iterator stuff for your huge packets.
> 

Sorry, I don't understand what the NOP packet gives you that the used
bytes amount in the subbuffer (page) header doesn't provide (given my
explanation above).


> > > As for the page-spanning entries, I think we can do that with Steve's
> > > system just fine, its just that Linus said its a dumb idea and Steve
> > > dropped it, but there is nothing fundamental stopping us from recording
> > > a length that is > PAGE_SIZE and copying data into the pages one at a
> > > time.
> > > 
> > > Nor do I see it impossible to implement splice on top of Steve's
> > > ring-buffer..
> > > 
> > > So again, why?
> > > 
> > 
> > I'd like to separate the layer which deals with data read/write from the
> > layer which deals with synchronization of data write/read to/from the
> > buffers so we can eventually switch to a locking mechanism which
> > provides a sane performance level, and given Steven's linked list
> > implementation, it will just add unneeded locking requirements.
> 
> How does it differ from your linked list implementation? Reaslistically,
> we want but a single locking scheme for the trace buffer stuff. So
> factoring it out doesn't really make sense.
> 

My linked list implementation does not assume we have to do a "disable
interrupts" around the whole pointer manipulation, simply because there
isn't any complicated manipulation done. Actually, the only linked list
pointer manipulation I need to do a to atomically save one of three
pointers to cache the current page I am writing to/reading from/getting
the header pointer from.

> > Compared to this, I deal with buffer coherency with two 32 bits counters
> > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
> > commit count). 
> 
> I think that can work just fine on top of Steve's stuff too, it needs a
> bit of trimming etc.. but afaict there isn't anything stopping us from
> implementing his reserve function as light as you want:
> 
> int reserve(buffer, size, flags)
> {
>  preempt_disable()
>  cpu = smp_processor_id();
>  cpu_buf = buffer->buffers[cpu];
> 
>  if (cpu_buf->stop) {
>    ret = -EBUSY;
>    goto out;
>  }
> 
> again:
>  pos = cpu_buf->write_pos;
>  if (flags & CONTIG) {
>    new_pos = pos + size;
>  } else {
>    if (size > PAGE_SIZE) {
>      ret = -ENOSPACE;
>      goto out;
>    }
>    if ((pos + size) & PAGE_MASK != pos & PAGE_MASK) {
>      insert_nops();
>    }
>    new_pos = pos + size;
>  }
>  if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)

How do you deal with his call to __rb_reserve_next which deals with page
change and plays with tail_page pointer ? This is where the whole
problem lays; you have to atomically update more than a single atomic
variable.

>    goto again;
>  return pos;
> 
> out:
>  preempt_enable();
>  return ret;
> }
> 
> If you put the offset&PAGE_MASK in each page-frame you can use that to
> easily detect when you need to flip to the next page.
> 

Exactly what I have in this patch.

> Which I imagine is similar to what you have... although I must admit to
> not being sure how to deal with over-write here, I guess your buffer
> must be large enough to ensure no nesting depth allows you to
> wrap-around while having an even un-commited.
> 

I deal with overwrite by looking that the commit count value of the
subbuffer (page) I am about to start writing into. If it's not a
multiple of subbuffer (page) size, then I consider the content of that
specific subbuffer as corrupted. And yes, it implies that nesting that
wraps around the number of subbuffers would cause (detected) data
corruption. If it ever happens, the corrupted subbuffers count will be
incremented and the given subbuffer will be overwritten (the subbuffer
commit count is reequilibrated at this moment). When the stale writer
will resume its execution, it will increment the commit count and
therefore cause a second subbuffer corruption. In the end, 2 subbuffers
will be dropped if this kind of situation happens. But I guess the rule
is to make sure nested writes does not overflow the complete buffer
size, or to simply make the buffers large enough.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 20:31       ` Mathieu Desnoyers
@ 2008-09-29 21:24         ` Steven Rostedt
  2008-09-30 17:22           ` Martin Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-09-29 21:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Martin Bligh, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi



On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> > > 
> > > Ok, I'll try to explain my point of view. The thing is : I want those
> > > binary buffers to be exported to userspace, and I fear that the approach
> > > taken by Steven (let's write "simple" C structure directly into the
> > > buffers) will in fact be much more _complex_ (due to subtle compiler
> > > dependencies) that doing our own event payload (event data) format.
> > 
> > The only compiler dependant thing in there is the bitfield, which could
> > be recoded using regular bitops.
> > 
> > I'm not seeing anything particularly worrysome about that.
> > 
> 
> 
> +struct ring_buffer_event {
> +       u32             type:2, len:3, time_delta:27;
> +       u32             array[];
> +};
> 
> I am mostly talking about what goes in the array[] part of the event.
> This will be where the complex typing will occur.

Only on the reader side.

> 
> 
> > > Also, things such as 
> > > ring_buffer_lock: A way to lock the entire buffer.
> > > ring_buffer_unlock: unlock the buffer.
> > > will probably become a problem when trying to go for a more efficient
> > > locking scheme.
> > 
> > You can do
> > 
> > stop:
> > cpu_buffer->stop=1;
> > smp_wmb();
> > sched_sync();
> > 
> > start:
> > smp_wmb();
> > cpu_buffer->stop=0;
> > 
> > 
> > write:
> > if (unlikely(smp_rmb(), cpu_buffer->stop))) {
> >   return -EBUSY;
> >     or
> >   while (cpu_buffer->stop)
> >     cpu_relax();
> > }
> > 
> > The read in the fast path is just a read, nothing fancy...
> > 
> 
> That will stop tracing if you want to read a trace at the same time you
> want to write it, which does not permit continuous streaming.

Agree, we can do better than that.

> 
> I agree that such start/stop is needed to control tracing on/off, but it
> should be separate from the actual writer/reader locking.

yep

> 
> > > ring_buffer_peek: Look at a next item in the cpu buffer.
> > > This kind of feature is useless for data extraction to userspace and
> > > poses serious synchronization concerns if you have other writers in the
> > > same buffer.
> > 
> > Sure, its probably possible to rework the merge-iterator to use consume,
> > but that would require it having storage for 1 event, which might be a
> > bit messy.
> > 
> 
> Not sure why you'd need storage for 1 event ?
> 
> > How would your locking deal with this? - it really is a requirement to
> > be able to merge-sort-iterate the output from kernel space..
> > 
> 
> Sure, I plan to support this. This would be a subset of userspace data
> extraction. The in-kernel consumer would be a consumer just like a
> splice operation which would extract buffers one event at a time. Rather
> than doing a splice to write the data to disk or to the network, this
> special in-kernel consumer would merge-sort events from the various
> buffers it is connected to and pretty-print one event record at a time
> to a seq_file.
> 
> Instead of peeking in the "next event", which implies closely coupled
> locking with the writer, it would therefore have to ability to consume
> all the oldest subbuffers which are not being written to.

Remember, I plan on supporting two modes. An iterator mode and a consumer 
producer mode. Do you have an iterator mode? That's what most of the
ftrace tools use.

The iterator mode will disable writing and use the "peek" mode. The
consumer mode will not, and have its own ways to read. As is right
now in the buffer code.

> 
> Being a standard consumer, it would therefore increment the consumer
> offset, which is synchronized by the writer at each subbuffer (page)
> switch only.
> 
> > > Structure for event records :
> > > 
> > > struct ring_buffer_event {
> > >         u32 type:2, len:3, time_delta:27;
> > >         u32 array[];
> > > };
> > > 
> > > The only good thing about reserving 2 bits for event IDs in there is to
> > > put the most frequent events in those IDs
> > 
> > Not so, these types are buffer package types, not actual event types, I
> > think thats a useful distinction.
> > 
> 
> I actually think we only need 1 bit to represent extended timestamp
> headers. The rest of these event types are just unneeded and consume
> precious ID space for nothing. I have not seen any justification telling
> why reserving these events actually does something that cannot be done
> by the extended timestamp header or by reserving a small header at the
> beginning of the subbuffer (page).

Perhaps we can look at this for v2.

> 
> > > , which is clearly not the
> > > case:
> > > RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
> > > header telling the length of data written into the subbuffer (what you
> > > guys call a "page", but what I still think might be worthy to be of
> > > variable size, especially with a light locking infrastructure and
> > > considering we might want to export this data to userspace).
> > 
> > But then you'd need a sub-buffer header, which in itself takes space,
> > this padding solution seems like a fine middle-ground, it only takes
> > space when you need it and it free otherwise.
> > 
> > The sub-buffer headers would always take space.
> > 
> 
> Note that you already need to save the timestamp in the subbuffer
> header. The question is : should we also save the unused size.

I do that too. Yes I have duplicate info. But I still like the robustness
of the padding event.

> 
> If we don't, keeping a padding event will likely :
> - Rarely save space, given the variable size nature of event records,
>   the likeliness of needing padding is pretty high. A padding event is
>   bigger than just writing the unused size anyway, given the unused size
>   can be written in a few bits for a page, but the padding event
>   requires at least 8 bytes.

False. It is a "special" event that can actually be in 4 bytes. Which is
what would be wasted anyway. Lenght is ignored by the padding event. Its
basically just another way to make sure we are not reading useless data.

>From the rb_event_length function:

+       switch (event->type) {
+       case RINGBUF_TYPE_PADDING:
+               /* undefined */
+               return -1;


> - Add complexity to the buffer structure; we have to make sure there is
>   enough room in the page to write the padding event, have to make sure
>   we support large padding events (> 28 bytes)...

False as described above.

> - It also reserves an event ID, which could be used for other purposes.
>   That means if we have 8 event IDs we which to write in a specific
>   buffer (let's suppose they have a 0 bytes payload), we therefore need
>   to increase the event size from 4-bytes to 8-bytes just because we
>   reserved those IDs for "internal" (useless) purpose. OTOH, if we keep
>   those IDs free to encode tracer-specific IDs, we can pack a lot of
>   data in 4-bytes events rather than 8-bytes, which actually doubles the
>   amount of events we can record.

A 4 byte dataless payload is useless anyway.

> 
> 
> > > RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
> > > 
> > > Also, if size _really_ matters,
> > 
> > You and Martin have been telling it does ;-)
> > 
> > >  we should just export the event ID and
> > > look up the event payload size through a separate table. If the payload
> > > consists of a binary blob, then the data structure should start by a
> > > payload size and then have a the actual binary blob.
> > > 
> > > struct ring_buffer_event {
> > >         u32 time_ext:1, evid:4, time_lsb:27;
> > >         union {
> > >           u32 array[];
> > >           struct {
> > >             u32 ext_id;
> > >             u32 array[];
> > >           };
> > >           struct {
> > >             u32 ext_time;
> > >             u32 array[];
> > >           };
> > >           struct {
> > >             u32 ext_time;
> > >             u32 ext_id;
> > >             u32 array[];
> > >           };
> > > };
> > > 
> > > Therefore we can encode up to 15 event IDs into this compact
> > > representation (we reserve ID 0xF for extended id). If we assign those
> > > IDs per subbuffer, it leaves plenty of room before we have to write a
> > > supplementary field for more IDs.
> > > 
> > > OTOH, if we really want to have an event size in there (which is more
> > > solid), we could have :
> > > 
> > > struct ring_buffer_event {
> > >         u32 time_ext:1, time_lsb:31;
> > >         u16 evid;
> > >         u16 evsize;
> > >         union {
> > >           u32 array[];
> > >           struct {
> > >             u32 ext_time;
> > >             u32 array[];
> > >           };
> > > };
> > > 
> > > That's a bit bigger, but keeps the event size in the data stream.
> > 
> > I'm really not seeing what any of these proposals have on top of what
> > Steve currently has. We have the ability to encode up to 28 bytes of
> > payload in a 4 bytes header, which should suffice for most entries,
> > right?
> > 
> 
> The common entries would be under 28 bytes, right. But the thing is that
> Steven's proposal uses 8 bytes for the _smallest_ event when we can in
> fact cut that down to 4 bytes.

Only for dataless events.

> 
> 
> > > Also, nobody has explained successfully why we have to encode a time
> > > _delta_ (current tsc - prev tsc) rather than just putting the LSBs. So
> > > either I fail to see the light here (I doubt it), or I am not clear
> > > enough when I say that we can just put the raw LSBs and compute the
> > > delta afterward when reading the buffers, manage to keep the same
> > > overflow detection power, and also keep the absolute value of the tsc
> > > lsb, which makes it much easier to cross-check than "deltas".
> > 
> > Still not quite understanding where you get the MSBs from, how do you
> > tell if two LSBs are from the same period?
> > 
> 
> See my answer to Steven for this.
> 
> > > Now for the buffer pages implementation :
> > > 
> > > 
> > > +struct buffer_page {
> > > +       union {
> > > +               struct {
> > > +                       unsigned long flags;    /* mandatory */
> > > +                       atomic_t _count;        /* mandatory */
> > > +                       u64     time_stamp;     /* page time stamp */
> > > 
> > > Why should we ever associate a time stamp with a page ??
> > 
> > Discarting the whole sub-buffer idea, it could be used to validate
> > whichever time-stamp logic.
> > 
> > Personally I don't particulary like the sub-buffer concept, and I don't
> > think we need it.
> > 
> 
> Depends on the use-cases I guess, and on the locking used. My question
> about this time-stamp is : shouldn't we, instead, put it in a page
> header (exposed to userspace) rather than in this page frame, which I
> believe is in a separate data structure (and not in the page per se) ?

I rather keep the page headers out of the buffer. Let the tracer expose 
it.

> 
> > > I see that we could save the timestamp at which a subbuffer switch
> > > happens (which in this patchset semantics happens to be a page), but why
> > > would we every want to save that in the page frame ? Especially if we
> > > simply write the LSBs instead of a time delta... Also, I would write
> > > this timestamp in a subbuffer _header_ which is exported to userspace,
> > 
> > Why have sub-buffers at all?
> > 
> 
> To support events larger than 4kB. Also, to support different backing
> storage for the buffers without keeping a limitation specific a single
> page-tied implementation of those.

Create something special for >page_size events. Those are not the norm
anyway.

> 
> > > but I clealry don't see why we keep it here. In addition, it's
> > > completely wrong in a layered approach : if we want to switch from pages
> > > to video memory or to a statically allocated buffer at boot time, such
> > > "page-related" timestamp hacks won't do.
> > 
> > I don't think you _ever_ want to insert actual video memory in the
> > trace, what you _might_ possibly want to do, is insert a copy of a
> > frame, but that you can do with paged buffers like we have, just add an
> > entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
> > to copy everything to page-alinged chunks.
> > 
> 
> No, no :) What I meant is : some people would like to _reserve_ part of
> their video card memory so they can use it as backing storage for the
> _buffers_. The benefit of doing that is that those buffers will survive
> a hot reboot. Rather useful to debug kernel crashes. :)

This sounds like something special, and can be done with something 
different. No, I don't want the ring buffer infrastructure to be complex 
mechanism that can do everything. It should be a generic infrastructure 
that does what 99% of the tracers want to do. And 100% of those tracers 
want to record something smaller than a page size.

> 
> Copying larger event payloads would actually require to reserve multiple
> events in the trace, and would require a cookie or some locking to tell
> that a given event is actually the follow-up of the previous one,
> because with more evolved locking schemes, each event space reservation
> is atomic and we can therefore not assume that a single payload splitted
> in two events will be put contiguously in the trace buffers (can be
> interrupted, preempted...).

Again, make a special tracer for this. This is not the common case, and
I really don't care about it.

> 
> > There is nothing techinically prohibiting this in Steve's scheme, except
> > for Linus telling you you're an idiot for doing it.
> > 
> > The NOP packets you dislike allow us to align regular entries with page
> > boundaries, which in turn allows this 0-copy stuff. If you use the
> > sub-write iterator stuff we taked about a while ago, you can leave out
> > the NOPs.
> > 
> > Having both makes sense (if you want the large packets stuff), use the
> > regular page aligned 0-copy stuff for regular small packets, and use the
> > sub-write iterator stuff for your huge packets.
> > 
> 
> Sorry, I don't understand what the NOP packet gives you that the used
> bytes amount in the subbuffer (page) header doesn't provide (given my
> explanation above).
> 
> 
> > > > As for the page-spanning entries, I think we can do that with Steve's
> > > > system just fine, its just that Linus said its a dumb idea and Steve
> > > > dropped it, but there is nothing fundamental stopping us from recording
> > > > a length that is > PAGE_SIZE and copying data into the pages one at a
> > > > time.
> > > > 
> > > > Nor do I see it impossible to implement splice on top of Steve's
> > > > ring-buffer..
> > > > 
> > > > So again, why?
> > > > 
> > > 
> > > I'd like to separate the layer which deals with data read/write from the
> > > layer which deals with synchronization of data write/read to/from the
> > > buffers so we can eventually switch to a locking mechanism which
> > > provides a sane performance level, and given Steven's linked list
> > > implementation, it will just add unneeded locking requirements.
> > 
> > How does it differ from your linked list implementation? Reaslistically,
> > we want but a single locking scheme for the trace buffer stuff. So
> > factoring it out doesn't really make sense.
> > 
> 
> My linked list implementation does not assume we have to do a "disable
> interrupts" around the whole pointer manipulation, simply because there
> isn't any complicated manipulation done. Actually, the only linked list
> pointer manipulation I need to do a to atomically save one of three
> pointers to cache the current page I am writing to/reading from/getting
> the header pointer from.
> 
> > > Compared to this, I deal with buffer coherency with two 32 bits counters
> > > in LTTng : a write offset and a consumer offset. (plus a per-subbuffer
> > > commit count). 
> > 
> > I think that can work just fine on top of Steve's stuff too, it needs a
> > bit of trimming etc.. but afaict there isn't anything stopping us from
> > implementing his reserve function as light as you want:
> > 
> > int reserve(buffer, size, flags)
> > {
> >  preempt_disable()
> >  cpu = smp_processor_id();
> >  cpu_buf = buffer->buffers[cpu];
> > 
> >  if (cpu_buf->stop) {
> >    ret = -EBUSY;
> >    goto out;
> >  }
> > 
> > again:
> >  pos = cpu_buf->write_pos;
> >  if (flags & CONTIG) {
> >    new_pos = pos + size;
> >  } else {
> >    if (size > PAGE_SIZE) {
> >      ret = -ENOSPACE;
> >      goto out;
> >    }
> >    if ((pos + size) & PAGE_MASK != pos & PAGE_MASK) {
> >      insert_nops();
> >    }
> >    new_pos = pos + size;
> >  }
> >  if (cmpxchg(&cpu_buf->write_pos, pos, new_pos) != pos)
> 
> How do you deal with his call to __rb_reserve_next which deals with page
> change and plays with tail_page pointer ? This is where the whole
> problem lays; you have to atomically update more than a single atomic
> variable.


Only need to handle writes in stack order (interrupts and nmis). I'll move 
the tail index into the page itself, and then we can do atomic operations 
on both uping the tail and moving the tail page pointer. This will work 
nicely.


> 
> >    goto again;
> >  return pos;
> > 
> > out:
> >  preempt_enable();
> >  return ret;
> > }
> > 
> > If you put the offset&PAGE_MASK in each page-frame you can use that to
> > easily detect when you need to flip to the next page.
> > 
> 
> Exactly what I have in this patch.
> 
> > Which I imagine is similar to what you have... although I must admit to
> > not being sure how to deal with over-write here, I guess your buffer
> > must be large enough to ensure no nesting depth allows you to
> > wrap-around while having an even un-commited.
> > 
> 
> I deal with overwrite by looking that the commit count value of the
> subbuffer (page) I am about to start writing into. If it's not a

Heh, I was thinking of doing the same thing in the page frame. Sounds
like this is growing into what you did. See, I told you that it will.
Just wait for v2 ;-)

> multiple of subbuffer (page) size, then I consider the content of that
> specific subbuffer as corrupted. And yes, it implies that nesting that
> wraps around the number of subbuffers would cause (detected) data
> corruption. If it ever happens, the corrupted subbuffers count will be
> incremented and the given subbuffer will be overwritten (the subbuffer
> commit count is reequilibrated at this moment). When the stale writer
> will resume its execution, it will increment the commit count and
> therefore cause a second subbuffer corruption. In the end, 2 subbuffers
> will be dropped if this kind of situation happens. But I guess the rule
> is to make sure nested writes does not overflow the complete buffer
> size, or to simply make the buffers large enough.

Actually, I plan on pushing all the work to the reader. The writer will 
have full speed.  I'm not going to worry about it now, I'm pretty sure we 
have the infrastructure in place to make it happen. Another reason I don't 
want the ring buffer to interface with the user land directly is because 
it lets us change it without worrying.

Remember, there is no internal kernel backward compatibility ;-)

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-29 21:24         ` Steven Rostedt
@ 2008-09-30 17:22           ` Martin Bligh
  2008-09-30 17:23             ` Martin Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Bligh @ 2008-09-30 17:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

> A 4 byte dataless payload is useless anyway.

Not at all convinced that's true - we used it for lots of things.
Start and end of irq event is one frequent example.

Also, in a compact mode, we can record start and end of
syscalls like this (without parameters).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 17:22           ` Martin Bligh
@ 2008-09-30 17:23             ` Martin Bligh
  2008-09-30 18:14               ` Mathieu Desnoyers
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Bligh @ 2008-09-30 17:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

On Tue, Sep 30, 2008 at 10:22 AM, Martin Bligh <mbligh@google.com> wrote:
>> A 4 byte dataless payload is useless anyway.
>
> Not at all convinced that's true - we used it for lots of things.
> Start and end of irq event is one frequent example.
>
> Also, in a compact mode, we can record start and end of
> syscalls like this (without parameters).

Ah, sorry, maybe I misread this. No data with no event ids, etc
is fairly useless. 4 bytes data including space to record event ids is OK.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 17:23             ` Martin Bligh
@ 2008-09-30 18:14               ` Mathieu Desnoyers
  2008-09-30 18:22                 ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30 18:14 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Martin Bligh (mbligh@google.com) wrote:
> On Tue, Sep 30, 2008 at 10:22 AM, Martin Bligh <mbligh@google.com> wrote:
> >> A 4 byte dataless payload is useless anyway.
> >
> > Not at all convinced that's true - we used it for lots of things.
> > Start and end of irq event is one frequent example.
> >
> > Also, in a compact mode, we can record start and end of
> > syscalls like this (without parameters).
> 
> Ah, sorry, maybe I misread this. No data with no event ids, etc
> is fairly useless. 4 bytes data including space to record event ids is OK.

In Steven's scheme, the event IDs in the 4 bytes are reserved for
(useless) internal use ;) They can therefore not be used for specific
tracer event IDs, which I think is a misuse of the precious bits
otherwise available to store really useful event IDs.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 18:14               ` Mathieu Desnoyers
@ 2008-09-30 18:22                 ` Steven Rostedt
  2008-09-30 18:35                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-09-30 18:22 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Martin Bligh, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi


On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
> 
> In Steven's scheme, the event IDs in the 4 bytes are reserved for
> (useless) internal use ;) They can therefore not be used for specific
> tracer event IDs, which I think is a misuse of the precious bits
> otherwise available to store really useful event IDs.

I'm using them, so they must not be totally useless. ;-)

But ftrace has its own event ids and I don't want the ring buffer to ever 
have to know about them.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 18:22                 ` Steven Rostedt
@ 2008-09-30 18:35                   ` Mathieu Desnoyers
  2008-09-30 19:43                     ` Steven Rostedt
  2008-09-30 19:44                     ` Martin Bligh
  0 siblings, 2 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30 18:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Martin Bligh, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
> > 
> > In Steven's scheme, the event IDs in the 4 bytes are reserved for
> > (useless) internal use ;) They can therefore not be used for specific
> > tracer event IDs, which I think is a misuse of the precious bits
> > otherwise available to store really useful event IDs.
> 
> I'm using them, so they must not be totally useless. ;-)
> 
> But ftrace has its own event ids and I don't want the ring buffer to ever 
> have to know about them.
> 
> -- Steve
> 

You are actually using them to put redundant information that could be
encoded differently and thus save 4 bits per event records, more or less
what will be needed by most tracers (15 IDs, 1 reserved for an extended
ID field).

So the fact that you use them does not mean they are really required,
and I don't think such duplicated information actually makes things more
solid. Maybe just more obscure ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 18:35                   ` Mathieu Desnoyers
@ 2008-09-30 19:43                     ` Steven Rostedt
  2008-09-30 20:01                       ` Frank Ch. Eigler
  2008-09-30 19:44                     ` Martin Bligh
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2008-09-30 19:43 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Martin Bligh, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi


On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
> 
> You are actually using them to put redundant information that could be
> encoded differently and thus save 4 bits per event records, more or less
> what will be needed by most tracers (15 IDs, 1 reserved for an extended
> ID field).

I really like the idea of keeping the tracer event ids out of the ring 
buffer logic.

> 
> So the fact that you use them does not mean they are really required,
> and I don't think such duplicated information actually makes things more
> solid. Maybe just more obscure ?

Well, at least for version 1 these bits stay. I already found two bugs by 
hitting the event padding when the size said it should not have. This 
redundant information makes me feel a bit more cozy when they match.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 18:35                   ` Mathieu Desnoyers
  2008-09-30 19:43                     ` Steven Rostedt
@ 2008-09-30 19:44                     ` Martin Bligh
  2008-09-30 19:54                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 24+ messages in thread
From: Martin Bligh @ 2008-09-30 19:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

> You are actually using them to put redundant information that could be
> encoded differently and thus save 4 bits per event records, more or less
> what will be needed by most tracers (15 IDs, 1 reserved for an extended
> ID field).

You have 15 event types that are useful with no data payload at all?

> So the fact that you use them does not mean they are really required,
> and I don't think such duplicated information actually makes things more
> solid. Maybe just more obscure ?

This is all over 1 bit of information, right? Since you need at least 1 for
the timestamp stuff.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 19:44                     ` Martin Bligh
@ 2008-09-30 19:54                       ` Mathieu Desnoyers
  2008-09-30 20:49                         ` Martin Bligh
  0 siblings, 1 reply; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30 19:54 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Martin Bligh (mbligh@google.com) wrote:
> > You are actually using them to put redundant information that could be
> > encoded differently and thus save 4 bits per event records, more or less
> > what will be needed by most tracers (15 IDs, 1 reserved for an extended
> > ID field).
> 
> You have 15 event types that are useful with no data payload at all?
> 

I am not saying anything about the actual number of events with 0 bytes
payload I actually have in my own instrumentation, if this is what you
mean. I am just saying that it leaves this room available for such
events.

Even if there is a 32 bits payload associated with those events, the
fact that we can encode the event ID in the 32 bits header will bring
those events from 96 bits (due to 32 bits alignment) down to 64 bits.

> > So the fact that you use them does not mean they are really required,
> > and I don't think such duplicated information actually makes things more
> > solid. Maybe just more obscure ?
> 
> This is all over 1 bit of information, right? Since you need at least 1 for
> the timestamp stuff.

4 bits of information could be added to the 32-bits header if we allow
tracers to register their first 15 event IDs in those 4 bits.

But well... let's keep that for v2. ;)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 19:43                     ` Steven Rostedt
@ 2008-09-30 20:01                       ` Frank Ch. Eigler
  2008-09-30 20:21                         ` Steven Rostedt
  0 siblings, 1 reply; 24+ messages in thread
From: Frank Ch. Eigler @ 2008-09-30 20:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Martin Bligh, Peter Zijlstra, linux-kernel,
	prasad, Linus Torvalds, Thomas Gleixner, od, Andrew Morton, hch,
	David Wilder, Tom Zanussi

Hi -

On Tue, Sep 30, 2008 at 03:43:45PM -0400, Steven Rostedt wrote:
> [...]
> On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
> > 
> > You are actually using them to put redundant information that could be
> > encoded differently and thus save 4 bits per event records, more or less
> > what will be needed by most tracers (15 IDs, 1 reserved for an extended
> > ID field).
> 
> I really like the idea of keeping the tracer event ids out of the ring 
> buffer logic.

That's fine for a ring buffer that only ever contains data from one
event source.  How do you imagine multiplexing working, where one
wants to grep a single debugfs file that contains data from different
event sources?  Punt to another layer?

- FChE

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 20:01                       ` Frank Ch. Eigler
@ 2008-09-30 20:21                         ` Steven Rostedt
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2008-09-30 20:21 UTC (permalink / raw)
  To: Frank Ch. Eigler
  Cc: Mathieu Desnoyers, Martin Bligh, Peter Zijlstra, linux-kernel,
	prasad, Linus Torvalds, Thomas Gleixner, od, Andrew Morton, hch,
	David Wilder, Tom Zanussi


On Tue, 30 Sep 2008, Frank Ch. Eigler wrote:
> On Tue, Sep 30, 2008 at 03:43:45PM -0400, Steven Rostedt wrote:
> > [...]
> > On Tue, 30 Sep 2008, Mathieu Desnoyers wrote:
> > > 
> > > You are actually using them to put redundant information that could be
> > > encoded differently and thus save 4 bits per event records, more or less
> > > what will be needed by most tracers (15 IDs, 1 reserved for an extended
> > > ID field).
> > 
> > I really like the idea of keeping the tracer event ids out of the ring 
> > buffer logic.
> 
> That's fine for a ring buffer that only ever contains data from one
> event source.  How do you imagine multiplexing working, where one
> wants to grep a single debugfs file that contains data from different
> event sources?  Punt to another layer?

Field goal!

Yes! That is exactly what I want. If your trace has only one event, why 
would it want to record them?  The event id belongs in the tracer layer 
not the ring buffer layer.

I would like to make a "trace_buffer" layer that includes all of this.

-- Steve


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 19:54                       ` Mathieu Desnoyers
@ 2008-09-30 20:49                         ` Martin Bligh
  2008-09-30 20:55                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 24+ messages in thread
From: Martin Bligh @ 2008-09-30 20:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

> I am not saying anything about the actual number of events with 0 bytes
> payload I actually have in my own instrumentation, if this is what you
> mean. I am just saying that it leaves this room available for such
> events.

It would, yes. Are they useful?

> Even if there is a 32 bits payload associated with those events, the
> fact that we can encode the event ID in the 32 bits header will bring
> those events from 96 bits (due to 32 bits alignment) down to 64 bits.

That's true. So do we have a bunch of stuff that we really really need
that'd fit into 32 bits, but not 28 bits?

>> This is all over 1 bit of information, right? Since you need at least 1 for
>> the timestamp stuff.
>
> 4 bits of information could be added to the 32-bits header if we allow
> tracers to register their first 15 event IDs in those 4 bits.
>
> But well... let's keep that for v2. ;)

Sounds like a plan ;-) All this stuff is internal representations anyway.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH] LTTng relay buffer allocation, read, write
  2008-09-30 20:49                         ` Martin Bligh
@ 2008-09-30 20:55                           ` Mathieu Desnoyers
  0 siblings, 0 replies; 24+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30 20:55 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, prasad,
	Linus Torvalds, Thomas Gleixner, od, Frank Ch. Eigler,
	Andrew Morton, hch, David Wilder, Tom Zanussi

* Martin Bligh (mbligh@google.com) wrote:
> > I am not saying anything about the actual number of events with 0 bytes
> > payload I actually have in my own instrumentation, if this is what you
> > mean. I am just saying that it leaves this room available for such
> > events.
> 
> It would, yes. Are they useful?
> 

Stuff like irq_exit has 0 byte payload, is very high rate, and useful,
yes.

> > Even if there is a 32 bits payload associated with those events, the
> > fact that we can encode the event ID in the 32 bits header will bring
> > those events from 96 bits (due to 32 bits alignment) down to 64 bits.
> 
> That's true. So do we have a bunch of stuff that we really really need
> that'd fit into 32 bits, but not 28 bits?
> 

Probably pointers on 32 bits archs. Any "int" value, on 32 or 64 bits,
will need careful attention if we want only to export the 28 LSBs. It
would probably make error value export a bit trickier and error-prone.


> >> This is all over 1 bit of information, right? Since you need at least 1 for
> >> the timestamp stuff.
> >
> > 4 bits of information could be added to the 32-bits header if we allow
> > tracers to register their first 15 event IDs in those 4 bits.
> >
> > But well... let's keep that for v2. ;)
> 
> Sounds like a plan ;-) All this stuff is internal representations anyway.


Yup.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-09-30 20:55 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-27 13:40 [RFC PATCH] LTTng relay buffer allocation, read, write Mathieu Desnoyers
2008-09-27 17:10 ` Peter Zijlstra
2008-09-28  8:59   ` Peter Zijlstra
2008-09-29 16:06     ` Mathieu Desnoyers
2008-09-29 15:50   ` Mathieu Desnoyers
2008-09-29 16:38     ` Steven Rostedt
2008-09-29 18:38       ` Mathieu Desnoyers
2008-09-29 19:40         ` Steven Rostedt
2008-09-29 19:54           ` Steven Rostedt
2008-09-29 17:30     ` Peter Zijlstra
2008-09-29 20:31       ` Mathieu Desnoyers
2008-09-29 21:24         ` Steven Rostedt
2008-09-30 17:22           ` Martin Bligh
2008-09-30 17:23             ` Martin Bligh
2008-09-30 18:14               ` Mathieu Desnoyers
2008-09-30 18:22                 ` Steven Rostedt
2008-09-30 18:35                   ` Mathieu Desnoyers
2008-09-30 19:43                     ` Steven Rostedt
2008-09-30 20:01                       ` Frank Ch. Eigler
2008-09-30 20:21                         ` Steven Rostedt
2008-09-30 19:44                     ` Martin Bligh
2008-09-30 19:54                       ` Mathieu Desnoyers
2008-09-30 20:49                         ` Martin Bligh
2008-09-30 20:55                           ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).