* [RFC PATCH 0/2 v3] Unified trace buffer
@ 2008-09-25 18:51 Steven Rostedt
  2008-09-25 18:51 ` [RFC PATCH 1/2 " Steven Rostedt
  2008-09-25 18:51 ` [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer Steven Rostedt
  0 siblings, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-25 18:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig


[ NOTE: function comments have not been updated. Comments within the
 code have. ]


In this version I changed the event header to what Peter Zijlstra requested.
The buffer alignment is now 4 bytes instead of 8.  The minimum event size is 8 bytes.
The event header is now:

struct event_header {
	u32	type:2, len:3, time_delta:27;
	u32	array[];
};

The length of the record is determined as:

  if (data size > 28 bytes)
	length = event->array[0] + sizeof(event_header);
  else
  	length = (event->len << 2) + sizeof(event_header);

For the data pointer:

  if (data size > 28 bytes)
	data = &event->array[1];
  else
  	data = &event->array[0];


There are now only 4 internal data types:

0 - Padding
1 - time extent
2 - time stamp
3 - data

This is for internal buffer management only. Other event types should be
pushed to a higher layer, and stored in the data field.
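
To make the layout concrete, here is a small reader-side sketch of how a
data record decodes under this header. It just mirrors the
ring_buffer_event_length()/ring_buffer_event_data() helpers in the patch;
the function names here are made up for illustration, nothing is new API:

/*
 * Example: a 12 byte payload is stored with len = 3 (3 * 4 = 12),
 * so the full record is 12 bytes + 4 byte header = 16 bytes.
 */
static unsigned decode_length(struct ring_buffer_event *event)
{
	if (event->len)		/* small record: len counts 4-byte words */
		return (event->len << 2) + sizeof(struct ring_buffer_event);
	return event->array[0] + sizeof(struct ring_buffer_event);
}

static void *decode_data(struct ring_buffer_event *event)
{
	if (event->len)
		return &event->array[0];	/* data follows the header word */
	return &event->array[1];		/* array[0] holds the length */
}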

The timing is basically the same as v2 but I added a reader side
ring_buffer_normalize_time_stamp() operation. As a test, I multiply the
timestamp by -1 in both the set and normalize operations. Whether this
actually tests anything is another story ;-)


I actually like this header and structure the best. And this may be
what I start working on for real.

So please speak up on this one.

I'm going to take a break from this and start doing my real work.
This will let others soak it up for a bit.

-- Steve




* [RFC PATCH 1/2 v3] Unified trace buffer
  2008-09-25 18:51 [RFC PATCH 0/2 v3] Unified trace buffer Steven Rostedt
@ 2008-09-25 18:51 ` Steven Rostedt
  2008-09-26  1:02   ` [RFC PATCH v4] " Steven Rostedt
  2008-09-25 18:51 ` [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer Steven Rostedt
  1 sibling, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-25 18:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

[-- Attachment #1: ring-buffer.patch --]
[-- Type: text/plain, Size: 38336 bytes --]

This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.

But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.

Now that it boots and runs (albeit a bit buggy), I decided to post it.
This is one idea I had for handling it.

I tried to make it as simple as possible.

I'm not going to explain all the stuff I'm doing here, since this code
is under a lot of flux (RFC, POC work), and I don't want to keep updating
this change log. When we finally agree on something, I'll make this
change log worthy.

If you want to know what this patch does, the code below explains it :-p
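
For reviewers who want a feel for the calling convention before diving into
the diff, here is a minimal, untested usage sketch put together only from the
declarations in ring_buffer.h below. The payload struct and function names
are made up for illustration; everything else is the API this patch adds:

#include <linux/ring_buffer.h>
#include <linux/smp.h>

/* Hypothetical payload; only the ring_buffer_* calls come from this patch. */
struct example_entry {
	unsigned long ip;
};

static void example_usage(void)
{
	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	struct example_entry *entry;
	unsigned long flags;
	u64 ts;

	/* size is per CPU buffer, in bytes */
	buffer = ring_buffer_alloc(65536, RB_FL_OVERWRITE);
	if (!buffer)
		return;

	/* writer: reserve space, fill it in, then commit */
	entry = ring_buffer_lock_reserve(buffer, sizeof(*entry), &flags);
	if (entry) {
		entry->ip = 0;	/* whatever the tracer wants to record */
		ring_buffer_unlock_commit(buffer, entry, flags);
	}

	/* reader: consume the oldest event on this CPU (assumes we stayed on it) */
	event = ring_buffer_consume(buffer, raw_smp_processor_id(), &ts);
	if (event)
		entry = ring_buffer_event_data(event);
		/* ... process *entry ... */

	ring_buffer_free(buffer);
}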

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  175 ++++++
 kernel/trace/Kconfig        |    3 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1218 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1397 insertions(+)

Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h	2008-09-25 13:59:09.000000000 -0400
@@ -0,0 +1,175 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+} __attribute__((__packed__));
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable, depending on how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ *
+ * Note, if the length of the event is more than 256 bytes, the
+ * length field is stored in the body. We need to return
+ * after the length field in that case.
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+void *ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			       unsigned long length,
+			       unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      void *data, unsigned long flags);
+void *ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_disable(struct ring_buffer *buffer);
+void ring_buffer_enable(struct ring_buffer *buffer);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-compile.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/kernel/trace/ring_buffer.c	2008-09-25 14:30:12.000000000 -0400
@@ -0,0 +1,1218 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+#define sdr_print(x, y...) printk("%s:%d " x "\n", __FUNCTION__, __LINE__, y)
+
+/* FIXME!!! */
+unsigned long long
+ring_buffer_time_stamp(int cpu)
+{
+	/* mult -1 to test normalize */
+	return sched_clock() * -1;
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	*ts *= -1;
+}
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	~TS_MASK
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ * Plus, a time stamp delta of (-1) is a special flag.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if ((delta + 1) & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+struct buffer_page {
+	u64		time_stamp;
+	unsigned char	body[];
+};
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct buffer_page	**pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	unsigned long		head_page;
+	unsigned long		tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			last_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	atomic_t		record_disabled;
+
+	spinlock_t		lock;
+	struct mutex		mutex;
+
+	/* FIXME: this should be online CPUS */
+	struct ring_buffer_per_cpu *buffers[NR_CPUS];
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	unsigned long			head_page;
+	u64				read_stamp;
+};
+
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int pages = buffer->pages;
+	int i;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
+					       cache_line_size()), GFP_KERNEL,
+					 cpu_to_node(cpu));
+	if (!cpu_buffer->pages)
+		goto fail_free_buffer;
+
+	for (i = 0; i < pages; i++) {
+		cpu_buffer->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
+		if (!cpu_buffer->pages[i])
+			goto fail_free_pages;
+	}
+
+	return cpu_buffer;
+
+ fail_free_pages:
+	for (i = 0; i < pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	int i;
+
+	for (i = 0; i < cpu_buffer->buffer->pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+	kfree(cpu_buffer);
+}
+
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = (size + (PAGE_SIZE - 1)) / PAGE_SIZE;
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	/* FIXME: do for only online CPUS */
+	buffer->cpus = num_possible_cpus();
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	spin_lock_init(&buffer->lock);
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	/* FIXME: */
+	return -1;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_body(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned long page, unsigned index)
+{
+	return cpu_buffer->pages[page]->body + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_body(cpu_buffer, cpu_buffer->head_page,
+			    cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	return rb_page_body(cpu_buffer, iter->head_page,
+			    iter->head);
+}
+
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_body(cpu_buffer, cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer *buffer,
+		     unsigned long *page)
+{
+	(*page)++;
+	if (*page >= buffer->pages)
+		*page = 0;
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	struct buffer_page *bpage;
+
+	bpage = cpu_buffer->pages[cpu_buffer->tail_page];
+	bpage->time_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *bpage;
+
+	cpu_buffer->head = 0;
+	bpage = cpu_buffer->pages[cpu_buffer->head_page];
+	cpu_buffer->read_stamp = bpage->time_stamp;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+	struct buffer_page *bpage;
+
+	iter->head = 0;
+	bpage = cpu_buffer->pages[iter->head_page];
+	iter->read_stamp = bpage->time_stamp;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+		/* ignore fixed size types */
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	unsigned long head_page, tail_page, tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(head_page >= buffer->pages);
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		unsigned long next_page = tail_page;
+
+		ring_buffer_inc_page(buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_body(cpu_buffer, tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_body(cpu_buffer, tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+	cpu_buffer->entries++;
+
+	return event;
+}
+
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->last_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add
+			 * a new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (!cpu_buffer->tail) {
+				/*
+				 * new page, don't commit this and add the
+				 * time stamp to the page instead.
+				 */
+				rb_add_stamp(cpu_buffer, &ts);
+			} else {
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+			}
+
+			cpu_buffer->last_stamp = ts;
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+	cpu_buffer->last_stamp = ts;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a location on the ring buffer to copy directly to.
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+void *ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			       unsigned long length,
+			       unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return ring_buffer_event_data(event);
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved entry
+ * @buffer: The buffer to commit to
+ * @data: The data pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer, void *data, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u32 *array = data;
+	int cpu = raw_smp_processor_id();
+
+	/*
+	 * If the data was larger than max small size, the array[0] will
+	 * hold the length, which must be less than PAGE_SIZE.
+	 * Since the type field is in the MSB, and must not be zero
+	 * we can test that to see if this entry is a large entry
+	 * or not.
+	 */
+	array--;
+	if (*array < PAGE_SIZE)
+		array--;	/* this is large data */
+	event = (struct ring_buffer_event *)array;
+
+	cpu_buffer = buffer->buffers[cpu];
+	cpu_buffer->tail += ring_buffer_event_length(event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @event_type: The event type to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+void *ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *ret = NULL;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	ret = ring_buffer_event_data(event);
+
+	memcpy(ret, data, length);
+	cpu_buffer->tail += event_length;
+
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	iter->head_page = 0;
+	iter->head = 0;
+}
+
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 * For fixed length, we need to check if we can fit
+	 *  another entry on the page.
+	 * Otherwise we need to see if the end is a null
+	 *  pointer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 * For fixed length, we need to check if we can fit
+	 *  another entry on the page.
+	 * Otherwise we need to see if the end is a null
+	 *  pointer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the iterator if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @iter_next_cpu: The CPU that the next event belongs on
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read by the iterator
+ * @iter: The ring buffer iterator
+ * @iter_next_cpu: The CPU that the next event belongs on
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning, that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @iter_flags: control flags on how to read the buffer.
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * The iter_flags of RB_ITER_FL_SNAP will read the snapshot image
+ * and not the main buffer.
+ *
+ * Must be paired with ring_buffer_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @cpu: The cpu buffer to read from.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page = cpu_buffer->tail_page = 0;
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all per cpu buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ * This resets every per CPU buffer of @buffer.
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-compile.git/kernel/trace/Kconfig
===================================================================
--- linux-compile.git.orig/kernel/trace/Kconfig	2008-09-24 13:21:18.000000000 -0400
+++ linux-compile.git/kernel/trace/Kconfig	2008-09-24 19:31:01.000000000 -0400
@@ -15,6 +15,9 @@ config TRACING
 	select DEBUG_FS
 	select STACKTRACE
 
+config RING_BUFFER
+	bool "ring buffer"
+
 config FTRACE
 	bool "Kernel Function Tracer"
 	depends on HAVE_FTRACE
Index: linux-compile.git/kernel/trace/Makefile
===================================================================
--- linux-compile.git.orig/kernel/trace/Makefile	2008-09-24 13:21:18.000000000 -0400
+++ linux-compile.git/kernel/trace/Makefile	2008-09-24 19:31:01.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

-- 


* [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer
  2008-09-25 18:51 [RFC PATCH 0/2 v3] Unified trace buffer Steven Rostedt
  2008-09-25 18:51 ` [RFC PATCH 1/2 " Steven Rostedt
@ 2008-09-25 18:51 ` Steven Rostedt
  1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-25 18:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt

[-- Attachment #1: ftrace-ring-buffer-take-two.patch --]
[-- Type: text/plain, Size: 40851 bytes --]

Note: This patch is a proof of concept, and breaks a lot of
 functionality of ftrace.

This patch simply makes ftrace work with the developmental ring buffer.
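
The conversion this patch applies over and over is easiest to see side by
side. Roughly (paraphrasing the hunks below, not a literal excerpt):

	/* before: open-coded per-CPU locking and page handling */
	raw_local_irq_save(irq_flags);
	__raw_spin_lock(&data->lock);
	entry = tracing_get_trace_entry(tr, data);
	/* ... fill in *entry ... */
	__raw_spin_unlock(&data->lock);
	raw_local_irq_restore(irq_flags);

	/* after: the ring buffer does the locking and paging */
	entry = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
	if (!entry)
		return;
	/* ... fill in *entry ... */
	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);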

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/trace.c              |  776 ++++++++------------------------------
 kernel/trace/trace.h              |   22 -
 kernel/trace/trace_functions.c    |    2 
 kernel/trace/trace_irqsoff.c      |    6 
 kernel/trace/trace_mmiotrace.c    |   10 
 kernel/trace/trace_sched_switch.c |    2 
 kernel/trace/trace_sched_wakeup.c |    2 
 7 files changed, 195 insertions(+), 625 deletions(-)

Index: linux-compile.git/kernel/trace/trace.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace.c	2008-09-25 12:34:23.000000000 -0400
@@ -31,25 +31,24 @@
 #include <linux/writeback.h>
 
 #include <linux/stacktrace.h>
+#include <linux/ring_buffer.h>
 
 #include "trace.h"
 
+#define sdr_print(x, y...) printk("%s:%d " x "\n", __FUNCTION__, __LINE__, y)
+
+#define TRACE_BUFFER_FLAGS	(RB_FL_OVERWRITE)
+
 unsigned long __read_mostly	tracing_max_latency = (cycle_t)ULONG_MAX;
 unsigned long __read_mostly	tracing_thresh;
 
-static unsigned long __read_mostly	tracing_nr_buffers;
 static cpumask_t __read_mostly		tracing_buffer_mask;
 
 #define for_each_tracing_cpu(cpu)	\
 	for_each_cpu_mask(cpu, tracing_buffer_mask)
 
-static int trace_alloc_page(void);
-static int trace_free_page(void);
-
 static int tracing_disabled = 1;
 
-static unsigned long tracing_pages_allocated;
-
 long
 ns2usecs(cycle_t nsec)
 {
@@ -100,11 +99,11 @@ static int			tracer_enabled = 1;
 int				ftrace_function_enabled;
 
 /*
- * trace_nr_entries is the number of entries that is allocated
- * for a buffer. Note, the number of entries is always rounded
- * to ENTRIES_PER_PAGE.
+ * trace_buf_size is the size in bytes that is allocated
+ * for a buffer. Note, the number of bytes is always rounded
+ * to page size.
  */
-static unsigned long		trace_nr_entries = 65536UL;
+static unsigned long		trace_buf_size = 65536UL;
 
 /* trace_types holds a link list of available tracers. */
 static struct tracer		*trace_types __read_mostly;
@@ -139,8 +138,8 @@ static notrace void no_trace_init(struct
 
 	ftrace_function_enabled = 0;
 	if(tr->ctrl)
-		for_each_online_cpu(cpu)
-			tracing_reset(tr->data[cpu]);
+		for_each_tracing_cpu(cpu)
+			tracing_reset(tr, cpu);
 	tracer_enabled = 0;
 }
 
@@ -167,23 +166,21 @@ void trace_wake_up(void)
 		wake_up(&trace_wait);
 }
 
-#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct trace_entry))
-
-static int __init set_nr_entries(char *str)
+static int __init set_buf_size(char *str)
 {
-	unsigned long nr_entries;
+	unsigned long buf_size;
 	int ret;
 
 	if (!str)
 		return 0;
-	ret = strict_strtoul(str, 0, &nr_entries);
+	ret = strict_strtoul(str, 0, &buf_size);
 	/* nr_entries can not be zero */
-	if (ret < 0 || nr_entries == 0)
+	if (ret < 0 || buf_size == 0)
 		return 0;
-	trace_nr_entries = nr_entries;
+	trace_buf_size = buf_size;
 	return 1;
 }
-__setup("trace_entries=", set_nr_entries);
+__setup("trace_buf_size=", set_buf_size);
 
 unsigned long nsecs_to_usecs(unsigned long nsecs)
 {
@@ -266,54 +263,6 @@ __update_max_tr(struct trace_array *tr, 
 	tracing_record_cmdline(current);
 }
 
-#define CHECK_COND(cond)			\
-	if (unlikely(cond)) {			\
-		tracing_disabled = 1;		\
-		WARN_ON(1);			\
-		return -1;			\
-	}
-
-/**
- * check_pages - integrity check of trace buffers
- *
- * As a safty measure we check to make sure the data pages have not
- * been corrupted.
- */
-int check_pages(struct trace_array_cpu *data)
-{
-	struct page *page, *tmp;
-
-	CHECK_COND(data->trace_pages.next->prev != &data->trace_pages);
-	CHECK_COND(data->trace_pages.prev->next != &data->trace_pages);
-
-	list_for_each_entry_safe(page, tmp, &data->trace_pages, lru) {
-		CHECK_COND(page->lru.next->prev != &page->lru);
-		CHECK_COND(page->lru.prev->next != &page->lru);
-	}
-
-	return 0;
-}
-
-/**
- * head_page - page address of the first page in per_cpu buffer.
- *
- * head_page returns the page address of the first page in
- * a per_cpu buffer. This also preforms various consistency
- * checks to make sure the buffer has not been corrupted.
- */
-void *head_page(struct trace_array_cpu *data)
-{
-	struct page *page;
-
-	if (list_empty(&data->trace_pages))
-		return NULL;
-
-	page = list_entry(data->trace_pages.next, struct page, lru);
-	BUG_ON(&page->lru == &data->trace_pages);
-
-	return page_address(page);
-}
-
 /**
  * trace_seq_printf - sequence printing of trace information
  * @s: trace sequence descriptor
@@ -460,34 +409,6 @@ trace_print_seq(struct seq_file *m, stru
 	trace_seq_reset(s);
 }
 
-/*
- * flip the trace buffers between two trace descriptors.
- * This usually is the buffers between the global_trace and
- * the max_tr to record a snapshot of a current trace.
- *
- * The ftrace_max_lock must be held.
- */
-static void
-flip_trace(struct trace_array_cpu *tr1, struct trace_array_cpu *tr2)
-{
-	struct list_head flip_pages;
-
-	INIT_LIST_HEAD(&flip_pages);
-
-	memcpy(&tr1->trace_head_idx, &tr2->trace_head_idx,
-		sizeof(struct trace_array_cpu) -
-		offsetof(struct trace_array_cpu, trace_head_idx));
-
-	check_pages(tr1);
-	check_pages(tr2);
-	list_splice_init(&tr1->trace_pages, &flip_pages);
-	list_splice_init(&tr2->trace_pages, &tr1->trace_pages);
-	list_splice_init(&flip_pages, &tr2->trace_pages);
-	BUG_ON(!list_empty(&flip_pages));
-	check_pages(tr1);
-	check_pages(tr2);
-}
-
 /**
  * update_max_tr - snapshot all trace buffers from global_trace to max_tr
  * @tr: tracer
@@ -500,17 +421,15 @@ flip_trace(struct trace_array_cpu *tr1, 
 void
 update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
 {
-	struct trace_array_cpu *data;
-	int i;
+	struct ring_buffer *buf = tr->buffer;
 
 	WARN_ON_ONCE(!irqs_disabled());
 	__raw_spin_lock(&ftrace_max_lock);
-	/* clear out all the previous traces */
-	for_each_tracing_cpu(i) {
-		data = tr->data[i];
-		flip_trace(max_tr.data[i], data);
-		tracing_reset(data);
-	}
+
+	tr->buffer = max_tr.buffer;
+	max_tr.buffer = buf;
+
+	ring_buffer_reset(tr->buffer);
 
 	__update_max_tr(tr, tsk, cpu);
 	__raw_spin_unlock(&ftrace_max_lock);
@@ -527,16 +446,15 @@ update_max_tr(struct trace_array *tr, st
 void
 update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
 {
-	struct trace_array_cpu *data = tr->data[cpu];
-	int i;
+	int ret;
 
 	WARN_ON_ONCE(!irqs_disabled());
 	__raw_spin_lock(&ftrace_max_lock);
-	for_each_tracing_cpu(i)
-		tracing_reset(max_tr.data[i]);
 
-	flip_trace(max_tr.data[cpu], data);
-	tracing_reset(data);
+	ring_buffer_reset(max_tr.buffer);
+	ret = ring_buffer_swap_cpu(max_tr.buffer, tr->buffer, cpu);
+
+	WARN_ON_ONCE(ret);
 
 	__update_max_tr(tr, tsk, cpu);
 	__raw_spin_unlock(&ftrace_max_lock);
@@ -573,7 +491,6 @@ int register_tracer(struct tracer *type)
 #ifdef CONFIG_FTRACE_STARTUP_TEST
 	if (type->selftest) {
 		struct tracer *saved_tracer = current_trace;
-		struct trace_array_cpu *data;
 		struct trace_array *tr = &global_trace;
 		int saved_ctrl = tr->ctrl;
 		int i;
@@ -585,10 +502,7 @@ int register_tracer(struct tracer *type)
 		 * If we fail, we do not register this tracer.
 		 */
 		for_each_tracing_cpu(i) {
-			data = tr->data[i];
-			if (!head_page(data))
-				continue;
-			tracing_reset(data);
+			tracing_reset(tr, i);
 		}
 		current_trace = type;
 		tr->ctrl = 0;
@@ -604,10 +518,7 @@ int register_tracer(struct tracer *type)
 		}
 		/* Only reset on passing, to avoid touching corrupted buffers */
 		for_each_tracing_cpu(i) {
-			data = tr->data[i];
-			if (!head_page(data))
-				continue;
-			tracing_reset(data);
+			tracing_reset(tr, i);
 		}
 		printk(KERN_CONT "PASSED\n");
 	}
@@ -653,13 +564,9 @@ void unregister_tracer(struct tracer *ty
 	mutex_unlock(&trace_types_lock);
 }
 
-void tracing_reset(struct trace_array_cpu *data)
+void tracing_reset(struct trace_array *tr, int cpu)
 {
-	data->trace_idx = 0;
-	data->overrun = 0;
-	data->trace_head = data->trace_tail = head_page(data);
-	data->trace_head_idx = 0;
-	data->trace_tail_idx = 0;
+	ring_buffer_reset_cpu(tr->buffer, cpu);
 }
 
 #define SAVED_CMDLINES 128
@@ -745,70 +652,6 @@ void tracing_record_cmdline(struct task_
 	trace_save_cmdline(tsk);
 }
 
-static inline struct list_head *
-trace_next_list(struct trace_array_cpu *data, struct list_head *next)
-{
-	/*
-	 * Roundrobin - but skip the head (which is not a real page):
-	 */
-	next = next->next;
-	if (unlikely(next == &data->trace_pages))
-		next = next->next;
-	BUG_ON(next == &data->trace_pages);
-
-	return next;
-}
-
-static inline void *
-trace_next_page(struct trace_array_cpu *data, void *addr)
-{
-	struct list_head *next;
-	struct page *page;
-
-	page = virt_to_page(addr);
-
-	next = trace_next_list(data, &page->lru);
-	page = list_entry(next, struct page, lru);
-
-	return page_address(page);
-}
-
-static inline struct trace_entry *
-tracing_get_trace_entry(struct trace_array *tr, struct trace_array_cpu *data)
-{
-	unsigned long idx, idx_next;
-	struct trace_entry *entry;
-
-	data->trace_idx++;
-	idx = data->trace_head_idx;
-	idx_next = idx + 1;
-
-	BUG_ON(idx * TRACE_ENTRY_SIZE >= PAGE_SIZE);
-
-	entry = data->trace_head + idx * TRACE_ENTRY_SIZE;
-
-	if (unlikely(idx_next >= ENTRIES_PER_PAGE)) {
-		data->trace_head = trace_next_page(data, data->trace_head);
-		idx_next = 0;
-	}
-
-	if (data->trace_head == data->trace_tail &&
-	    idx_next == data->trace_tail_idx) {
-		/* overrun */
-		data->overrun++;
-		data->trace_tail_idx++;
-		if (data->trace_tail_idx >= ENTRIES_PER_PAGE) {
-			data->trace_tail =
-				trace_next_page(data, data->trace_tail);
-			data->trace_tail_idx = 0;
-		}
-	}
-
-	data->trace_head_idx = idx_next;
-
-	return entry;
-}
-
 static inline void
 tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags)
 {
@@ -819,7 +662,6 @@ tracing_generic_entry_update(struct trac
 
 	entry->preempt_count	= pc & 0xff;
 	entry->pid		= (tsk) ? tsk->pid : 0;
-	entry->t		= ftrace_now(raw_smp_processor_id());
 	entry->flags = (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
 		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
@@ -833,15 +675,14 @@ trace_function(struct trace_array *tr, s
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_FN;
 	entry->fn.ip		= ip;
 	entry->fn.parent_ip	= parent_ip;
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 }
 
 void
@@ -859,16 +700,13 @@ void __trace_mmiotrace_rw(struct trace_a
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_MMIO_RW;
 	entry->mmiorw		= *rw;
-
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 
 	trace_wake_up();
 }
@@ -879,16 +717,13 @@ void __trace_mmiotrace_map(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_MMIO_MAP;
 	entry->mmiomap		= *map;
-
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 
 	trace_wake_up();
 }
@@ -901,11 +736,14 @@ void __trace_stack(struct trace_array *t
 {
 	struct trace_entry *entry;
 	struct stack_trace trace;
+	unsigned long irq_flags;
 
 	if (!(trace_flags & TRACE_ITER_STACKTRACE))
 		return;
 
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_STACK;
 
@@ -917,6 +755,7 @@ void __trace_stack(struct trace_array *t
 	trace.entries		= entry->stack.caller;
 
 	save_stack_trace(&trace);
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 }
 
 void
@@ -928,17 +767,16 @@ __trace_special(void *__tr, void *__data
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, 0);
 	entry->type		= TRACE_SPECIAL;
 	entry->special.arg1	= arg1;
 	entry->special.arg2	= arg2;
 	entry->special.arg3	= arg3;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, irq_flags, 4);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 
 	trace_wake_up();
 }
@@ -953,9 +791,9 @@ tracing_sched_switch_trace(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_CTX;
 	entry->ctx.prev_pid	= prev->pid;
@@ -964,9 +802,8 @@ tracing_sched_switch_trace(struct trace_
 	entry->ctx.next_pid	= next->pid;
 	entry->ctx.next_prio	= next->prio;
 	entry->ctx.next_state	= next->state;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, flags, 5);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 }
 
 void
@@ -979,9 +816,9 @@ tracing_sched_wakeup_trace(struct trace_
 	struct trace_entry *entry;
 	unsigned long irq_flags;
 
-	raw_local_irq_save(irq_flags);
-	__raw_spin_lock(&data->lock);
-	entry			= tracing_get_trace_entry(tr, data);
+	entry	= ring_buffer_lock_reserve(tr->buffer, sizeof(*entry), &irq_flags);
+	if (!entry)
+		return;
 	tracing_generic_entry_update(entry, flags);
 	entry->type		= TRACE_WAKE;
 	entry->ctx.prev_pid	= curr->pid;
@@ -990,9 +827,8 @@ tracing_sched_wakeup_trace(struct trace_
 	entry->ctx.next_pid	= wakee->pid;
 	entry->ctx.next_prio	= wakee->prio;
 	entry->ctx.next_state	= wakee->state;
+	ring_buffer_unlock_commit(tr->buffer, entry, irq_flags);
 	__trace_stack(tr, data, flags, 6);
-	__raw_spin_unlock(&data->lock);
-	raw_local_irq_restore(irq_flags);
 
 	trace_wake_up();
 }
@@ -1074,105 +910,66 @@ enum trace_file_type {
 };
 
 static struct trace_entry *
-trace_entry_idx(struct trace_array *tr, struct trace_array_cpu *data,
-		struct trace_iterator *iter, int cpu)
-{
-	struct page *page;
-	struct trace_entry *array;
-
-	if (iter->next_idx[cpu] >= tr->entries ||
-	    iter->next_idx[cpu] >= data->trace_idx ||
-	    (data->trace_head == data->trace_tail &&
-	     data->trace_head_idx == data->trace_tail_idx))
-		return NULL;
-
-	if (!iter->next_page[cpu]) {
-		/* Initialize the iterator for this cpu trace buffer */
-		WARN_ON(!data->trace_tail);
-		page = virt_to_page(data->trace_tail);
-		iter->next_page[cpu] = &page->lru;
-		iter->next_page_idx[cpu] = data->trace_tail_idx;
-	}
-
-	page = list_entry(iter->next_page[cpu], struct page, lru);
-	BUG_ON(&data->trace_pages == &page->lru);
-
-	array = page_address(page);
-
-	WARN_ON(iter->next_page_idx[cpu] >= ENTRIES_PER_PAGE);
-	return &array[iter->next_page_idx[cpu]];
-}
-
-static struct trace_entry *
-find_next_entry(struct trace_iterator *iter, int *ent_cpu)
+find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts)
 {
-	struct trace_array *tr = iter->tr;
+	struct ring_buffer *buffer = iter->tr->buffer;
+	struct ring_buffer_event *event;
 	struct trace_entry *ent, *next = NULL;
+	u64 next_ts = 0, ts;
 	int next_cpu = -1;
 	int cpu;
 
 	for_each_tracing_cpu(cpu) {
-		if (!head_page(tr->data[cpu]))
+		struct ring_buffer_iter *buf_iter;
+
+		if (ring_buffer_empty_cpu(buffer, cpu))
 			continue;
-		ent = trace_entry_idx(tr, tr->data[cpu], iter, cpu);
+
+		buf_iter = iter->buffer_iter[cpu];
+		event = ring_buffer_iter_peek(buf_iter, &ts);
+		ent = event ? ring_buffer_event_data(event) : NULL;
+
 		/*
 		 * Pick the entry with the smallest timestamp:
 		 */
-		if (ent && (!next || ent->t < next->t)) {
+		if (ent && (!next || ts < next_ts)) {
 			next = ent;
 			next_cpu = cpu;
+			next_ts = ts;
 		}
 	}
 
 	if (ent_cpu)
 		*ent_cpu = next_cpu;
 
+	if (ent_ts)
+		*ent_ts = next_ts;
+
 	return next;
 }
 
 static void trace_iterator_increment(struct trace_iterator *iter)
 {
 	iter->idx++;
-	iter->next_idx[iter->cpu]++;
-	iter->next_page_idx[iter->cpu]++;
-
-	if (iter->next_page_idx[iter->cpu] >= ENTRIES_PER_PAGE) {
-		struct trace_array_cpu *data = iter->tr->data[iter->cpu];
-
-		iter->next_page_idx[iter->cpu] = 0;
-		iter->next_page[iter->cpu] =
-			trace_next_list(data, iter->next_page[iter->cpu]);
-	}
+	ring_buffer_read(iter->buffer_iter[iter->cpu], NULL);
 }
 
 static void trace_consume(struct trace_iterator *iter)
 {
-	struct trace_array_cpu *data = iter->tr->data[iter->cpu];
-
-	data->trace_tail_idx++;
-	if (data->trace_tail_idx >= ENTRIES_PER_PAGE) {
-		data->trace_tail = trace_next_page(data, data->trace_tail);
-		data->trace_tail_idx = 0;
-	}
-
-	/* Check if we empty it, then reset the index */
-	if (data->trace_head == data->trace_tail &&
-	    data->trace_head_idx == data->trace_tail_idx)
-		data->trace_idx = 0;
+	ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts);
 }
 
 static void *find_next_entry_inc(struct trace_iterator *iter)
 {
 	struct trace_entry *next;
 	int next_cpu = -1;
+	u64 ts;
 
-	next = find_next_entry(iter, &next_cpu);
-
-	iter->prev_ent = iter->ent;
-	iter->prev_cpu = iter->cpu;
+	next = find_next_entry(iter, &next_cpu, &ts);
 
 	iter->ent = next;
 	iter->cpu = next_cpu;
+	iter->ts = ts;
 
 	if (next)
 		trace_iterator_increment(iter);
@@ -1210,7 +1007,7 @@ static void *s_start(struct seq_file *m,
 	struct trace_iterator *iter = m->private;
 	void *p = NULL;
 	loff_t l = 0;
-	int i;
+	int cpu;
 
 	mutex_lock(&trace_types_lock);
 
@@ -1229,12 +1026,9 @@ static void *s_start(struct seq_file *m,
 		iter->ent = NULL;
 		iter->cpu = 0;
 		iter->idx = -1;
-		iter->prev_ent = NULL;
-		iter->prev_cpu = -1;
 
-		for_each_tracing_cpu(i) {
-			iter->next_idx[i] = 0;
-			iter->next_page[i] = NULL;
+		for_each_tracing_cpu(cpu) {
+			ring_buffer_iter_reset(iter->buffer_iter[cpu]);
 		}
 
 		for (p = iter; p && l < *pos; p = s_next(m, p, &l))
@@ -1357,21 +1151,12 @@ print_trace_header(struct seq_file *m, s
 	struct tracer *type = current_trace;
 	unsigned long total   = 0;
 	unsigned long entries = 0;
-	int cpu;
 	const char *name = "preemption";
 
 	if (type)
 		name = type->name;
 
-	for_each_tracing_cpu(cpu) {
-		if (head_page(tr->data[cpu])) {
-			total += tr->data[cpu]->trace_idx;
-			if (tr->data[cpu]->trace_idx > tr->entries)
-				entries += tr->entries;
-			else
-				entries += tr->data[cpu]->trace_idx;
-		}
-	}
+	entries = ring_buffer_entries(iter->tr->buffer);
 
 	seq_printf(m, "%s latency trace v1.1.5 on %s\n",
 		   name, UTS_RELEASE);
@@ -1457,7 +1242,7 @@ lat_print_generic(struct trace_seq *s, s
 unsigned long preempt_mark_thresh = 100;
 
 static void
-lat_print_timestamp(struct trace_seq *s, unsigned long long abs_usecs,
+lat_print_timestamp(struct trace_seq *s, u64 abs_usecs,
 		    unsigned long rel_usecs)
 {
 	trace_seq_printf(s, " %4lldus", abs_usecs);
@@ -1476,20 +1261,22 @@ print_lat_fmt(struct trace_iterator *ite
 {
 	struct trace_seq *s = &iter->seq;
 	unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK);
-	struct trace_entry *next_entry = find_next_entry(iter, NULL);
+	struct trace_entry *next_entry;
 	unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE);
 	struct trace_entry *entry = iter->ent;
 	unsigned long abs_usecs;
 	unsigned long rel_usecs;
+	u64 next_ts;
 	char *comm;
 	int S, T;
 	int i;
 	unsigned state;
 
+	next_entry = find_next_entry(iter, NULL, &next_ts);
 	if (!next_entry)
-		next_entry = entry;
-	rel_usecs = ns2usecs(next_entry->t - entry->t);
-	abs_usecs = ns2usecs(entry->t - iter->tr->time_start);
+		next_ts = iter->ts;
+	rel_usecs = ns2usecs(next_ts - iter->ts);
+	abs_usecs = ns2usecs(iter->ts - iter->tr->time_start);
 
 	if (verbose) {
 		comm = trace_find_cmdline(entry->pid);
@@ -1498,7 +1285,7 @@ print_lat_fmt(struct trace_iterator *ite
 				 comm,
 				 entry->pid, cpu, entry->flags,
 				 entry->preempt_count, trace_idx,
-				 ns2usecs(entry->t),
+				 ns2usecs(iter->ts),
 				 abs_usecs/1000,
 				 abs_usecs % 1000, rel_usecs/1000,
 				 rel_usecs % 1000);
@@ -1569,7 +1356,7 @@ static int print_trace_fmt(struct trace_
 
 	comm = trace_find_cmdline(iter->ent->pid);
 
-	t = ns2usecs(entry->t);
+	t = ns2usecs(iter->ts);
 	usec_rem = do_div(t, 1000000ULL);
 	secs = (unsigned long)t;
 
@@ -1660,7 +1447,7 @@ static int print_raw_fmt(struct trace_it
 	entry = iter->ent;
 
 	ret = trace_seq_printf(s, "%d %d %llu ",
-		entry->pid, iter->cpu, entry->t);
+		entry->pid, iter->cpu, iter->ts);
 	if (!ret)
 		return 0;
 
@@ -1725,7 +1512,7 @@ static int print_hex_fmt(struct trace_it
 
 	SEQ_PUT_HEX_FIELD_RET(s, entry->pid);
 	SEQ_PUT_HEX_FIELD_RET(s, iter->cpu);
-	SEQ_PUT_HEX_FIELD_RET(s, entry->t);
+	SEQ_PUT_HEX_FIELD_RET(s, iter->ts);
 
 	switch (entry->type) {
 	case TRACE_FN:
@@ -1769,7 +1556,7 @@ static int print_bin_fmt(struct trace_it
 
 	SEQ_PUT_FIELD_RET(s, entry->pid);
 	SEQ_PUT_FIELD_RET(s, entry->cpu);
-	SEQ_PUT_FIELD_RET(s, entry->t);
+	SEQ_PUT_FIELD_RET(s, iter->ts);
 
 	switch (entry->type) {
 	case TRACE_FN:
@@ -1796,16 +1583,10 @@ static int print_bin_fmt(struct trace_it
 
 static int trace_empty(struct trace_iterator *iter)
 {
-	struct trace_array_cpu *data;
 	int cpu;
 
 	for_each_tracing_cpu(cpu) {
-		data = iter->tr->data[cpu];
-
-		if (head_page(data) && data->trace_idx &&
-		    (data->trace_tail != data->trace_head ||
-		     data->trace_tail_idx != data->trace_head_idx))
-			return 0;
+		if (!ring_buffer_iter_empty(iter->buffer_iter[cpu]))
+			return 0;
 	}
 	return 1;
 }
@@ -1869,6 +1650,8 @@ static struct trace_iterator *
 __tracing_open(struct inode *inode, struct file *file, int *ret)
 {
 	struct trace_iterator *iter;
+	struct seq_file *m;
+	int cpu;
 
 	if (tracing_disabled) {
 		*ret = -ENODEV;
@@ -1889,28 +1672,43 @@ __tracing_open(struct inode *inode, stru
 	iter->trace = current_trace;
 	iter->pos = -1;
 
+	for_each_tracing_cpu(cpu) {
+		iter->buffer_iter[cpu] =
+			ring_buffer_read_start(iter->tr->buffer, cpu);
+		if (!iter->buffer_iter[cpu])
+			goto fail_buffer;
+	}
+
 	/* TODO stop tracer */
 	*ret = seq_open(file, &tracer_seq_ops);
-	if (!*ret) {
-		struct seq_file *m = file->private_data;
-		m->private = iter;
+	if (*ret)
+		goto fail_buffer;
 
-		/* stop the trace while dumping */
-		if (iter->tr->ctrl) {
-			tracer_enabled = 0;
-			ftrace_function_enabled = 0;
-		}
+	m = file->private_data;
+	m->private = iter;
 
-		if (iter->trace && iter->trace->open)
-			iter->trace->open(iter);
-	} else {
-		kfree(iter);
-		iter = NULL;
+	/* stop the trace while dumping */
+	if (iter->tr->ctrl) {
+		tracer_enabled = 0;
+		ftrace_function_enabled = 0;
 	}
+
+	if (iter->trace && iter->trace->open)
+		iter->trace->open(iter);
+
 	mutex_unlock(&trace_types_lock);
 
  out:
 	return iter;
+
+ fail_buffer:
+	for_each_tracing_cpu(cpu) {
+		if (iter->buffer_iter[cpu])
+			ring_buffer_read_finish(iter->buffer_iter[cpu]);
+	}
+	mutex_unlock(&trace_types_lock);
+	kfree(iter);
+
+	return ERR_PTR(-ENOMEM);
 }
 
 int tracing_open_generic(struct inode *inode, struct file *filp)
@@ -1926,8 +1724,14 @@ int tracing_release(struct inode *inode,
 {
 	struct seq_file *m = (struct seq_file *)file->private_data;
 	struct trace_iterator *iter = m->private;
+	int cpu;
 
 	mutex_lock(&trace_types_lock);
+	for_each_tracing_cpu(cpu) {
+		if (iter->buffer_iter[cpu])
+			ring_buffer_read_finish(iter->buffer_iter[cpu]);
+	}
+
 	if (iter->trace && iter->trace->close)
 		iter->trace->close(iter);
 
@@ -2500,13 +2304,10 @@ tracing_read_pipe(struct file *filp, cha
 		  size_t cnt, loff_t *ppos)
 {
 	struct trace_iterator *iter = filp->private_data;
-	struct trace_array_cpu *data;
-	static cpumask_t mask;
 	unsigned long flags;
 #ifdef CONFIG_FTRACE
 	int ftrace_save;
 #endif
-	int cpu;
 	ssize_t sret;
 
 	/* return any leftover data */
@@ -2595,32 +2396,13 @@ tracing_read_pipe(struct file *filp, cha
 	 * and then release the locks again.
 	 */
 
-	cpus_clear(mask);
-	local_irq_save(flags);
+	local_irq_disable();
 #ifdef CONFIG_FTRACE
 	ftrace_save = ftrace_enabled;
 	ftrace_enabled = 0;
 #endif
 	smp_wmb();
-	for_each_tracing_cpu(cpu) {
-		data = iter->tr->data[cpu];
-
-		if (!head_page(data) || !data->trace_idx)
-			continue;
-
-		atomic_inc(&data->disabled);
-		cpu_set(cpu, mask);
-	}
-
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		__raw_spin_lock(&data->lock);
-
-		if (data->overrun > iter->last_overrun[cpu])
-			iter->overrun[cpu] +=
-				data->overrun - iter->last_overrun[cpu];
-		iter->last_overrun[cpu] = data->overrun;
-	}
+	ring_buffer_lock(iter->tr->buffer, &flags);
 
 	while (find_next_entry_inc(iter) != NULL) {
 		int ret;
@@ -2639,19 +2421,11 @@ tracing_read_pipe(struct file *filp, cha
 			break;
 	}
 
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		__raw_spin_unlock(&data->lock);
-	}
-
-	for_each_cpu_mask(cpu, mask) {
-		data = iter->tr->data[cpu];
-		atomic_dec(&data->disabled);
-	}
+	ring_buffer_unlock(iter->tr->buffer, flags);
 #ifdef CONFIG_FTRACE
 	ftrace_enabled = ftrace_save;
 #endif
-	local_irq_restore(flags);
+	local_irq_enable();
 
 	/* Now copy what we have to the user */
 	sret = trace_seq_to_user(&iter->seq, ubuf, cnt);
@@ -2684,7 +2458,7 @@ tracing_entries_write(struct file *filp,
 {
 	unsigned long val;
 	char buf[64];
-	int i, ret;
+	int ret;
 
 	if (cnt >= sizeof(buf))
 		return -EINVAL;
@@ -2711,52 +2485,31 @@ tracing_entries_write(struct file *filp,
 		goto out;
 	}
 
-	if (val > global_trace.entries) {
-		long pages_requested;
-		unsigned long freeable_pages;
-
-		/* make sure we have enough memory before mapping */
-		pages_requested =
-			(val + (ENTRIES_PER_PAGE-1)) / ENTRIES_PER_PAGE;
-
-		/* account for each buffer (and max_tr) */
-		pages_requested *= tracing_nr_buffers * 2;
-
-		/* Check for overflow */
-		if (pages_requested < 0) {
-			cnt = -ENOMEM;
+	if (val != global_trace.entries) {
+		ret = ring_buffer_resize(global_trace.buffer, val);
+		if (ret < 0) {
+			cnt = ret;
 			goto out;
 		}
 
-		freeable_pages = determine_dirtyable_memory();
-
-		/* we only allow to request 1/4 of useable memory */
-		if (pages_requested >
-		    ((freeable_pages + tracing_pages_allocated) / 4)) {
-			cnt = -ENOMEM;
-			goto out;
-		}
-
-		while (global_trace.entries < val) {
-			if (trace_alloc_page()) {
-				cnt = -ENOMEM;
-				goto out;
+		ret = ring_buffer_resize(max_tr.buffer, val);
+		if (ret < 0) {
+			int r;
+			cnt = ret;
+			r = ring_buffer_resize(global_trace.buffer,
+					       global_trace.entries);
+			if (r < 0) {
+				/* AARGH! We are left with different
+				 * size max buffer!!!! */
+				WARN_ON(1);
+				tracing_disabled = 1;
 			}
-			/* double check that we don't go over the known pages */
-			if (tracing_pages_allocated > pages_requested)
-				break;
+			goto out;
 		}
 
-	} else {
-		/* include the number of entries in val (inc of page entries) */
-		while (global_trace.entries > val + (ENTRIES_PER_PAGE - 1))
-			trace_free_page();
+		global_trace.entries = val;
 	}
 
-	/* check integrity */
-	for_each_tracing_cpu(i)
-		check_pages(global_trace.data[i]);
-
 	filp->f_pos += cnt;
 
 	/* If check pages failed, return ENOMEM */
@@ -2930,190 +2683,41 @@ static __init void tracer_init_debugfs(v
 #endif
 }
 
-static int trace_alloc_page(void)
+__init static int tracer_alloc_buffers(void)
 {
 	struct trace_array_cpu *data;
-	struct page *page, *tmp;
-	LIST_HEAD(pages);
-	void *array;
-	unsigned pages_allocated = 0;
 	int i;
 
-	/* first allocate a page for each CPU */
-	for_each_tracing_cpu(i) {
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_pages;
-		}
-
-		pages_allocated++;
-		page = virt_to_page(array);
-		list_add(&page->lru, &pages);
+	/* TODO: make the number of buffers hot pluggable with CPUS */
+	tracing_buffer_mask = cpu_possible_map;
 
-/* Only allocate if we are actually using the max trace */
-#ifdef CONFIG_TRACER_MAX_TRACE
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_pages;
-		}
-		pages_allocated++;
-		page = virt_to_page(array);
-		list_add(&page->lru, &pages);
-#endif
+	global_trace.buffer = ring_buffer_alloc(trace_buf_size,
+						   TRACE_BUFFER_FLAGS);
+	if (!global_trace.buffer) {
+		printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
+		WARN_ON(1);
+		return 0;
 	}
-
-	/* Now that we successfully allocate a page per CPU, add them */
-	for_each_tracing_cpu(i) {
-		data = global_trace.data[i];
-		page = list_entry(pages.next, struct page, lru);
-		list_del_init(&page->lru);
-		list_add_tail(&page->lru, &data->trace_pages);
-		ClearPageLRU(page);
+	global_trace.entries = ring_buffer_size(global_trace.buffer);
 
 #ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		page = list_entry(pages.next, struct page, lru);
-		list_del_init(&page->lru);
-		list_add_tail(&page->lru, &data->trace_pages);
-		SetPageLRU(page);
-#endif
-	}
-	tracing_pages_allocated += pages_allocated;
-	global_trace.entries += ENTRIES_PER_PAGE;
-
-	return 0;
-
- free_pages:
-	list_for_each_entry_safe(page, tmp, &pages, lru) {
-		list_del_init(&page->lru);
-		__free_page(page);
+	max_tr.buffer = ring_buffer_alloc(trace_buf_size,
+					     TRACE_BUFFER_FLAGS);
+	if (!max_tr.buffer) {
+		printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
+		WARN_ON(1);
+		ring_buffer_free(global_trace.buffer);
+		return 0;
 	}
-	return -ENOMEM;
-}
-
-static int trace_free_page(void)
-{
-	struct trace_array_cpu *data;
-	struct page *page;
-	struct list_head *p;
-	int i;
-	int ret = 0;
-
-	/* free one page from each buffer */
-	for_each_tracing_cpu(i) {
-		data = global_trace.data[i];
-		p = data->trace_pages.next;
-		if (p == &data->trace_pages) {
-			/* should never happen */
-			WARN_ON(1);
-			tracing_disabled = 1;
-			ret = -1;
-			break;
-		}
-		page = list_entry(p, struct page, lru);
-		ClearPageLRU(page);
-		list_del(&page->lru);
-		tracing_pages_allocated--;
-		tracing_pages_allocated--;
-		__free_page(page);
-
-		tracing_reset(data);
-
-#ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		p = data->trace_pages.next;
-		if (p == &data->trace_pages) {
-			/* should never happen */
-			WARN_ON(1);
-			tracing_disabled = 1;
-			ret = -1;
-			break;
-		}
-		page = list_entry(p, struct page, lru);
-		ClearPageLRU(page);
-		list_del(&page->lru);
-		__free_page(page);
-
-		tracing_reset(data);
+	max_tr.entries = ring_buffer_size(max_tr.buffer);
+	WARN_ON(max_tr.entries != global_trace.entries);
 #endif
-	}
-	global_trace.entries -= ENTRIES_PER_PAGE;
-
-	return ret;
-}
-
-__init static int tracer_alloc_buffers(void)
-{
-	struct trace_array_cpu *data;
-	void *array;
-	struct page *page;
-	int pages = 0;
-	int ret = -ENOMEM;
-	int i;
-
-	/* TODO: make the number of buffers hot pluggable with CPUS */
-	tracing_nr_buffers = num_possible_cpus();
-	tracing_buffer_mask = cpu_possible_map;
 
 	/* Allocate the first page for all buffers */
 	for_each_tracing_cpu(i) {
 		data = global_trace.data[i] = &per_cpu(global_trace_cpu, i);
 		max_tr.data[i] = &per_cpu(max_data, i);
-
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_buffers;
-		}
-
-		/* set the array to the list */
-		INIT_LIST_HEAD(&data->trace_pages);
-		page = virt_to_page(array);
-		list_add(&page->lru, &data->trace_pages);
-		/* use the LRU flag to differentiate the two buffers */
-		ClearPageLRU(page);
-
-		data->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
-		max_tr.data[i]->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
-
-/* Only allocate if we are actually using the max trace */
-#ifdef CONFIG_TRACER_MAX_TRACE
-		array = (void *)__get_free_page(GFP_KERNEL);
-		if (array == NULL) {
-			printk(KERN_ERR "tracer: failed to allocate page"
-			       "for trace buffer!\n");
-			goto free_buffers;
-		}
-
-		INIT_LIST_HEAD(&max_tr.data[i]->trace_pages);
-		page = virt_to_page(array);
-		list_add(&page->lru, &max_tr.data[i]->trace_pages);
-		SetPageLRU(page);
-#endif
-	}
-
-	/*
-	 * Since we allocate by orders of pages, we may be able to
-	 * round up a bit.
-	 */
-	global_trace.entries = ENTRIES_PER_PAGE;
-	pages++;
-
-	while (global_trace.entries < trace_nr_entries) {
-		if (trace_alloc_page())
-			break;
-		pages++;
 	}
-	max_tr.entries = global_trace.entries;
-
-	pr_info("tracer: %d pages allocated for %ld entries of %ld bytes\n",
-		pages, trace_nr_entries, (long)TRACE_ENTRY_SIZE);
-	pr_info("   actual entries %ld\n", global_trace.entries);
 
 	tracer_init_debugfs();
 
@@ -3127,31 +2731,5 @@ __init static int tracer_alloc_buffers(v
 	tracing_disabled = 0;
 
 	return 0;
-
- free_buffers:
-	for (i-- ; i >= 0; i--) {
-		struct page *page, *tmp;
-		struct trace_array_cpu *data = global_trace.data[i];
-
-		if (data) {
-			list_for_each_entry_safe(page, tmp,
-						 &data->trace_pages, lru) {
-				list_del_init(&page->lru);
-				__free_page(page);
-			}
-		}
-
-#ifdef CONFIG_TRACER_MAX_TRACE
-		data = max_tr.data[i];
-		if (data) {
-			list_for_each_entry_safe(page, tmp,
-						 &data->trace_pages, lru) {
-				list_del_init(&page->lru);
-				__free_page(page);
-			}
-		}
-#endif
-	}
-	return ret;
 }
 fs_initcall(tracer_alloc_buffers);
Index: linux-compile.git/kernel/trace/trace.h
===================================================================
--- linux-compile.git.orig/kernel/trace/trace.h	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace.h	2008-09-25 12:34:23.000000000 -0400
@@ -6,6 +6,7 @@
 #include <linux/sched.h>
 #include <linux/clocksource.h>
 #include <linux/mmiotrace.h>
+#include <linux/ring_buffer.h>
 
 enum trace_type {
 	__TRACE_FIRST_TYPE = 0,
@@ -72,7 +73,6 @@ struct trace_entry {
 	char			flags;
 	char			preempt_count;
 	int			pid;
-	cycle_t			t;
 	union {
 		struct ftrace_entry		fn;
 		struct ctx_switch_entry		ctx;
@@ -91,16 +91,9 @@ struct trace_entry {
  * the trace, etc.)
  */
 struct trace_array_cpu {
-	struct list_head	trace_pages;
 	atomic_t		disabled;
-	raw_spinlock_t		lock;
-	struct lock_class_key	lock_key;
 
 	/* these fields get copied into max-trace: */
-	unsigned		trace_head_idx;
-	unsigned		trace_tail_idx;
-	void			*trace_head; /* producer */
-	void			*trace_tail; /* consumer */
 	unsigned long		trace_idx;
 	unsigned long		overrun;
 	unsigned long		saved_latency;
@@ -124,6 +117,7 @@ struct trace_iterator;
  * They have on/off state as well:
  */
 struct trace_array {
+	struct ring_buffer	*buffer;
 	unsigned long		entries;
 	long			ctrl;
 	int			cpu;
@@ -171,26 +165,20 @@ struct trace_iterator {
 	struct trace_array	*tr;
 	struct tracer		*trace;
 	void			*private;
-	long			last_overrun[NR_CPUS];
-	long			overrun[NR_CPUS];
+	struct ring_buffer_iter	*buffer_iter[NR_CPUS];
 
 	/* The below is zeroed out in pipe_read */
 	struct trace_seq	seq;
 	struct trace_entry	*ent;
 	int			cpu;
-
-	struct trace_entry	*prev_ent;
-	int			prev_cpu;
+	u64			ts;
 
 	unsigned long		iter_flags;
 	loff_t			pos;
-	unsigned long		next_idx[NR_CPUS];
-	struct list_head	*next_page[NR_CPUS];
-	unsigned		next_page_idx[NR_CPUS];
 	long			idx;
 };
 
-void tracing_reset(struct trace_array_cpu *data);
+void tracing_reset(struct trace_array *tr, int cpu);
 int tracing_open_generic(struct inode *inode, struct file *filp);
 struct dentry *tracing_init_dentry(void);
 void init_tracer_sysprof_debugfs(struct dentry *d_tracer);
Index: linux-compile.git/kernel/trace/trace_functions.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_functions.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_functions.c	2008-09-25 12:34:23.000000000 -0400
@@ -23,7 +23,7 @@ static void function_reset(struct trace_
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static void start_function_trace(struct trace_array *tr)
Index: linux-compile.git/kernel/trace/trace_irqsoff.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_irqsoff.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_irqsoff.c	2008-09-25 12:34:23.000000000 -0400
@@ -173,7 +173,7 @@ out_unlock:
 out:
 	data->critical_sequence = max_sequence;
 	data->preempt_timestamp = ftrace_now(cpu);
-	tracing_reset(data);
+	tracing_reset(tr, cpu);
 	trace_function(tr, data, CALLER_ADDR0, parent_ip, flags);
 }
 
@@ -203,7 +203,7 @@ start_critical_timing(unsigned long ip, 
 	data->critical_sequence = max_sequence;
 	data->preempt_timestamp = ftrace_now(cpu);
 	data->critical_start = parent_ip ? : ip;
-	tracing_reset(data);
+	tracing_reset(tr, cpu);
 
 	local_save_flags(flags);
 
@@ -234,7 +234,7 @@ stop_critical_timing(unsigned long ip, u
 
 	data = tr->data[cpu];
 
-	if (unlikely(!data) || unlikely(!head_page(data)) ||
+	if (unlikely(!data) ||
 	    !data->critical_start || atomic_read(&data->disabled))
 		return;
 
Index: linux-compile.git/kernel/trace/trace_mmiotrace.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_mmiotrace.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_mmiotrace.c	2008-09-25 12:34:23.000000000 -0400
@@ -27,7 +27,7 @@ static void mmio_reset_data(struct trace
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static void mmio_trace_init(struct trace_array *tr)
@@ -130,10 +130,14 @@ static unsigned long count_overruns(stru
 {
 	int cpu;
 	unsigned long cnt = 0;
+/* FIXME: */
+#if 0
 	for_each_online_cpu(cpu) {
 		cnt += iter->overrun[cpu];
 		iter->overrun[cpu] = 0;
 	}
+#endif
+	(void)cpu;
 	return cnt;
 }
 
@@ -176,7 +180,7 @@ static int mmio_print_rw(struct trace_it
 	struct trace_entry *entry = iter->ent;
 	struct mmiotrace_rw *rw	= &entry->mmiorw;
 	struct trace_seq *s	= &iter->seq;
-	unsigned long long t	= ns2usecs(entry->t);
+	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret = 1;
@@ -218,7 +222,7 @@ static int mmio_print_map(struct trace_i
 	struct trace_entry *entry = iter->ent;
 	struct mmiotrace_map *m	= &entry->mmiomap;
 	struct trace_seq *s	= &iter->seq;
-	unsigned long long t	= ns2usecs(entry->t);
+	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret = 1;
Index: linux-compile.git/kernel/trace/trace_sched_switch.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_sched_switch.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_sched_switch.c	2008-09-25 12:34:23.000000000 -0400
@@ -133,7 +133,7 @@ static void sched_switch_reset(struct tr
 	tr->time_start = ftrace_now(tr->cpu);
 
 	for_each_online_cpu(cpu)
-		tracing_reset(tr->data[cpu]);
+		tracing_reset(tr, cpu);
 }
 
 static int tracing_sched_register(void)
Index: linux-compile.git/kernel/trace/trace_sched_wakeup.c
===================================================================
--- linux-compile.git.orig/kernel/trace/trace_sched_wakeup.c	2008-09-25 12:34:11.000000000 -0400
+++ linux-compile.git/kernel/trace/trace_sched_wakeup.c	2008-09-25 12:34:23.000000000 -0400
@@ -216,7 +216,7 @@ static void __wakeup_reset(struct trace_
 
 	for_each_possible_cpu(cpu) {
 		data = tr->data[cpu];
-		tracing_reset(data);
+		tracing_reset(tr, cpu);
 	}
 
 	wakeup_cpu = -1;

-- 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [RFC PATCH v4] Unified trace buffer
  2008-09-25 18:51 ` [RFC PATCH 1/2 " Steven Rostedt
@ 2008-09-26  1:02   ` Steven Rostedt
  2008-09-26  1:52     ` Masami Hiramatsu
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26  1:02 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Steven Rostedt


This version has been cleaned up a bit. I've been running it as
a back end to ftrace, and it has been holding up pretty well.

I did not implement the GTOD sync part and will leave that for later.
But this is the basic design that I like and will be the basis
of my future work.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  178 ++++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1252 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1435 insertions(+)

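As a quick orientation before the diff: ring_buffer.h below declares the
writer and reader interfaces. Here is a minimal writer-side sketch of how a
tracer is expected to use the reserve/commit pair (the payload struct and the
function are made up for illustration; only the ring_buffer_* calls come from
this patch):

	struct sample_entry {		/* hypothetical payload */
		int	pid;
		u64	value;
	};

	static void record_sample(struct ring_buffer *buffer, int pid, u64 value)
	{
		struct ring_buffer_event *event;
		struct sample_entry *entry;
		unsigned long flags;

		/* NULL means recording is disabled or there is no room */
		event = ring_buffer_lock_reserve(buffer, sizeof(*entry), &flags);
		if (!event)
			return;

		/* the body to fill in lives behind the event header */
		entry = ring_buffer_event_data(event);
		entry->pid	= pid;
		entry->value	= value;

		/* releases the per CPU lock and restores interrupts */
		ring_buffer_unlock_commit(buffer, event, flags);
	}

The reserve/commit pair runs with the per CPU spinlock held and interrupts
off, so the fill-in between the two calls has to stay short and atomic.
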
Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-25 20:36:12.000000000 -0400
@@ -0,0 +1,178 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+} __attribute__((__packed__));
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 * how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
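+/*
+ * Data payloads up to 28 bytes encode their length in the 3 bit len
+ * field (in 4 byte units); larger payloads store it in array[0].
+ */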
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
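
On the read side there are two paths: a consuming read via
ring_buffer_consume(), and a non consuming iterator via
ring_buffer_read_start()/ring_buffer_read()/ring_buffer_read_finish(),
which is what the __tracing_open() conversion earlier in the thread uses.
A rough sketch of the consuming variant, reusing the hypothetical
sample_entry from the example above:

	static void drain_cpu(struct ring_buffer *buffer, int cpu)
	{
		struct ring_buffer_event *event;
		struct sample_entry *entry;
		u64 ts;

		/* each call returns the oldest event on this CPU and consumes it */
		while ((event = ring_buffer_consume(buffer, cpu, &ts))) {
			entry = ring_buffer_event_data(event);
			printk(KERN_INFO "cpu%d %llu: pid=%d val=%llu\n",
			       cpu, (unsigned long long)ts,
			       entry->pid, (unsigned long long)entry->value);
		}
	}

The timestamp handed back through ts has already been run through
ring_buffer_normalize_time_stamp() on the read side.
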
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-25 20:35:44.000000000 -0400
@@ -0,0 +1,1252 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	/* mult -1 to test normalize */
+	return sched_clock() * -1;
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	*ts *= -1;
+}
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	~TS_MASK
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
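+/*
+ * Each buffer page starts with a full time stamp for its first event;
+ * the reader resets its running stamp from it and then accumulates the
+ * per event deltas on top.
+ */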
+struct buffer_page {
+	u64		time_stamp;
+	unsigned char	body[];
+};
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct buffer_page	**pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	unsigned long		head_page;
+	unsigned long		tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	atomic_t		record_disabled;
+
+	spinlock_t		lock;
+
+	/* FIXME: this should be online CPUS */
+	struct ring_buffer_per_cpu *buffers[NR_CPUS];
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	unsigned long			head_page;
+	u64				read_stamp;
+};
+
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int pages = buffer->pages;
+	int i;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+
+	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
+					       cache_line_size()), GFP_KERNEL,
+					 cpu_to_node(cpu));
+	if (!cpu_buffer->pages)
+		goto fail_free_buffer;
+
+	for (i = 0; i < pages; i++) {
+		cpu_buffer->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
+		if (!cpu_buffer->pages[i])
+			goto fail_free_pages;
+	}
+
+	return cpu_buffer;
+
+ fail_free_pages:
+	for (i = 0; i < pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	int i;
+
+	for (i = 0; i < cpu_buffer->buffer->pages; i++) {
+		if (cpu_buffer->pages[i])
+			free_page((unsigned long)cpu_buffer->pages[i]);
+	}
+	kfree(cpu_buffer->pages);
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = (size + (PAGE_SIZE - 1)) / PAGE_SIZE;
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	/* FIXME: do for only online CPUS */
+	buffer->cpus = num_possible_cpus();
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	spin_lock_init(&buffer->lock);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * Returns -1 on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	/* FIXME: */
+	return -1;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_body(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned long page, unsigned index)
+{
+	return cpu_buffer->pages[page]->body + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_body(cpu_buffer, cpu_buffer->head_page,
+			    cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	return rb_page_body(cpu_buffer, iter->head_page,
+			    iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_body(cpu_buffer, cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer *buffer,
+		     unsigned long *page)
+{
+	(*page)++;
+	if (*page >= buffer->pages)
+		*page = 0;
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	struct buffer_page *bpage;
+
+	bpage = cpu_buffer->pages[cpu_buffer->tail_page];
+	bpage->time_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *bpage;
+
+	cpu_buffer->head = 0;
+	bpage = cpu_buffer->pages[cpu_buffer->head_page];
+	cpu_buffer->read_stamp = bpage->time_stamp;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+	struct buffer_page *bpage;
+
+	iter->head = 0;
+	bpage = cpu_buffer->pages[iter->head_page];
+	iter->read_stamp = bpage->time_stamp;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusion */
+	if (!length)
+		length = 1;
+
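+	/* large payloads need array[0] to carry their length */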
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	unsigned long head_page, tail_page, tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(head_page >= buffer->pages);
+
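+	/*
+	 * Events do not straddle pages.  If this event does not fit on
+	 * the current page, pad out the rest of the page and move the
+	 * write position to the next page (pushing the head page
+	 * forward first when in overwrite mode and the tail catches up).
+	 */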
+	if (tail + length > BUF_PAGE_SIZE) {
+		unsigned long next_page = tail_page;
+
+		ring_buffer_inc_page(buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_body(cpu_buffer, tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail_page >= buffer->pages);
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_body(cpu_buffer, tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+	cpu_buffer->entries++;
+
+	return event;
+}
+
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
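+	/*
+	 * The event header only holds a 27 bit time delta.  If the gap
+	 * since the last write does not fit, a time extent event is
+	 * emitted first to carry the upper bits.
+	 */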
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (!cpu_buffer->tail) {
+				/*
+				 * new page, don't commit this and add the
+				 * time stamp to the page instead.
+				 */
+				rb_add_stamp(cpu_buffer, &ts);
+			} else {
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+			}
+
+			cpu_buffer->write_stamp = ts;
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+	cpu_buffer->write_stamp = ts;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	cpu_buffer->tail += ring_buffer_event_length(event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = 0;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+	cpu_buffer->tail += event_length;
+
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the given per CPU buffer. Any attempt
+ * to write to that buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		/* Internal data, OK to advance */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	u64 delta;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		/* Internal data, OK to advance */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ * @cpu: The per CPU buffer to get the next event from
+ * @ts: Where to store the event's timestamp (may be NULL)
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * That is, sequential reads will keep returning different events, and will
+ * eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page = cpu_buffer->tail_page = 0;
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all CPU buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ * @cpu: The CPU buffer to swap
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-25 18:26:10.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-25 18:30:51.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-25 18:26:10.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-25 18:29:07.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  1:02   ` [RFC PATCH v4] " Steven Rostedt
@ 2008-09-26  1:52     ` Masami Hiramatsu
  2008-09-26  2:11       ` Steven Rostedt
  2008-09-26 17:11       ` [PATCH v5] " Steven Rostedt
  0 siblings, 2 replies; 102+ messages in thread
From: Masami Hiramatsu @ 2008-09-26  1:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

Hi Steven,

Steven Rostedt wrote:
> This version has been cleaned up a bit. I've been running it as
> a back end to ftrace, and it has been handling pretty well.

Thank you for your great work.
It seems good to me (especially encapsulating events :)).

I have one enhancement request.

 > +static struct ring_buffer_per_cpu *
 > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
 > +{
[...]
 > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
 > +					       cache_line_size()), GFP_KERNEL,
 > +					 cpu_to_node(cpu));

Here, you are using a slab object for the page managing array;
the largest object size is 128KB (x86-64), so it can cover at most
16K pages = 64MB.
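(With 8 byte pointers that is 128KB / 8 = 16K page pointers, and
16K * 4KB pages = 64MB per cpu buffer.)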

As when I improved relayfs, in some rare cases (on a 64bit arch)
we'd like to use a larger buffer than 64MB.

http://sourceware.org/ml/systemtap/2008-q2/msg00103.html

So, I think similar hack can be applicable.

Would it be acceptable for the next version?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  1:52     ` Masami Hiramatsu
@ 2008-09-26  2:11       ` Steven Rostedt
  2008-09-26  2:47         ` Masami Hiramatsu
  2008-09-26  3:20         ` Mathieu Desnoyers
  2008-09-26 17:11       ` [PATCH v5] " Steven Rostedt
  1 sibling, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26  2:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt


On Thu, 25 Sep 2008, Masami Hiramatsu wrote:

> Hi Steven,
> 
> Steven Rostedt wrote:
> > This version has been cleaned up a bit. I've been running it as
> > a back end to ftrace, and it has been handling pretty well.
> 
> Thank you for your great work.
> It seems good to me(especially, encapsulating events :)).

Thanks!

> 
> I have one request of enhancement.
> 
> > +static struct ring_buffer_per_cpu *
> > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
> > +{
> [...]
> > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
> > +					       cache_line_size()), GFP_KERNEL,
> > +					 cpu_to_node(cpu));
> 
> Here, you are using a slab object for page managing array,
> the largest object size is 128KB(x86-64), so it can contain
> 16K pages = 64MB.
> 
> As I had improved relayfs, in some rare case(on 64bit arch),
> we'd like to use larger buffer than 64MB.
> 
> http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
> 
> So, I think similar hack can be applicable.
> 
> Would it be acceptable for the next version?

I would like to avoid using vmalloc as much as possible, but I do see the 
limitation here. Here's my compromise.

Instead of using vmalloc if the page array is greater than one page, 
how about using vmalloc if the page array is greater than 
KMALLOC_MAX_SIZE?

This would let us keep the vmap area free unless we have no choice.
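
Roughly, that compromise would look something like the sketch below.
It is only an illustration, not part of the patch: the helper names are
made up, and it assumes <linux/slab.h>, <linux/vmalloc.h> and
<linux/mm.h>.

static void *rb_alloc_page_array(unsigned long nr_pages, int cpu)
{
	unsigned long bytes = ALIGN(sizeof(void *) * nr_pages,
				    cache_line_size());
	void *pages;

	/* common case: the pointer array still fits in a slab object */
	if (bytes <= KMALLOC_MAX_SIZE)
		return kzalloc_node(bytes, GFP_KERNEL, cpu_to_node(cpu));

	/* only touch the vmap area when kmalloc cannot cover it */
	pages = vmalloc_node(bytes, cpu_to_node(cpu));
	if (pages)
		memset(pages, 0, bytes);
	return pages;
}

static void rb_free_page_array(void *pages)
{
	if (is_vmalloc_addr(pages))
		vfree(pages);
	else
		kfree(pages);
}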

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  2:11       ` Steven Rostedt
@ 2008-09-26  2:47         ` Masami Hiramatsu
  2008-09-26  3:20         ` Mathieu Desnoyers
  1 sibling, 0 replies; 102+ messages in thread
From: Masami Hiramatsu @ 2008-09-26  2:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

Steven Rostedt wrote:
> On Thu, 25 Sep 2008, Masami Hiramatsu wrote:
> 
>> Hi Steven,
>>
>> Steven Rostedt wrote:
>>> This version has been cleaned up a bit. I've been running it as
>>> a back end to ftrace, and it has been handling pretty well.
>> Thank you for your great work.
>> It seems good to me(especially, encapsulating events :)).
> 
> Thanks!
> 
>> I have one request of enhancement.
>>
>>> +static struct ring_buffer_per_cpu *
>>> +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
>>> +{
>> [...]
>>> +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
>>> +					       cache_line_size()), GFP_KERNEL,
>>> +					 cpu_to_node(cpu));
>> Here, you are using a slab object for page managing array,
>> the largest object size is 128KB(x86-64), so it can contain
>> 16K pages = 64MB.
>>
>> As I had improved relayfs, in some rare case(on 64bit arch),
>> we'd like to use larger buffer than 64MB.
>>
>> http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
>>
>> So, I think similar hack can be applicable.
>>
>> Would it be acceptable for the next version?
> 
> I would like to avoid using vmalloc as much as possible, but I do see the 
> limitation here. Here's my compromise.
> 
> Instead of using vmalloc if the page array is greater than one page, 
> how about using vmalloc if the page array is greater than 
> KMALLOC_MAX_SIZE?
> 
> This would let us keep the vmap area free unless we have no choice.

Hmm, that's a good idea.
In most cases, per-cpu buffer may be less than 64MB,
so I think it is reasonable.

Thank you,

> 
> -- Steve
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  2:11       ` Steven Rostedt
  2008-09-26  2:47         ` Masami Hiramatsu
@ 2008-09-26  3:20         ` Mathieu Desnoyers
  2008-09-26  7:18           ` Peter Zijlstra
  1 sibling, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-09-26  3:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Thu, 25 Sep 2008, Masami Hiramatsu wrote:
> 
> > Hi Steven,
> > 
> > Steven Rostedt wrote:
> > > This version has been cleaned up a bit. I've been running it as
> > > a back end to ftrace, and it has been handling pretty well.
> > 
> > Thank you for your great work.
> > It seems good to me(especially, encapsulating events :)).
> 
> Thanks!
> 
> > 
> > I have one request of enhancement.
> > 
> > > +static struct ring_buffer_per_cpu *
> > > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
> > > +{
> > [...]
> > > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
> > > +					       cache_line_size()), GFP_KERNEL,
> > > +					 cpu_to_node(cpu));
> > 
> > Here, you are using a slab object for page managing array,
> > the largest object size is 128KB(x86-64), so it can contain
> > 16K pages = 64MB.
> > 
> > As I had improved relayfs, in some rare case(on 64bit arch),
> > we'd like to use larger buffer than 64MB.
> > 
> > http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
> > 
> > So, I think similar hack can be applicable.
> > 
> > Would it be acceptable for the next version?
> 
> I would like to avoid using vmalloc as much as possible, but I do see the 
> limitation here. Here's my compromise.
> 
> Instead of using vmalloc if the page array is greater than one page, 
> how about using vmalloc if the page array is greater than 
> KMALLOC_MAX_SIZE?
> 
> This would let us keep the vmap area free unless we have no choice.
> 
> -- Steve
> 

You could also fall back on a 2-level page array when the buffer size is >
64MB. The cost is mainly a supplementary pointer dereference, but one
more should not make that big a difference overall.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  3:20         ` Mathieu Desnoyers
@ 2008-09-26  7:18           ` Peter Zijlstra
  2008-09-26 10:45             ` Steven Rostedt
                               ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26  7:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

On Thu, 2008-09-25 at 23:20 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Thu, 25 Sep 2008, Masami Hiramatsu wrote:
> > 
> > > Hi Steven,
> > > 
> > > Steven Rostedt wrote:
> > > > This version has been cleaned up a bit. I've been running it as
> > > > a back end to ftrace, and it has been handling pretty well.
> > > 
> > > Thank you for your great work.
> > > It seems good to me(especially, encapsulating events :)).
> > 
> > Thanks!
> > 
> > > 
> > > I have one request of enhancement.
> > > 
> > > > +static struct ring_buffer_per_cpu *
> > > > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
> > > > +{
> > > [...]
> > > > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
> > > > +					       cache_line_size()), GFP_KERNEL,
> > > > +					 cpu_to_node(cpu));
> > > 
> > > Here, you are using a slab object for page managing array,
> > > the largest object size is 128KB(x86-64), so it can contain
> > > 16K pages = 64MB.
> > > 
> > > As I had improved relayfs, in some rare case(on 64bit arch),
> > > we'd like to use larger buffer than 64MB.
> > > 
> > > http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
> > > 
> > > So, I think similar hack can be applicable.
> > > 
> > > Would it be acceptable for the next version?
> > 
> > I would like to avoid using vmalloc as much as possible, but I do see the 
> > limitation here. Here's my compromise.
> > 
> > Instead of using vmalloc if the page array is greater than one page, 
> > how about using vmalloc if the page array is greater than 
> > KMALLOC_MAX_SIZE?
> > 
> > This would let us keep the vmap area free unless we have no choice.
> > 
> > -- Steve
> > 
> 
> You could also fallback on a 2-level page array when buffer size is >
> 64MB. The cost is mainly a supplementary pointer dereference, but one
> more should not make sure a big difference overall.

I'm still not sure why we don't just link the pages using the page
frames, we don't need the random access, do we?


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  7:18           ` Peter Zijlstra
@ 2008-09-26 10:45             ` Steven Rostedt
  2008-09-26 11:00               ` Peter Zijlstra
  2008-09-26 10:47             ` Steven Rostedt
  2008-09-26 16:04             ` Mathieu Desnoyers
  2 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 10:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt


On Fri, 26 Sep 2008, Peter Zijlstra wrote:
> On Thu, 2008-09-25 at 23:20 -0400, Mathieu Desnoyers wrote:
> > 
> > You could also fallback on a 2-level page array when buffer size is >
> > 64MB. The cost is mainly a supplementary pointer dereference, but one
> > more should not make sure a big difference overall.
> 
> I'm still not sure why we don't just link the pages using the page
> frames, we don't need the random access, do we?

Yeah we can go back to that (as ftrace does).

1) It can be very error prone. I will need to encapsulate the logic more.

2) I'm still not sure if crash can handle it.


I was going to reply to Masami with this answer, but it makes things more 
complex.  For v1 (non RFC v1) I wanted to start simple. v2 can have this 
enhancement.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  7:18           ` Peter Zijlstra
  2008-09-26 10:45             ` Steven Rostedt
@ 2008-09-26 10:47             ` Steven Rostedt
  2008-09-26 16:04             ` Mathieu Desnoyers
  2 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 10:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt


On Fri, 26 Sep 2008, Peter Zijlstra wrote:
> > 
> > You could also fallback on a 2-level page array when buffer size is >
> > 64MB. The cost is mainly a supplementary pointer dereference, but one
> > more should not make sure a big difference overall.
> 
> I'm still not sure why we don't just link the pages using the page
> frames, we don't need the random access, do we?


Hmm, but this does make changing the buffer size much easier. I'll think
about it and perhaps try it out. If I can tidy it up more nicely than the
ftrace code, then I may include it for v1.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26 10:45             ` Steven Rostedt
@ 2008-09-26 11:00               ` Peter Zijlstra
  2008-09-26 16:57                 ` Masami Hiramatsu
  0 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26 11:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

On Fri, 2008-09-26 at 06:45 -0400, Steven Rostedt wrote:
> On Fri, 26 Sep 2008, Peter Zijlstra wrote:
> > On Thu, 2008-09-25 at 23:20 -0400, Mathieu Desnoyers wrote:
> > > 
> > > You could also fallback on a 2-level page array when buffer size is >
> > > 64MB. The cost is mainly a supplementary pointer dereference, but one
> > > more should not make sure a big difference overall.
> > 
> > I'm still not sure why we don't just link the pages using the page
> > frames, we don't need the random access, do we?
> 
> Yeah we can go back to that (as ftrace does).
> 
> 1) It can be very error prone. I will need to encapsulate the logic more.

Sure.

> 2) I'm still not sure if crash can handle it.

It ought to, and if it can't it should be fixed. Having easy access to
the pageframes is vital to debugging VM issues. So I'd not bother about
this issue too much.

> I was going to reply to Masami with this answer, but it makes things more 
> complex.  For v1 (non RFC v1) I wanted to start simple. v2 can have this 
> enhancement.

Right - I just object to having anything vmalloc.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26  7:18           ` Peter Zijlstra
  2008-09-26 10:45             ` Steven Rostedt
  2008-09-26 10:47             ` Steven Rostedt
@ 2008-09-26 16:04             ` Mathieu Desnoyers
  2 siblings, 0 replies; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-09-26 16:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

* Peter Zijlstra (peterz@infradead.org) wrote:
> On Thu, 2008-09-25 at 23:20 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > 
> > > On Thu, 25 Sep 2008, Masami Hiramatsu wrote:
> > > 
> > > > Hi Steven,
> > > > 
> > > > Steven Rostedt wrote:
> > > > > This version has been cleaned up a bit. I've been running it as
> > > > > a back end to ftrace, and it has been handling pretty well.
> > > > 
> > > > Thank you for your great work.
> > > > It seems good to me(especially, encapsulating events :)).
> > > 
> > > Thanks!
> > > 
> > > > 
> > > > I have one request of enhancement.
> > > > 
> > > > > +static struct ring_buffer_per_cpu *
> > > > > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
> > > > > +{
> > > > [...]
> > > > > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
> > > > > +					       cache_line_size()), GFP_KERNEL,
> > > > > +					 cpu_to_node(cpu));
> > > > 
> > > > Here, you are using a slab object for page managing array,
> > > > the largest object size is 128KB(x86-64), so it can contain
> > > > 16K pages = 64MB.
> > > > 
> > > > As I had improved relayfs, in some rare case(on 64bit arch),
> > > > we'd like to use larger buffer than 64MB.
> > > > 
> > > > http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
> > > > 
> > > > So, I think similar hack can be applicable.
> > > > 
> > > > Would it be acceptable for the next version?
> > > 
> > > I would like to avoid using vmalloc as much as possible, but I do see the 
> > > limitation here. Here's my compromise.
> > > 
> > > Instead of using vmalloc if the page array is greater than one page, 
> > > how about using vmalloc if the page array is greater than 
> > > KMALLOC_MAX_SIZE?
> > > 
> > > This would let us keep the vmap area free unless we have no choice.
> > > 
> > > -- Steve
> > > 
> > 
> > You could also fallback on a 2-level page array when buffer size is >
> > 64MB. The cost is mainly a supplementary pointer dereference, but one
> > more should not make sure a big difference overall.
> 
> I'm still not sure why we don't just link the pages using the page
> frames, we don't need the random access, do we?
> 

Yes, that's a brilliant idea :)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26 11:00               ` Peter Zijlstra
@ 2008-09-26 16:57                 ` Masami Hiramatsu
  2008-09-26 17:14                   ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Masami Hiramatsu @ 2008-09-26 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt

Peter Zijlstra wrote:
> On Fri, 2008-09-26 at 06:45 -0400, Steven Rostedt wrote:
>> On Fri, 26 Sep 2008, Peter Zijlstra wrote:
>>> On Thu, 2008-09-25 at 23:20 -0400, Mathieu Desnoyers wrote:
>>>> You could also fallback on a 2-level page array when buffer size is >
>>>> 64MB. The cost is mainly a supplementary pointer dereference, but one
>>>> more should not make sure a big difference overall.
>>> I'm still not sure why we don't just link the pages using the page
>>> frames, we don't need the random access, do we?
>> Yeah we can go back to that (as ftrace does).
>>
>> 1) It can be very error prone. I will need to encapsulate the logic more.
> 
> Sure.
> 
>> 2) I'm still not sure if crash can handle it.
> 
> It ought to, and if it can't it should be fixed. Having easy access to
> the pageframes is vital to debugging VM issues. So I'd not bother about
> this issue too much.
> 
>> I was going to reply to Masami with this answer, but it makes things more 
>> complex.  For v1 (non RFC v1) I wanted to start simple. v2 can have this 
>> enhancement.
> 
> Right - I just object to having anything vmalloc.

I just requested the expansion of the buffer size limitation too. :)

I'm not attached to vmalloc. If that (page frame chain?) can
achieve better performance, I agree the trace buffer should use it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v5] Unified trace buffer
  2008-09-26  1:52     ` Masami Hiramatsu
  2008-09-26  2:11       ` Steven Rostedt
@ 2008-09-26 17:11       ` Steven Rostedt
  2008-09-26 17:31         ` Arnaldo Carvalho de Melo
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
  1 sibling, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 17:11 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt


[
  Note the removal of the RFC in the subject.
  I am happy with this version. It handles everything I need
  for ftrace.

  New since last version:

   - Fixed timing bug. I did not add the deltas properly when
     reading the buffer.

   - Removed "-1" time stamp normalize test. This made the
     clock go backwards!

   - Removed page pointer array and replaced it with the ftrace
     page struct link list trick. Since this is my second time
     writing this code (first with ftrace), it is actually much
     cleaner than the ftrace code.

   - Implemented buffer resizing. By using the page link list trick,
     this became much simpler.

   Note, the GTOD part is still not implemented, but can be done
   later without affecting this interface.

]
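
For reference, the "page struct link list trick" mentioned in the note
above just means the buffer pages are chained through the lru member of
their struct page, so no separate pointer array is needed. A stripped
down sketch (see rb_allocate_pages() and ring_buffer_inc_page() in the
patch below for the real code):

	struct list_head *p;
	struct page *page;
	unsigned long addr;

	/* allocation: hang each new page off the per cpu list head */
	addr = __get_free_page(GFP_KERNEL);
	page = virt_to_page(addr);
	list_add(&page->lru, &cpu_buffer->pages);

	/* the next page is simply the next list entry, skipping the
	 * list head itself to wrap around the ring */
	p = page->lru.next;
	if (p == &cpu_buffer->pages)
		p = p->next;
	page = list_entry(p, struct page, lru);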

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all for internal use by the ring buffer; only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes: a 4 byte header, a 4 byte length
word, and 84 bytes of payload space (82 bytes of data plus 2 bytes empty
for the 4 byte alignment).

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data payload.

ring_buffer_event_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, whereas consumer/producer will
	throw away new data if the producer catches up with the
	consumer.  Consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away, but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables the cpu buffer, until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries overwritten because the writer wrapped.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.
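
For illustration, a minimal write plus consuming read with this
interface would look roughly like the code below. This is only a sketch
and not part of the patch: the "my_entry" record layout and the
example_* names are made up, error handling is trimmed, and it assumes
<linux/ring_buffer.h> from this patch is included.

struct my_entry {
	unsigned long ip;
	unsigned long parent_ip;
};

static void example_write(struct ring_buffer *buffer,
			  unsigned long ip, unsigned long parent_ip)
{
	struct ring_buffer_event *event;
	struct my_entry *entry;
	unsigned long flags;

	event = ring_buffer_lock_reserve(buffer, sizeof(*entry), &flags);
	if (!event)
		return;	/* recording disabled, or full in non-overwrite mode */

	entry = ring_buffer_event_data(event);
	entry->ip = ip;
	entry->parent_ip = parent_ip;

	ring_buffer_unlock_commit(buffer, event, flags);
}

static void example_consume(struct ring_buffer *buffer, int cpu)
{
	struct ring_buffer_event *event;
	struct my_entry *entry;
	u64 ts;

	/* each call advances the head, so this drains the cpu buffer */
	while ((event = ring_buffer_consume(buffer, cpu, &ts))) {
		entry = ring_buffer_event_data(event);
		printk(KERN_INFO "ip=%lx parent=%lx ts=%llu\n",
		       entry->ip, entry->parent_ip,
		       (unsigned long long)ts);
	}
}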

I still need to implement the GTOD feature, but we need support from
the cpu frequency infrastructure. This can be done at a later
time without affecting the ring buffer interface.
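
A non consuming dump of one cpu buffer with the iterator interface
would look something like this (again only a sketch, not part of the
patch; example_dump is a made up name):

static void example_dump(struct ring_buffer *buffer, int cpu)
{
	struct ring_buffer_iter *iter;
	struct ring_buffer_event *event;
	u64 ts;

	/* recording on this cpu buffer is disabled until read_finish */
	iter = ring_buffer_read_start(buffer, cpu);
	if (!iter)
		return;

	/* ring_buffer_read() returns the event and advances the iterator */
	while ((event = ring_buffer_read(iter, &ts)))
		printk(KERN_INFO "len=%u ts=%llu\n",
		       ring_buffer_event_length(event),
		       (unsigned long long)ts);

	ring_buffer_read_finish(iter);
}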

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  178 +++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1491 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1674 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-25 21:29:16.000000000 -0400
@@ -0,0 +1,178 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+} __attribute__((__packed__));
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable, it is the rest of the page
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-26 12:13:02.000000000 -0400
@@ -0,0 +1,1491 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	return sched_clock();
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+}
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	~TS_MASK
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+struct buffer_page {
+	u64		time_stamp;
+	unsigned char	body[];
+};
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct list_head	pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	struct page		*head_page;
+	struct page		*tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	atomic_t		record_disabled;
+
+	struct mutex		mutex;
+
+	/* FIXME: this should be online CPUS */
+	struct ring_buffer_per_cpu *buffers[NR_CPUS];
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct page			*head_page;
+	u64				read_stamp;
+};
+
+#define CHECK_COND(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	CHECK_COND(cpu_buffer, head->next->prev != head);
+	CHECK_COND(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		CHECK_COND(cpu_buffer, page->lru.next->prev != &page->lru);
+		CHECK_COND(cpu_buffer, page->lru.prev->next != &page->lru);
+	}
+
+	return 0;
+}
+
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	LIST_HEAD(pages);
+	struct page *page, *tmp;
+	unsigned long addr;
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = virt_to_page(addr);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = (size + (BUF_PAGE_SIZE - 1)) / BUF_PAGE_SIZE;
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	/* FIXME: do for only online CPUS */
+	buffer->cpus = num_possible_cpus();
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &cpu_buffer->pages);
+	}
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns the new size on success, -ENOMEM on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long buffer_size;
+	LIST_HEAD(pages);
+	unsigned long addr;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct page *page, *tmp;
+	int i, cpu;
+
+	size = (size + (BUF_PAGE_SIZE-1)) / BUF_PAGE_SIZE;
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = (size + (BUF_PAGE_SIZE-1)) / BUF_PAGE_SIZE;
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for (cpu = 0; cpu < buffer->cpus; cpu++) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * by allocating all the pages and storing them on a local
+	 * link list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM;
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = virt_to_page(addr);
+			list_add(&page->lru, &pages);
+		}
+	}
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_index(struct page *page, unsigned index)
+{
+	struct buffer_page *bpage;
+
+	bpage = page_address(page);
+	return bpage->body + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_index(cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct page **page)
+{
+	struct list_head *p = (*page)->lru.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct page, lru);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	struct buffer_page *bpage;
+
+	bpage = page_address(cpu_buffer->tail_page);
+	bpage->time_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *bpage;
+
+	cpu_buffer->head = 0;
+	bpage = page_address(cpu_buffer->head_page);
+	cpu_buffer->read_stamp = bpage->time_stamp;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	struct buffer_page *bpage;
+
+	iter->head = 0;
+	bpage = page_address(iter->head_page);
+	iter->read_stamp = bpage->time_stamp;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	struct page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct page *next_page = tail_page;
+
+		ring_buffer_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+	cpu_buffer->entries++;
+
+	return event;
+}
+
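+/*
+ * Compute the delta from the previous write. If it does not fit in the
+ * 27 bit time_delta field, emit a time extent event first (or, when a
+ * new page was just started, fold the full time stamp into the page
+ * header) and then reserve the caller's event.
+ */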
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (!cpu_buffer->tail) {
+				/*
+				 * new page, don't commit this and add the
+				 * time stamp to the page instead.
+				 */
+				rb_add_stamp(cpu_buffer, &ts);
+			} else {
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+			}
+
+			cpu_buffer->write_stamp = ts;
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+	cpu_buffer->write_stamp = ts;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	cpu_buffer->tail += ring_buffer_event_length(event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = 0;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+	cpu_buffer->tail += event_length;
+
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning, that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all the per CPU buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-25 21:28:29.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-25 21:29:16.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-25 21:28:29.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-25 21:29:16.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH v4] Unified trace buffer
  2008-09-26 16:57                 ` Masami Hiramatsu
@ 2008-09-26 17:14                   ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 17:14 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt


On Fri, 26 Sep 2008, Masami Hiramatsu wrote:
> Peter Zijlstra wrote:
> >> I was going to reply to Masami with this answer, but it makes things more 
> >> complex.  For v1 (non RFC v1) I wanted to start simple. v2 can have this 
> >> enhancement.
> > 
> > Right - I just object to having anything vmalloc.
> 
> I just requested that the expansion of buffer size limitation too. :)
> 
> I don't stick with vmalloc. If that (page frame chain?) can
> achieve better performance, I agree that trace buffer uses it.
> 

v5 is out with this implementation. It may or may not perform better,
but the difference is most likely negligible.

Anyway, I'm happy with this last release, and hopefully it can get into
2.6.28.  This would mean I can start basing ftrace on top of it.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5] Unified trace buffer
  2008-09-26 17:11       ` [PATCH v5] " Steven Rostedt
@ 2008-09-26 17:31         ` Arnaldo Carvalho de Melo
  2008-09-26 17:37           ` Linus Torvalds
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
  1 sibling, 1 reply; 102+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-09-26 17:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Linus Torvalds,
	Mathieu Desnoyers, Frank Ch. Eigler, David Wilder, hch,
	Martin Bligh, Christoph Hellwig, Steven Rostedt

On Fri, Sep 26, 2008 at 01:11:57PM -0400, Steven Rostedt wrote:
> 
> [
>   Note the removal of the RFC in the subject.
>   I am happy with this version. It handles everything I need
>   for ftrace.
> 
>   New since last version:
> 
>    - Fixed timing bug. I did not add the deltas properly when
>      reading the buffer.
> 
>    - Removed "-1" time stamp normalize test. This made the
>      clock go backwards!
> 
>    - Removed page pointer array and replaced it with the ftrace
>      page struct link list trick. Since this is my second time
>      writing this code (first with ftrace), it is actually much
>      cleaner than the ftrace code.
> 
>    - Implemented buffer resizing. By using the page link list trick,
>      this became much simpler.
> 
>    Note, the GOTD part is still not implemented, but can be done
>    later without affecting this interface.
> 
> ]
> 
> This is a unified tracing buffer that implements a ring buffer that
> hopefully everyone will eventually be able to use.
> 
> The events recorded into the buffer have the following structure:
> 
> struct ring_buffer_event {
> 	u32 type:2, len:3, time_delta:27;
> 	u32 array[];
> };
> 
> The minimum size of an event is 8 bytes. All events are 4 byte
> aligned inside the buffer.
> 
> There are 4 types (all internal use for the ring buffer, only
> the data type is exported to the interface users).
> 
> RB_TYPE_PADDING: this type is used to note extra space at the end
> 	of a buffer page.
> 
> RB_TYPE_TIME_EXTENT: This type is used when the time between events
> 	is greater than the 27 bit delta can hold. We add another
> 	32 bits, and record that in its own event (8 byte size).
> 
> RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
> 	help keep the buffer timestamps in sync.
> 
> RB_TYPE_DATA: The event actually holds user data.
> 
> The "len" field is only three bits. Since the data must be
> 4 byte aligned, this field is shifted left by 2, giving a
> max length of 28 bytes. If the data load is greater than 28
> bytes, the first array field holds the full length of the
> data load and the len field is set to zero.
> 
> Example, data size of 7 bytes:
> 
> 	type = RB_TYPE_DATA
> 	len = 2
> 	time_delta: <time-stamp> - <prev_event-time-stamp>
> 	array[0..1]: <7 bytes of data> <1 byte empty>
> 
> This event is saved in 12 bytes of the buffer.
> 
> An event with 82 bytes of data:
> 
> 	type = RB_TYPE_DATA
> 	len = 0
> 	time_delta: <time-stamp> - <prev_event-time-stamp>
> 	array[0]: 84 (Note the alignment)
> 	array[1..14]: <82 bytes of data> <2 bytes empty>
> 
> The above event is saved in 92 bytes (if my math is correct).
> 82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
> 
> Do not reference the above event struct directly. Use the following
> functions to gain access to the event table, since the
> ring_buffer_event structure may change in the future.
> 
> ring_buffer_event_length(event): get the length of the event.
> 	This is the size of the memory used to record this
> 	event, and not the size of the data pay load.
> 
> ring_buffer_time_delta(event): get the time delta of the event
> 	This returns the delta time stamp since the last event.
> 	Note: Even though this is in the header, there should
> 		be no reason to access this directly, accept
> 		for debugging.
> 
> ring_buffer_event_data(event): get the data from the event
> 	This is the function to use to get the actual data
> 	from the event. Note, it is only a pointer to the
> 	data inside the buffer. This data must be copied to
> 	another location otherwise you risk it being written
> 	over in the buffer.
> 
> ring_buffer_lock: A way to lock the entire buffer.
> ring_buffer_unlock: unlock the buffer.
> 
> ring_buffer_alloc: create a new ring buffer. Can choose between
> 	overwrite or consumer/producer mode. Overwrite will
> 	overwrite old data, where as consumer producer will
> 	throw away new data if the consumer catches up with the
> 	producer.  The consumer/producer is the default.
> 
> ring_buffer_free: free the ring buffer.
> 
> ring_buffer_resize: resize the buffer. Changes the size of each cpu
> 	buffer. Note, it is up to the caller to provide that
> 	the buffer is not being used while this is happening.
> 	This requirement may go away but do not count on it.
> 
> ring_buffer_lock_reserve: locks the ring buffer and allocates an
> 	entry on the buffer to write to.
> ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
> 	the buffer.
> 
> ring_buffer_write: writes some data into the ring buffer.
> 
> ring_buffer_peek: Look at a next item in the cpu buffer.
> ring_buffer_consume: get the next item in the cpu buffer and
> 	consume it. That is, this function increments the head
> 	pointer.
> 
> ring_buffer_read_start: Start an iterator of a cpu buffer.
> 	For now, this disables the cpu buffer, until you issue
> 	a finish. This is just because we do not want the iterator
> 	to be overwritten. This restriction may change in the future.
> 	But note, this is used for static reading of a buffer which
> 	is usually done "after" a trace. Live readings would want
> 	to use the ring_buffer_consume above, which will not
> 	disable the ring buffer.
> 
> ring_buffer_read_finish: Finishes the read iterator and reenables
> 	the ring buffer.
> 
> ring_buffer_iter_peek: Look at the next item in the cpu iterator.
> ring_buffer_read: Read the iterator and increment it.
> ring_buffer_iter_reset: Reset the iterator to point to the beginning
> 	of the cpu buffer.
> ring_buffer_iter_empty: Returns true if the iterator is at the end
> 	of the cpu buffer.
> 
> ring_buffer_size: returns the size in bytes of each cpu buffer.
> 	Note, the real size is this times the number of CPUs.
> 
> ring_buffer_reset_cpu: Sets the cpu buffer to empty
> ring_buffer_reset: sets all cpu buffers to empty
> 
> ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
> 	cpu buffer of another buffer. This is handy when you
> 	want to take a snap shot of a running trace on just one
> 	cpu. Having a backup buffer, to swap with facilitates this.
> 	Ftrace max latencies use this.
> 
> ring_buffer_empty: Returns true if the ring buffer is empty.
> ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.
> 
> ring_buffer_record_disable: disable all cpu buffers (read only)
> ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
> ring_buffer_record_enable: enable all cpu buffers.
> ring_buffer_record_enabl_cpu: enable a single cpu buffer.
> 
> ring_buffer_entries: The number of entries in a ring buffer.
> ring_buffer_overruns: The number of entries removed due to writing wrap.
> 
> ring_buffer_time_stamp: Get the time stamp used by the ring buffer
> ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
> 	into nanosecs.
> 
> I still need to implement the GTOD feature. But we need support from
> the cpu frequency infrastructure.  But this can be done at a later
> time without affecting the ring buffer interface.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  include/linux/ring_buffer.h |  178 +++++
>  kernel/trace/Kconfig        |    4 
>  kernel/trace/Makefile       |    1 
>  kernel/trace/ring_buffer.c  | 1491 ++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 1674 insertions(+)
> 
> Index: linux-trace.git/include/linux/ring_buffer.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-trace.git/include/linux/ring_buffer.h	2008-09-25 21:29:16.000000000 -0400
> @@ -0,0 +1,178 @@
> +#ifndef _LINUX_RING_BUFFER_H
> +#define _LINUX_RING_BUFFER_H
> +
> +#include <linux/mm.h>
> +#include <linux/seq_file.h>
> +
> +struct ring_buffer;
> +struct ring_buffer_iter;
> +
> +/*
> + * Don't reference this struct directly, use the inline items below.
> + */
> +struct ring_buffer_event {
> +	u32		type:2, len:3, time_delta:27;
> +	u32		array[];
> +} __attribute__((__packed__));

Why do you need __packed__ here? With or without it the layout is the
same:

[acme@doppio examples]$ pahole packed
struct ring_buffer_event {
	u32 type:2;               /* 0:30  4 */
	u32 len:3;                /* 0:27  4 */
	u32 time_delta:27;        /* 0: 0  4 */
	u32 array[0];             /* 4     0 */

	/* size: 4, cachelines: 1, members: 4 */
	/* last cacheline: 4 bytes */
};

- Arnaldo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5] Unified trace buffer
  2008-09-26 17:31         ` Arnaldo Carvalho de Melo
@ 2008-09-26 17:37           ` Linus Torvalds
  2008-09-26 17:46             ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2008-09-26 17:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Steven Rostedt, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Mathieu Desnoyers, Frank Ch. Eigler, David Wilder, hch,
	Martin Bligh, Christoph Hellwig, Steven Rostedt



On Fri, 26 Sep 2008, Arnaldo Carvalho de Melo wrote:
> 
> Why do you need __packed__ here? With or without it the layout is the
> same:

Indeed. And on some architectures 'packed' will actually mean that the 
compiler may think that it's unaligned, and then generate much worse code 
to access the fields. So if you align things anyway (and you do), then 
'packed' is the wrong thing to do.

		Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5] Unified trace buffer
  2008-09-26 17:37           ` Linus Torvalds
@ 2008-09-26 17:46             ` Steven Rostedt
  2008-09-27 17:02               ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 17:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, Masami Hiramatsu, LKML, Ingo Molnar,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Mathieu Desnoyers, Frank Ch. Eigler, David Wilder, hch,
	Martin Bligh, Christoph Hellwig, Steven Rostedt


On Fri, 26 Sep 2008, Linus Torvalds wrote:

> 
> 
> On Fri, 26 Sep 2008, Arnaldo Carvalho de Melo wrote:
> > 
> > Why do you need __packed__ here? With or without it the layout is the

From just being paranoid.

> > same:
> 
> Indeed. And on some architectures 'packed' will actually mean that the 
> compiler may think that it's unaligned, and then generate much worse code 
> to access the fields. So if you align things anyway (and you do), then 
> 'packed' is the wrong thing to do.

OK, I'm making v6 now with various cleanups. I'll nuke it on that one.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v6] Unified trace buffer
  2008-09-26 17:11       ` [PATCH v5] " Steven Rostedt
  2008-09-26 17:31         ` Arnaldo Carvalho de Melo
@ 2008-09-26 18:05         ` Steven Rostedt
  2008-09-26 18:30           ` Richard Holden
                             ` (6 more replies)
  1 sibling, 7 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 18:05 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


[
  Changes since v5:

  - removed packed attribute from event structure.

  - added parentheses around ~TS_MASK

  - fixed some comments in header

  - fixed ret value on ring_buffer_write on errors.

  - added check_pages when modifying the size of cpu buffers
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data payload.

ring_buffer_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, whereas consumer/producer will
	throw away new data if the producer catches up with the
	consumer.  Consumer/producer mode is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at the next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables the cpu buffer, until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer, to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to writing wrap.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.

I still need to implement the GTOD feature. But we need support from
the cpu frequency infrastructure.  But this can be done at a later
time without affecting the ring buffer interface.
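
For illustration only, here is a rough sketch of the reserve/commit and
consume paths (error handling trimmed, and the process() consumer is
made up):

	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	unsigned long flags;
	char data[] = "hello";
	u64 ts;

	buffer = ring_buffer_alloc(1 << 16, 0);	/* consumer/producer mode */

	/* writer side */
	event = ring_buffer_lock_reserve(buffer, sizeof(data), &flags);
	if (event) {
		memcpy(ring_buffer_event_data(event), data, sizeof(data));
		ring_buffer_unlock_commit(buffer, event, flags);
	}

	/* reader side, one cpu at a time; copy the data out before use */
	event = ring_buffer_consume(buffer, raw_smp_processor_id(), &ts);
	if (event)
		process(ring_buffer_event_data(event));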

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  179 +++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1496 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1680 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-26 13:44:33.000000000 -0400
@@ -0,0 +1,179 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+};
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 * how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4-1] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
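+/* largest payload that still fits in the 3 bit len field: 7 << 2 = 28 */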
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-26 13:53:52.000000000 -0400
@@ -0,0 +1,1496 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	return sched_clock();
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+}
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	(~TS_MASK)
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
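+/*
+ * Each buffer page starts with the time stamp of the first event on it;
+ * the remaining BUF_PAGE_SIZE bytes hold the events themselves.
+ */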
+struct buffer_page {
+	u64		time_stamp;
+	unsigned char	body[];
+};
+
+#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct list_head	pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	struct page		*head_page;
+	struct page		*tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
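+/*
+ * The top level descriptor: size and flags shared by all cpus, plus one
+ * ring_buffer_per_cpu for each possible cpu.
+ */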
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	atomic_t		record_disabled;
+
+	struct mutex		mutex;
+
+	/* FIXME: this should be online CPUS */
+	struct ring_buffer_per_cpu *buffers[NR_CPUS];
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct page			*head_page;
+	u64				read_stamp;
+};
+
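+/*
+ * If a sanity check fails, disable recording into the buffer and warn
+ * instead of crashing.
+ */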
+#define CHECK_COND(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	CHECK_COND(cpu_buffer, head->next->prev != head);
+	CHECK_COND(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		CHECK_COND(cpu_buffer, page->lru.next->prev != &page->lru);
+		CHECK_COND(cpu_buffer, page->lru.prev->next != &page->lru);
+	}
+
+	return 0;
+}
+
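+/*
+ * Allocate nr_pages buffer pages and splice them onto the cpu buffer's
+ * page list, freeing any partial allocation on failure.
+ */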
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	LIST_HEAD(pages);
+	struct page *page, *tmp;
+	unsigned long addr;
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = virt_to_page(addr);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
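+/*
+ * Allocate one per cpu descriptor on the cpu's own node, together with
+ * its buffer pages.
+ */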
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = (size + (BUF_PAGE_SIZE - 1)) / BUF_PAGE_SIZE;
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	/* FIXME: do for only online CPUS */
+	buffer->cpus = num_possible_cpus();
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_possible_cpu(cpu) {
+		if (cpu >= buffer->cpus)
+			continue;
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
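+/*
+ * Free nr_pages pages from the front of the cpu buffer's page list and
+ * reset the buffer state. Recording is disabled while the list changes.
+ */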
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
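+/*
+ * Move nr_pages pages from the caller supplied list onto the tail of the
+ * cpu buffer's page list. Recording is disabled while the list changes.
+ */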
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &cpu_buffer->pages);
+	}
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns -1 on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long buffer_size;
+	LIST_HEAD(pages);
+	unsigned long addr;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct page *page, *tmp;
+	int i, cpu;
+
+	size = (size + (BUF_PAGE_SIZE-1)) / BUF_PAGE_SIZE;
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = (size + (BUF_PAGE_SIZE-1)) / BUF_PAGE_SIZE;
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for (cpu = 0; cpu < buffer->cpus; cpu++) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * by allocating all the pages and storing them on a local
+	 * linked list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM.
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = virt_to_page(addr);
+			list_add(&page->lru, &pages);
+		}
+	}
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	mutex_unlock(&buffer->mutex);
+	return -ENOMEM;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_index(struct page *page, unsigned index)
+{
+	struct buffer_page *bpage;
+
+	bpage = page_address(page);
+	return bpage->body + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_index(cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct page **page)
+{
+	struct list_head *p = (*page)->lru.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct page, lru);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	struct buffer_page *bpage;
+
+	bpage = page_address(cpu_buffer->tail_page);
+	bpage->time_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct buffer_page *bpage;
+
+	cpu_buffer->head = 0;
+	bpage = page_address(cpu_buffer->head_page);
+	cpu_buffer->read_stamp = bpage->time_stamp;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	struct buffer_page *bpage;
+
+	iter->head = 0;
+	bpage = page_address(iter->head_page);
+	iter->read_stamp = bpage->time_stamp;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	struct page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct page *next_page = tail_page;
+
+		ring_buffer_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+	cpu_buffer->entries++;
+
+	return event;
+}
+
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (!cpu_buffer->tail) {
+				/*
+				 * new page, don't commit this and add the
+				 * time stamp to the page instead.
+				 */
+				rb_add_stamp(cpu_buffer, &ts);
+			} else {
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+			}
+
+			cpu_buffer->write_stamp = ts;
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+	cpu_buffer->write_stamp = ts;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	cpu_buffer->tail += ring_buffer_event_length(event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = -EBUSY;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+	cpu_buffer->tail += event_length;
+
+	ret = 0;
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning that sequential reads will keep returning different events,
+ * and will eventually empty the ring buffer if the producer is slower
+ * than the consumer.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct page, lru);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all the per CPU buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for (cpu = 0; cpu < buffer->cpus; cpu++)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for (cpu = 0; cpu < buffer->cpus; cpu++) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-25 21:28:29.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-25 21:29:16.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-25 21:28:29.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-25 21:29:16.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
@ 2008-09-26 18:30           ` Richard Holden
  2008-09-26 18:39             ` Steven Rostedt
  2008-09-26 18:59           ` Peter Zijlstra
                             ` (5 subsequent siblings)
  6 siblings, 1 reply; 102+ messages in thread
From: Richard Holden @ 2008-09-26 18:30 UTC (permalink / raw)
  To: Steven Rostedt, LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

On 9/26/08 12:05 PM, "Steven Rostedt" <rostedt@goodmis.org> wrote:

> ring_buffer_alloc: create a new ring buffer. Can choose between
> overwrite or consumer/producer mode. Overwrite will
> overwrite old data, where as consumer producer will
> throw away new data if the consumer catches up with the
> producer.  The consumer/producer is the default.

Forgive me if I've gotten this wrong, but the terminology seems backwards
here. I would think we only throw away new data if the producer catches up
with the consumer; if the consumer catches up with the producer, we're
reading data as fast as it's being written.

> 
> ring_buffer_write: writes some data into the ring buffer.
> 
> ring_buffer_peek: Look at a next item in the cpu buffer.
> ring_buffer_consume: get the next item in the cpu buffer and
> consume it. That is, this function increments the head
> pointer.

Here too, I would think that consuming data would modify the tail pointer.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>

Just trying to understand the terminology before I look at the code so I'm
sorry if I have just completely misunderstood.

-Richard Holden



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:30           ` Richard Holden
@ 2008-09-26 18:39             ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 18:39 UTC (permalink / raw)
  To: Richard Holden
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Richard Holden wrote:

> On 9/26/08 12:05 PM, "Steven Rostedt" <rostedt@goodmis.org> wrote:
> 
> > ring_buffer_alloc: create a new ring buffer. Can choose between
> > overwrite or consumer/producer mode. Overwrite will
> > overwrite old data, where as consumer producer will
> > throw away new data if the consumer catches up with the
> > producer.  The consumer/producer is the default.
> 
> Forgive me if I've gotten this wrong but the terminology seems backwards
> Here, I would think we only throw away new data if the producer catches up
> with the consumer, if the consumer catches up with the producer we're
> reading data as fast as it's being written.

Argh! Yes.  I'm the one that is backwards ;-)

Yeah, that is what I meant. Don't you know? You are supposed to understand 
what I mean, not what I say :)

> 
> > 
> > ring_buffer_write: writes some data into the ring buffer.
> > 
> > ring_buffer_peek: Look at a next item in the cpu buffer.
> > ring_buffer_consume: get the next item in the cpu buffer and
> > consume it. That is, this function increments the head
> > pointer.
> 
> Here too, I would think that consuming data would modify the tail pointer.

I always get confused by how the head/tail pointers translate to
producer/consumer.

Here I have the producer adding to the tail, and the consumer reading from
the head. Perhaps this is backwards? I could change it.

s/head/foobar/g
s/tail/head/g
s/foobar/tail/g

That could do it.


> > 
> > Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> 
> Just trying to understand the terminology before I look at the code so I'm
> sorry if I have just completely misunderstood.

Sure, thanks.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
  2008-09-26 18:30           ` Richard Holden
@ 2008-09-26 18:59           ` Peter Zijlstra
  2008-09-26 19:46             ` Martin Bligh
  2008-09-26 19:14           ` Peter Zijlstra
                             ` (4 subsequent siblings)
  6 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26 18:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:

> +struct buffer_page {
> +	u64		time_stamp;
> +	unsigned char	body[];
> +};
> +
> +#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))

Since you're already using the page frame, you can stick this per page
timestamp in there as well, and get the full page for data.

You can either use a struct page overlay like slob does, or add a u64 in
the union that contains struct {private, mapping}.
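
Roughly something like this (just a sketch -- the fields in the anonymous
struct have to mirror the layout of struct page for everything the buffer
code touches, so double check against mm_types.h):

struct buffer_page {
	union {
		struct {
			unsigned long	flags;		/* mandatory, mirrors struct page */
			atomic_t	_count;		/* mandatory, mirrors struct page */
			u64		time_stamp;	/* per page time stamp */
			struct list_head list;		/* list of buffer pages */
		};
		struct page	page;
	};
};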




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
  2008-09-26 18:30           ` Richard Holden
  2008-09-26 18:59           ` Peter Zijlstra
@ 2008-09-26 19:14           ` Peter Zijlstra
  2008-09-26 22:28             ` Mike Travis
  2008-09-26 19:17           ` Peter Zijlstra
                             ` (3 subsequent siblings)
  6 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26 19:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo,
	Mike Travis

On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> +struct ring_buffer {
> +       unsigned long           size;
> +       unsigned                pages;
> +       unsigned                flags;
> +       int                     cpus;
> +       atomic_t                record_disabled;
> +
> +       struct mutex            mutex;
> +
> +       /* FIXME: this should be online CPUS */
> +       struct ring_buffer_per_cpu *buffers[NR_CPUS];

actually nr_possible makes sense, and you might consider always
allocating buffers (and keeping them for offlined cpus) to avoid massive
allocations/frees on cpu-hotplug events.

Mike Travis has been going over the kernel removing constructs like
this, and replacing them with dynamically allocated arrays of
nr_possible.

> +};
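
i.e. something like this instead of the NR_CPUS array (sketch only,
reusing the names from the patch):

	struct ring_buffer_per_cpu **buffers;

	buffers = kzalloc(num_possible_cpus() * sizeof(*buffers),
			  GFP_KERNEL);
	if (!buffers)
		return NULL;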


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
                             ` (2 preceding siblings ...)
  2008-09-26 19:14           ` Peter Zijlstra
@ 2008-09-26 19:17           ` Peter Zijlstra
  2008-09-26 23:16             ` Arjan van de Ven
  2008-09-26 20:08           ` Peter Zijlstra
                             ` (2 subsequent siblings)
  6 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26 19:17 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo,
	Arjan van de Ven

On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> +#define CHECK_COND(buffer, cond)                       \
> +       if (unlikely(cond)) {                           \
> +               atomic_inc(&buffer->record_disabled);   \
> +               WARN_ON(1);                             \
> +               return -1;                              \
> +       }

Arjan, any preferences wrt kerneloops.org?


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:59           ` Peter Zijlstra
@ 2008-09-26 19:46             ` Martin Bligh
  2008-09-26 19:52               ` Steven Rostedt
  2008-09-26 21:37               ` Steven Rostedt
  0 siblings, 2 replies; 102+ messages in thread
From: Martin Bligh @ 2008-09-26 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

On Fri, Sep 26, 2008 at 11:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
>
>> +struct buffer_page {
>> +     u64             time_stamp;
>> +     unsigned char   body[];
>> +};
>> +
>> +#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
>
> Since you're already using the page frame, you can stick this per page
> timestamp in there as well, and get the full page for data.
>
> You can either use a struct page overlay like slob does, or add a u64 in
> the union that contains struct {private, mapping}.

What did you guys think of Mathieu's idea of sticking the buffer length
in the header here, rather than using padding events? Seemed cleaner
to me.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 19:46             ` Martin Bligh
@ 2008-09-26 19:52               ` Steven Rostedt
  2008-09-26 21:37               ` Steven Rostedt
  1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 19:52 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Martin Bligh wrote:

> On Fri, Sep 26, 2008 at 11:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> >
> >> +struct buffer_page {
> >> +     u64             time_stamp;
> >> +     unsigned char   body[];
> >> +};
> >> +
> >> +#define BUF_PAGE_SIZE (PAGE_SIZE - sizeof(u64))
> >
> > Since you're already using the page frame, you can stick this per page
> > timestamp in there as well, and get the full page for data.
> >
> > You can either use a struct page overlay like slob does, or add a u64 in
> > the union that contains struct {private, mapping}.
> 
> What did you guys think of Mathieu's idea of sticking the buffer length
> in the header here, rather than using padding events? Seemed cleaner
> to me.

Actually I like the padding. This way when I move the event pointer
forward, I only need to compare it to a constant (PAGE_SIZE), or test to
see if the event is padding.  If I place the length into the buffer page
header instead, I will always have to compare against that pointer.
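
Roughly, the read side advance is just (modeled on
ring_buffer_advance_head() in the patch, shown only to illustrate the
comparison):

	event = rb_page_index(cpu_buffer->head_page, cpu_buffer->head);
	if (ring_buffer_null_event(event)) {
		/* rest of the page is padding, hop to the next page */
		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
		rb_reset_read_page(cpu_buffer);
	} else
		cpu_buffer->head += ring_buffer_event_length(event);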

But I guess I could change it to that if needed. That doesn't affect the 
API, as it is only internal.

I'm almost done with v7, perhaps I might try that with v8 to see if I like 
it better.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
                             ` (3 preceding siblings ...)
  2008-09-26 19:17           ` Peter Zijlstra
@ 2008-09-26 20:08           ` Peter Zijlstra
  2008-09-26 21:14             ` Masami Hiramatsu
  2008-09-26 21:13           ` [PATCH v7] " Steven Rostedt
  2008-09-26 22:31           ` [PATCH v6] Unified trace buffer Arnaldo Carvalho de Melo
  6 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-26 20:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> +static void
> +rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned
> nr_pages)
> +{
> +       struct page *page;
> +       struct list_head *p;
> +       unsigned i;
> +
> +       atomic_inc(&cpu_buffer->record_disabled);

You probably want synchronize_sched() here (and similar other places) to
ensure any active writer on the corresponding cpu is actually stopped.

Which suggests you want to use something like ring_buffer_lock_cpu() and
implement that as above.
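
Something like this (sketch only, ring_buffer_lock_cpu() is just a
suggested name, not an existing function):

static void ring_buffer_lock_cpu(struct ring_buffer_per_cpu *cpu_buffer)
{
	atomic_inc(&cpu_buffer->record_disabled);
	/* writers run with irqs off, so wait for them all to leave */
	synchronize_sched();
}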

> +       for (i = 0; i < nr_pages; i++) {
> +               BUG_ON(list_empty(&cpu_buffer->pages));
> +               p = cpu_buffer->pages.next;
> +               page = list_entry(p, struct page, lru);
> +               list_del_init(&page->lru);
> +               __free_page(page);
> +       }
> +       BUG_ON(list_empty(&cpu_buffer->pages));
> +
> +       __ring_buffer_reset_cpu(cpu_buffer);
> +
> +       check_pages(cpu_buffer);
> +
> +       atomic_dec(&cpu_buffer->record_disabled);
> +
> +}


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v7] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
                             ` (4 preceding siblings ...)
  2008-09-26 20:08           ` Peter Zijlstra
@ 2008-09-26 21:13           ` Steven Rostedt
  2008-09-27  2:02             ` [PATCH v8] " Steven Rostedt
  2008-09-26 22:31           ` [PATCH v6] Unified trace buffer Arnaldo Carvalho de Melo
  6 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 21:13 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


[
  Changes since v6:

  - Added a shift debug test to exercise both timestamp normalization
    and the handling of large time deltas. ftrace records too quickly
    to get large deltas :-/

  - Fixed some minor issues with keeping track of time.

  - used slob hack to put more information in the page struct and now
    have the full buffer page free for data. Thanks to Peter Zijlstra
    for suggesting the idea.

  - have the buffer use a cpu mask (initialized to cpu_possible_map)
    to track which cpus to allocate buffers for.

  - fixed entries counting.

  - use DIV_ROUND_UP macro (also suggested by Peter)
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..21]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
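
Or, as a quick calculation (this mirrors rb_calculate_event_length() in
the patch, shown only to make the arithmetic explicit):

	size = data_size;
	if (size > 28)			/* RB_MAX_SMALL_DATA */
		size += 4;		/* array[0] holds the length */
	size += 4;			/* the event header */
	size = ALIGN(size, 4);		/* keep 4 byte alignment */

	/* data_size ==  7:  7     + 4 = 11 -> 12 bytes */
	/* data_size == 82: 82 + 4 + 4 = 90 -> 92 bytes */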

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data payload.

ring_buffer_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, where as consumer producer will
	throw away new data if the consumer catches up with the
	producer.  The consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.
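
A typical write with the reserve/commit pair then looks something like
this (sketch only, the entry struct is made up for the example):

struct my_entry {
	unsigned long	ip;
	unsigned long	parent_ip;
};

static void trace_it(struct ring_buffer *buffer,
		     unsigned long ip, unsigned long parent_ip)
{
	struct ring_buffer_event *event;
	struct my_entry *entry;
	unsigned long flags;

	event = ring_buffer_lock_reserve(buffer, sizeof(*entry), &flags);
	if (!event)
		return;	/* buffer full or recording disabled */

	entry = ring_buffer_event_data(event);
	entry->ip = ip;
	entry->parent_ip = parent_ip;

	ring_buffer_unlock_commit(buffer, event, flags);
}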

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.
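
A live reader using the consuming interface boils down to something
like this (sketch only, process_record() stands in for whatever the
reader does with the copied out data):

extern void process_record(void *data, u64 ts);	/* made up consumer hook */

static void drain_cpu(struct ring_buffer *buffer, int cpu)
{
	struct ring_buffer_event *event;
	u64 ts;

	while ((event = ring_buffer_consume(buffer, cpu, &ts))) {
		/* copy the data out, the writer may reuse the slot */
		process_record(ring_buffer_event_data(event), ts);
	}
}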

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables the cpu buffer, until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to the writer wrapping around.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.

I still need to implement the GTOD feature. But we need support from
the cpu frequency infrastructure.  But this can be done at a later
time without affecting the ring buffer interface.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  179 +++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1525 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1709 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-26 14:16:54.000000000 -0400
@@ -0,0 +1,179 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+};
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 * how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4-1] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-26 17:01:53.000000000 -0400
@@ -0,0 +1,1525 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+#include "trace.h"
+
+#define DEBUG_SHIFT 15
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	/* shift to debug/test normalization and TIME_EXTENTS */
+	return sched_clock() << DEBUG_SHIFT;
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	/* Just stupid testing the normalize function and deltas */
+	*ts >>= DEBUG_SHIFT;
+}
+
+#define for_each_buffer_cpu(buffer, cpu)		\
+	for_each_cpu_mask(cpu, buffer->cpumask)
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	(~TS_MASK)
+
+/*
+ * This hack stolen from mm/slob.c.
+ * We can store per page timing information in the page frame of the page.
+ * Thanks to Peter Zijlstra for suggesting this idea.
+ */
+struct buffer_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			u64	time_stamp;	/* page time stamp */
+			struct list_head list;	/* linked list of free pages */
+		};
+		struct page page;
+	};
+};
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+#define BUF_PAGE_SIZE PAGE_SIZE
+
+/*
+ * If head_page == tail_page && head == tail, then the buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct list_head	pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	struct buffer_page	*head_page;
+	struct buffer_page	*tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	cpumask_t		cpumask;
+	atomic_t		record_disabled;
+
+	struct mutex		mutex;
+
+	struct ring_buffer_per_cpu **buffers;
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct buffer_page		*head_page;
+	u64				read_stamp;
+};
+
+#define CHECK_COND(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	CHECK_COND(cpu_buffer, head->next->prev != head);
+	CHECK_COND(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		CHECK_COND(cpu_buffer, page->lru.next->prev != &page->lru);
+		CHECK_COND(cpu_buffer, page->lru.prev->next != &page->lru);
+	}
+
+	return 0;
+}
+
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	LIST_HEAD(pages);
+	struct page *page, *tmp;
+	unsigned long addr;
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = virt_to_page(addr);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void
+ring_buffer_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int bsize;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	buffer->cpumask = cpu_possible_map;
+	buffer->cpus = num_possible_cpus();
+
+	bsize = sizeof(void*) * nr_cpu_ids;
+	buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
+				  GFP_KERNEL);
+	if (!buffer->buffers)
+		goto fail_free_buffer;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		buffer->buffers[cpu] =
+			ring_buffer_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_buffer_cpu(buffer, cpu) {
+		if (buffer->buffers[cpu])
+			ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+	kfree(buffer->buffers);
+
+ fail_free_buffer:
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for_each_buffer_cpu(buffer, cpu)
+		ring_buffer_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &cpu_buffer->pages);
+	}
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns -1 on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long buffer_size;
+	LIST_HEAD(pages);
+	unsigned long addr;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct page *page, *tmp;
+	int i, cpu;
+
+	size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for_each_buffer_cpu(buffer, cpu) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * by allocating all the pages and storing them on a local
+	 * linked list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM;
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = virt_to_page(addr);
+			list_add(&page->lru, &pages);
+		}
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static inline int
+ring_buffer_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int
+ring_buffer_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *
+rb_page_index(struct buffer_page *page, unsigned index)
+{
+	void *addr;
+
+	addr = page_address(&page->page);
+	return addr + index;
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+ring_buffer_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void
+ring_buffer_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < BUF_PAGE_SIZE;
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_index(cpu_buffer->head_page, head);
+		if (ring_buffer_null_event(event))
+			break;
+		/* Only count data entries */
+		if (event->type != RB_TYPE_DATA)
+			continue;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void
+ring_buffer_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct buffer_page **page)
+{
+	struct list_head *p = (*page)->list.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct buffer_page, list);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	cpu_buffer->tail_page->time_stamp = *ts;
+	cpu_buffer->write_stamp = *ts;
+}
+
+static void
+rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->read_stamp = cpu_buffer->head_page->time_stamp;
+	cpu_buffer->head = 0;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	iter->read_stamp = iter->head_page->time_stamp;
+	iter->head = 0;
+}
+
+/**
+ * ring_buffer_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+ring_buffer_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusion */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__ring_buffer_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+			   unsigned type, unsigned long length, u64 *ts)
+{
+	struct buffer_page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct buffer_page *next_page = tail_page;
+
+		ring_buffer_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			ring_buffer_update_overflow(cpu_buffer);
+
+			ring_buffer_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail = 0;
+		tail_page = next_page;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	ring_buffer_update_event(event, type, length);
+
+	return event;
+}
+
+static struct ring_buffer_event *
+ring_buffer_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+			       unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __ring_buffer_reserve_next(cpu_buffer,
+							   RB_TYPE_TIME_EXTENT,
+							   RB_LEN_TIME_EXTENT,
+							   &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (cpu_buffer->tail) {
+				/* Still on same page, update timestamp */
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+				/* commit the time event */
+				cpu_buffer->tail +=
+					ring_buffer_event_length(event);
+				cpu_buffer->write_stamp = ts;
+			}
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __ring_buffer_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(*flags);
+	return NULL;
+}
+
+static void
+__ring_buffer_commit(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	cpu_buffer->tail += ring_buffer_event_length(event);
+	cpu_buffer->write_stamp += event->time_delta;
+	cpu_buffer->entries++;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+	__ring_buffer_commit(cpu_buffer, event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = -EBUSY;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = ring_buffer_reserve_next_event(cpu_buffer,
+					       RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+
+	__ring_buffer_commit(cpu_buffer, event);
+
+	ret = 0;
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+		if (!cpu_isset(cpu, buffer->cpumask))
+			continue;
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+ring_buffer_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	event = ring_buffer_head_event(cpu_buffer);
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	if (event->type == RB_TYPE_DATA)
+		cpu_buffer->entries--;
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_head_event(cpu_buffer);
+	if (ring_buffer_null_event(event) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_head(cpu_buffer);
+}
+
+static void
+ring_buffer_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (ring_buffer_null_event(event)) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	event = ring_buffer_iter_head_event(iter);
+	if (ring_buffer_null_event(event) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		ring_buffer_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (ring_buffer_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = ring_buffer_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		ring_buffer_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		ring_buffer_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ * @cpu: the cpu to read the buffer from
+ * @ts: a variable to store the event's timestamp into
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ring_buffer_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	ring_buffer_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+__ring_buffer_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	__ring_buffer_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset all the per CPU buffers of a ring buffer
+ * @buffer: The ring buffer to reset
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for_each_buffer_cpu(buffer, cpu)
+		__ring_buffer_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!ring_buffer_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return ring_buffer_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ * @cpu: the CPU buffer to swap
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-26 14:16:45.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-26 14:16:54.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-26 14:16:45.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-26 14:16:54.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o
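
For reference, a minimal usage sketch of the reserve/commit and consuming
read interfaces above (not part of the patch; my_payload, its fields and
the example function are made up for illustration):

#include <linux/ring_buffer.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/smp.h>

struct my_payload {
	int	pid;
	u64	value;
};

static void example_write_and_read(struct ring_buffer *buffer)
{
	struct ring_buffer_event *event;
	struct my_payload *p;
	unsigned long flags;
	u64 ts;

	/* reserve room for the payload, fill it in, then commit */
	event = ring_buffer_lock_reserve(buffer, sizeof(*p), &flags);
	if (!event)
		return;	/* recording disabled, or full and not overwriting */
	p = ring_buffer_event_data(event);
	p->pid = current->pid;
	p->value = 42;
	ring_buffer_unlock_commit(buffer, event, flags);

	/* consuming read (assumes we are still on the same CPU) */
	event = ring_buffer_consume(buffer, raw_smp_processor_id(), &ts);
	if (event) {
		p = ring_buffer_event_data(event);
		pr_info("pid=%d value=%llu ts=%llu\n", p->pid,
			(unsigned long long)p->value,
			(unsigned long long)ts);
	}
}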


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 20:08           ` Peter Zijlstra
@ 2008-09-26 21:14             ` Masami Hiramatsu
  2008-09-26 21:26               ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Masami Hiramatsu @ 2008-09-26 21:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt, Arnaldo Carvalho de Melo

Peter Zijlstra wrote:
> On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
>> +static void
>> +rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned
>> nr_pages)
>> +{
>> +       struct page *page;
>> +       struct list_head *p;
>> +       unsigned i;
>> +
>> +       atomic_inc(&cpu_buffer->record_disabled);
> 
> You probably want synchronize_sched() here (and similar other places) to
> ensure any active writer on the corresponding cpu is actually stopped.

Would it really be done in the buffer layer?
I think it should be done by each tracer, because buffer layer
can't ensure truly active writers have stopped.

Thank you,

> 
> Which suggests you want to use something like ring_buffer_lock_cpu() and
> implement that as above.
> 
>> +       for (i = 0; i < nr_pages; i++) {
>> +               BUG_ON(list_empty(&cpu_buffer->pages));
>> +               p = cpu_buffer->pages.next;
>> +               page = list_entry(p, struct page, lru);
>> +               list_del_init(&page->lru);
>> +               __free_page(page);
>> +       }
>> +       BUG_ON(list_empty(&cpu_buffer->pages));
>> +
>> +       __ring_buffer_reset_cpu(cpu_buffer);
>> +
>> +       check_pages(cpu_buffer);
>> +
>> +       atomic_dec(&cpu_buffer->record_disabled);
>> +
>> +}
> 

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 21:14             ` Masami Hiramatsu
@ 2008-09-26 21:26               ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 21:26 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Steven Rostedt, Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Masami Hiramatsu wrote:

> Peter Zijlstra wrote:
> > On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> >> +static void
> >> +rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned
> >> nr_pages)
> >> +{
> >> +       struct page *page;
> >> +       struct list_head *p;
> >> +       unsigned i;
> >> +
> >> +       atomic_inc(&cpu_buffer->record_disabled);
> > 
> > You probably want synchronize_sched() here (and similar other places) to
> > ensure any active writer on the corresponding cpu is actually stopped.
> 
> Would it really be done in the buffer layer?
> I think it should be done by each tracer, because buffer layer
> can't ensure truly active writers have stopped.
> 

Actually it can ;-)

Since all writes to the buffer at least disable preemption, by issuing a 
synchronize_sched, we can guarantee that after disabling the record, all 
activity will be done.
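
Roughly, the pattern being discussed is (a sketch only, not the final code):

	/* keep new writers out of this cpu buffer */
	atomic_inc(&cpu_buffer->record_disabled);

	/*
	 * Writers run with preemption disabled, so once every CPU has
	 * scheduled, any writer that started before record_disabled
	 * was set has finished.
	 */
	synchronize_sched();

	/* ... now safe to remove or rearrange pages ... */

	atomic_dec(&cpu_buffer->record_disabled);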

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 19:46             ` Martin Bligh
  2008-09-26 19:52               ` Steven Rostedt
@ 2008-09-26 21:37               ` Steven Rostedt
  1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 21:37 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Martin Bligh wrote:
> 
> What did you guys think of Mathieu's idea of sticking the buffer length
> in the header here, rather than using padding events? Seemed cleaner
> to me.
> 

OK, I just implemented the size field in the page struct. Seems to work 
well. I'm still keeping the padded event, so in the future, if we ever 
map these pages to userspace or files, these holes will have a type.
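
Roughly something like this (a sketch of the idea only, field names made
up; the real struct still overlays struct page as in the patch):

	struct buffer_page {
		u64			time_stamp;	/* page time stamp */
		unsigned		size;		/* data bytes on this page */
		struct list_head	list;		/* list of buffer pages */
	};

Presumably the writer bumps ->size as it commits, while still laying down
a padding event in any hole at the end of the page.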

Will post later today, need to actually enter a real life for a bit.

-- Steve

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 19:14           ` Peter Zijlstra
@ 2008-09-26 22:28             ` Mike Travis
  2008-09-26 23:56               ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Travis @ 2008-09-26 22:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

Peter Zijlstra wrote:
> On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
>> +struct ring_buffer {
>> +       unsigned long           size;
>> +       unsigned                pages;
>> +       unsigned                flags;
>> +       int                     cpus;
>> +       atomic_t                record_disabled;
>> +
>> +       struct mutex            mutex;
>> +
>> +       /* FIXME: this should be online CPUS */
>> +       struct ring_buffer_per_cpu *buffers[NR_CPUS];
> 
> actually nr_possible makes sense, and you might consider always
> allocating buffers (and keeping them for offlined cpus) to avoid massive
> allocations/frees cpu-hotplug events.
> 
> Mike Travis has been going over the kernel removing constructs like
> this, and replacing them with dynamically allocated arrays of
> nr_possible.
> 
>> +};

The other thing to consider is using a percpu variable.

Cheers,
Mike

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
                             ` (5 preceding siblings ...)
  2008-09-26 21:13           ` [PATCH v7] " Steven Rostedt
@ 2008-09-26 22:31           ` Arnaldo Carvalho de Melo
  2008-09-26 23:58             ` Steven Rostedt
  6 siblings, 1 reply; 102+ messages in thread
From: Arnaldo Carvalho de Melo @ 2008-09-26 22:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt

Em Fri, Sep 26, 2008 at 02:05:44PM -0400, Steven Rostedt escreveu:
> + */
> +static inline void *
> +ring_buffer_event_data(struct ring_buffer_event *event)
> +{
> +	BUG_ON(event->type != RB_TYPE_DATA);
> +	/* If length is in len field, then array[0] has the data */
> +	if (event->len)
> +		return (void *)&event->array[0];
> +	/* Otherwise length is in array[0] and array[1] has the data */
> +	return (void *)&event->array[1];

Nitpick: Why cast to void *?

And sometimes you use the rb_ prefix, in other cases you use the longer
for ring_buffer_, is the ring_ namespace already used? Or can we make it
use rb_ consistently to shorten the names?

- Arnaldo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 19:17           ` Peter Zijlstra
@ 2008-09-26 23:16             ` Arjan van de Ven
  0 siblings, 0 replies; 102+ messages in thread
From: Arjan van de Ven @ 2008-09-26 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Fri, 26 Sep 2008 21:17:13 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> > +#define CHECK_COND(buffer, cond)                       \
> > +       if (unlikely(cond)) {                           \
> > +               atomic_inc(&buffer->record_disabled);   \
> > +               WARN_ON(1);                             \
> > +               return -1;                              \
> > +       }
> 
> Arjan, any preferences wrt kerneloops.org?

this works; if you also want to print something use WARN() instead
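
e.g. something like (sketch, WARN() returns the condition it tested):

	#define CHECK_COND(buffer, cond)					\
		if (WARN(unlikely(cond), "ring buffer check failed\n")) {	\
			atomic_inc(&buffer->record_disabled);			\
			return -1;						\
		}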

> 


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 22:28             ` Mike Travis
@ 2008-09-26 23:56               ` Steven Rostedt
  2008-09-27  0:05                 ` Mike Travis
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 23:56 UTC (permalink / raw)
  To: Mike Travis
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Mike Travis wrote:

> Peter Zijlstra wrote:
> > On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
> >> +struct ring_buffer {
> >> +       unsigned long           size;
> >> +       unsigned                pages;
> >> +       unsigned                flags;
> >> +       int                     cpus;
> >> +       atomic_t                record_disabled;
> >> +
> >> +       struct mutex            mutex;
> >> +
> >> +       /* FIXME: this should be online CPUS */
> >> +       struct ring_buffer_per_cpu *buffers[NR_CPUS];
> > 
> > actually nr_possible makes sense, and you might consider always
> > allocating buffers (and keeping them for offlined cpus) to avoid massive
> > allocations/frees cpu-hotplug events.
> > 
> > Mike Travis has been going over the kernel removing constructs like
> > this, and replacing them with dynamically allocated arrays of
> > nr_possible.
> > 
> >> +};
> 
> The other thing to consider is using a percpu variable.

This structure is allocated on request.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 22:31           ` [PATCH v6] Unified trace buffer Arnaldo Carvalho de Melo
@ 2008-09-26 23:58             ` Steven Rostedt
  2008-09-27  0:13               ` Linus Torvalds
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-26 23:58 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt


On Fri, 26 Sep 2008, Arnaldo Carvalho de Melo wrote:

> Em Fri, Sep 26, 2008 at 02:05:44PM -0400, Steven Rostedt escreveu:
> > + */
> > +static inline void *
> > +ring_buffer_event_data(struct ring_buffer_event *event)
> > +{
> > +	BUG_ON(event->type != RB_TYPE_DATA);
> > +	/* If length is in len field, then array[0] has the data */
> > +	if (event->len)
> > +		return (void *)&event->array[0];
> > +	/* Otherwise length is in array[0] and array[1] has the data */
> > +	return (void *)&event->array[1];
> 
> Nitpick: Why cast to void *?

5 day hacking marathon, I cast everything ;-)

> 
> And sometimes you use the rb_ prefix, in other cases you use the longer
> for ring_buffer_, is the ring_ namespace already used? Or can we make it
> use rb_ consistently to shorten the names?

I started using the rb_ because I was constantly breaking the 80 character 
line limit with ring_buffer ;-)   OK, for v8, I'll rename all static 
internal functions to rb_ and keep the global ones ring_buffer_

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 23:56               ` Steven Rostedt
@ 2008-09-27  0:05                 ` Mike Travis
  2008-09-27  0:18                   ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Travis @ 2008-09-27  0:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

Steven Rostedt wrote:
> On Fri, 26 Sep 2008, Mike Travis wrote:
> 
>> Peter Zijlstra wrote:
>>> On Fri, 2008-09-26 at 14:05 -0400, Steven Rostedt wrote:
>>>> +struct ring_buffer {
>>>> +       unsigned long           size;
>>>> +       unsigned                pages;
>>>> +       unsigned                flags;
>>>> +       int                     cpus;
>>>> +       atomic_t                record_disabled;
>>>> +
>>>> +       struct mutex            mutex;
>>>> +
>>>> +       /* FIXME: this should be online CPUS */
>>>> +       struct ring_buffer_per_cpu *buffers[NR_CPUS];
>>> actually nr_possible makes sense, and you might consider always
>>> allocating buffers (and keeping them for offlined cpus) to avoid massive
>>> allocations/frees cpu-hotplug events.
>>>
>>> Mike Travis has been going over the kernel removing constructs like
>>> this, and replacing them with dynamically allocated arrays of
>>> nr_possible.
>>>
>>>> +};
>> The other thing to consider is using a percpu variable.
> 
> This structure is allocated on request.
> 
> -- Steve


Ahh, then it would need the yet to be added cpu_alloc() from Christoph.

Your best bet then is to allocate based on nr_cpu_ids.

Cheers,
Mike

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-26 23:58             ` Steven Rostedt
@ 2008-09-27  0:13               ` Linus Torvalds
  2008-09-27  0:23                 ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2008-09-27  0:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt



On Fri, 26 Sep 2008, Steven Rostedt wrote:
> 
> I started using the rb_ because I was constantly breaking the 80 character 
> line limit with ring_buffer ;-)   OK, for v8, I'll rename all static 
> internal functions to rb_ and keep the global ones ring_buffer_

It would probably be better to use something else than 'rb_', because that 
prefix is already used by the red-black trees, and exported as such (eg 
"rb_next()" etc).

But at least as long as it's static, it's probably not _too_ noticeable if 
the rest of the names don't overlap. We _do_ include <linux/rbtree.h>  
almost everywhere, since we use those things in the VM, in timers etc, so 
it comes in through pretty much all headers.

			Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-27  0:05                 ` Mike Travis
@ 2008-09-27  0:18                   ` Steven Rostedt
  2008-09-27  0:46                     ` Mike Travis
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  0:18 UTC (permalink / raw)
  To: Mike Travis
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Mike Travis wrote:
> Steven Rostedt wrote:
> >> The other thing to consider is using a percpu variable.
> > 
> > This structure is allocated on request.
> > 
> > -- Steve
> 
> 
> Ahh, then it would need the yet to be added cpu_alloc() from Christoph.

We can always change this later.

> 
> Your best bet then is to allocate based on nr_cpu_ids.

Actually in this case I chose num_possible_cpus(). Reason being is that 
later I may add an interface to allow the user to select which CPUs they 
want to trace, and this will only allocate a subset of CPU buffers. 
(not going to implement that in the first release).

But to lay the ground work, I set a buffers->cpumask to be that of all the 
cpus with buffers allocated. For now that mask is set to cpu_possible_map. 
Since num_possible_cpus() is defined as cpus_weight_nr(cpu_possible_map) 
I figured that was the better choice.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-27  0:13               ` Linus Torvalds
@ 2008-09-27  0:23                 ` Steven Rostedt
  2008-09-27  0:28                   ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  0:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt


On Fri, 26 Sep 2008, Linus Torvalds wrote:
> > I started using the rb_ because I was constantly breaking the 80 character 
> > line limit with ring_buffer ;-)   OK, for v8, I'll rename all static 
> > internal functions to rb_ and keep the global ones ring_buffer_
> 
> It would probably be better to use something else than 'rb_', because that 
> prefix is already used by the red-black trees, and exported as such (eg 
> "rb_next()" etc).

Good point.

> 
> But at least as long as it's static, it's probably not _too_ noticeable if 
> the rest of the names don't overlap. We _do_ include <linux/rbtree.h>  
> almost everywhere, since we use those things in the VM, in timers etc, so 
> it comes in through pretty much all headers.

Well, I just compiled it and it didn't have any name collisions, but that 
doesn't mean that this won't change in the future.

What would you suggest?  buffer_ ? ringbuf_ ?

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-27  0:23                 ` Steven Rostedt
@ 2008-09-27  0:28                   ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  0:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arnaldo Carvalho de Melo, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt

On Fri, 26 Sep 2008, Steven Rostedt wrote:
> > 
> > But at least as long as it's static, it's probably not _too_ noticeable if 
> > the rest of the names don't overlap. We _do_ include <linux/rbtree.h>  
> > almost everywhere, since we use those things in the VM, in timers etc, so 
> > it comes in through pretty much all headers.
> 
> Well, I just compiled it and it didn't have any name collisions, but that 
> doesn't mean that this won't change in the future.

For kicks I just added #include <linux/rbtree.h> and it still passed. I 
don't think we'll be adding new functions to rbtree.h, so it may be 
fine to stay with the rb_ prefix.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-27  0:18                   ` Steven Rostedt
@ 2008-09-27  0:46                     ` Mike Travis
  2008-09-27  0:52                       ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Travis @ 2008-09-27  0:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

Steven Rostedt wrote:
> On Fri, 26 Sep 2008, Mike Travis wrote:
>> Steven Rostedt wrote:
>>>> The other thing to consider is using a percpu variable.
>>> This structure is allocated on request.
>>>
>>> -- Steve
>>
>> Ahh, then it would need the yet to be added cpu_alloc() from Christoph.
> 
> We can always change this later.
> 
>> Your best bet then is to allocate based on nr_cpu_ids.
> 
> Actually in this case I chose num_possible_cpus(). The reason is that 
> later I may add an interface to allow the user to select which CPUs they 
> want to trace, and this will only allocate a subset of CPU buffers. 
> (not going to implement that in the first release).
> 
> But to lay the ground work, I set a buffers->cpumask to be that of all the 
> cpus with buffers allocated. For now that mask is set to cpu_possible_map. 
> Since num_possible_cpus() is defined as cpus_weight_nr(cpu_possible_map) 
> I figured that was the better choice.
> 
> -- Steve

One problem though, it's *theoretically* possible for num_possible to be
less than nr_cpu_ids and a cpu index may extend past the end of your
allocated array.  This would happen if  the cpu indices are allocated
some other way than as each cpu is discovered.  For example, a system
might want a group of cpus in one section (say by node, or socket) and
then a hole in the cpu_possible_map until the next group.  nr_cpu_ids
is guaranteed to be the highest possible cpu + 1.
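
As a rough illustration (hypothetical cpu numbering; "buffers" below is just
a stand-in for the per cpu pointer array, not code from the patch):

	/*
	 * Hypothetical example: cpus 0, 1, 4 and 5 are possible
	 * (a hole at 2 and 3):
	 *
	 *   num_possible_cpus() == 4   (weight of cpu_possible_map)
	 *   nr_cpu_ids          == 6   (highest possible cpu + 1)
	 *
	 * An array of num_possible_cpus() entries gets overrun when
	 * indexed by cpu 4 or 5, so size it by nr_cpu_ids instead:
	 */
	buffers = kzalloc(nr_cpu_ids * sizeof(*buffers), GFP_KERNEL);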

Cheers,
Mike

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v6] Unified trace buffer
  2008-09-27  0:46                     ` Mike Travis
@ 2008-09-27  0:52                       ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  0:52 UTC (permalink / raw)
  To: Mike Travis
  Cc: Peter Zijlstra, LKML, Ingo Molnar, Thomas Gleixner,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Fri, 26 Sep 2008, Mike Travis wrote:
> Steven Rostedt wrote:
> > 
> > But to lay the ground work, I set a buffers->cpumask to be that of all the 
> > cpus with buffers allocated. For now that mask is set to cpu_possible_map. 
> > Since num_possible_cpus() is defined as cpus_weight_nr(cpu_possible_map) 
> > I figured that was the better choice.
> > 
> > -- Steve
> 
> One problem though, it's *theoretically* possible for num_possible to be
> less than nr_cpu_ids and a cpu index may extend past the end of your
> allocated array.  This would happen if  the cpu indices are allocated
> some other way than as each cpu is discovered.  For example, a system
> might want a group of cpus in one section (say by node, or socket) and
> then a hole in the cpu_possible_map until the next group.  nr_cpu_ids
> is guaranteed to be the highest possible cpu + 1.

Thanks for the explanation. I'll change buffer->cpus to be set to 
nr_cpu_ids.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v8] Unified trace buffer
  2008-09-26 21:13           ` [PATCH v7] " Steven Rostedt
@ 2008-09-27  2:02             ` Steven Rostedt
  2008-09-27  6:06               ` [PATCH v9] " Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  2:02 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


[
  Changes since v7:

  - added the size of the data in the page into the page frame.
    Suggested by Martin Bligh and Mathieu Desnoyers

  - Converted all static functions to be named with a rb_ prefix.
    This may conflict with rbtree functions in the future, but if
    this does happen, we will need to rename the functions in this
    file. The rb_ prefixed functions here are all static, so it only
    affects this code. Thanks to Arnaldo Carvalho de Melo.

  - Added some synchronize_sched() where record_disabled is
    incremented. There are other places that expect the caller
    to handle it. Suggested by Peter Zijlstra.

  - Use nr_cpu_ids for max cpu. Thanks to Mike Travis.
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
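
The sizes above boil down to roughly the following arithmetic (just a
sketch, not code from this patch; ALIGN as used elsewhere in the kernel):

	/* Sketch of the size arithmetic above, not part of the patch */
	static unsigned sketch_event_size(unsigned payload)
	{
		unsigned size = ALIGN(payload, 4);	/* data is 4 byte aligned */

		if (payload > 28)
			size += 4;	/* array[0] carries the real length */

		return size + 4;	/* plus the 4 byte header */
	}

	/* sketch_event_size(7) == 12, sketch_event_size(82) == 92 */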

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data payload.

ring_buffer_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, where as consumer producer will
	throw away new data if the consumer catches up with the
	producer.  The consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables the cpu buffer, until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to the writer wrapping.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.

I still need to implement the GTOD feature, but that requires support from
the cpu frequency infrastructure.  This can be done at a later
time without affecting the ring buffer interface.
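
To make the intended usage concrete, here is a minimal, untested sketch of
a writer and a reader built only on the calls above (the payload and sizes
are placeholders):

	/* Minimal, untested usage sketch -- not part of the patch */
	static void ring_buffer_usage_sketch(void)
	{
		struct ring_buffer *buffer;
		struct ring_buffer_event *event;
		unsigned long flags;
		char data[] = "hello";
		u64 ts;

		/* 1MB per cpu, consumer/producer mode (no overwrite) */
		buffer = ring_buffer_alloc(1 << 20, 0);
		if (!buffer)
			return;

		/* writer: reserve space, copy the payload in, commit */
		event = ring_buffer_lock_reserve(buffer, sizeof(data), &flags);
		if (event) {
			memcpy(ring_buffer_event_data(event), data, sizeof(data));
			ring_buffer_unlock_commit(buffer, event, flags);
		}

		/* reader: consume the next event on this cpu */
		event = ring_buffer_consume(buffer, raw_smp_processor_id(), &ts);
		if (event)
			printk(KERN_INFO "read '%s' at %llu\n",
			       (char *)ring_buffer_event_data(event), ts);

		ring_buffer_free(buffer);
	}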

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  179 ++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1584 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1768 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-26 14:16:54.000000000 -0400
@@ -0,0 +1,179 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+};
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 * how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4-1] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-26 21:55:29.000000000 -0400
@@ -0,0 +1,1584 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>	/* used for sched_clock() (for now) */
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+/* This needs to be somewhere else */
+#ifdef CONFIG_SMP
+# define __raw_assert_spin_is_locked(lock) \
+	BUG_ON(!__raw_spin_is_locked(lock))
+#else
+# define __raw_assert_spin_is_locked(lock) do { } while (0)
+#endif
+
+/* Up this if you want to test the TIME_EXTENTS and normalization */
+#define DEBUG_SHIFT 0
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	/* shift to debug/test normalization and TIME_EXTENTS */
+	return sched_clock() << DEBUG_SHIFT;
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	/* Just stupid testing the normalize function and deltas */
+	*ts >>= DEBUG_SHIFT;
+}
+
+#define for_each_buffer_cpu(buffer, cpu)		\
+	for_each_cpu_mask(cpu, buffer->cpumask)
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	(~TS_MASK)
+
+/*
+ * This hack stolen from mm/slob.c.
+ * We can store per page timing information in the page frame of the page.
+ * Thanks to Peter Zijlstra for suggesting this idea.
+ */
+struct buffer_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			u64	time_stamp;	/* page time stamp */
+			unsigned size;		/* size of page data */
+			struct list_head list;	/* linked list of free pages */
+		};
+		struct page page;
+	};
+};
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int
+test_time_stamp(unsigned long long delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+#define BUF_PAGE_SIZE PAGE_SIZE
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct list_head	pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	struct buffer_page	*head_page;
+	struct buffer_page	*tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	cpumask_t		cpumask;
+	atomic_t		record_disabled;
+
+	struct mutex		mutex;
+
+	struct ring_buffer_per_cpu **buffers;
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct buffer_page		*head_page;
+	u64				read_stamp;
+};
+
+#define CHECK_COND(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * rb_check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	CHECK_COND(cpu_buffer, head->next->prev != head);
+	CHECK_COND(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		CHECK_COND(cpu_buffer, page->lru.next->prev != &page->lru);
+		CHECK_COND(cpu_buffer, page->lru.prev->next != &page->lru);
+	}
+
+	return 0;
+}
+
+static unsigned rb_head_size(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page->size;
+}
+
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	LIST_HEAD(pages);
+	struct page *page, *tmp;
+	unsigned long addr;
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = virt_to_page(addr);
+		list_add(&page->lru, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	rb_check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int bsize;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	buffer->cpumask = cpu_possible_map;
+	buffer->cpus = nr_cpu_ids;
+
+	bsize = sizeof(void *) * nr_cpu_ids;
+	buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
+				  GFP_KERNEL);
+	if (!buffer->buffers)
+		goto fail_free_buffer;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		buffer->buffers[cpu] =
+			rb_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_buffer_cpu(buffer, cpu) {
+		if (buffer->buffers[cpu])
+			rb_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+	kfree(buffer->buffers);
+
+ fail_free_buffer:
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct page, lru);
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &cpu_buffer->pages);
+	}
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns -1 on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long buffer_size;
+	LIST_HEAD(pages);
+	unsigned long addr;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct page *page, *tmp;
+	int i, cpu;
+
+	size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for_each_buffer_cpu(buffer, cpu) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * by allocating all the pages and storing them on a local
+	 * link list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM;
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = virt_to_page(addr);
+			list_add(&page->lru, &pages);
+		}
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, lru) {
+		list_del_init(&page->lru);
+		__free_page(page);
+	}
+	return -ENOMEM;
+}
+
+static inline int rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int rb_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *rb_page_index(struct buffer_page *page, unsigned index)
+{
+	void *addr;
+
+	addr = page_address(&page->page);
+	return addr + index;
+}
+
+static inline struct ring_buffer_event *
+rb_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+rb_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void rb_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < rb_head_size(cpu_buffer);
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_index(cpu_buffer->head_page, head);
+		BUG_ON(rb_null_event(event));
+		/* Only count data entries */
+		if (event->type != RB_TYPE_DATA)
+			continue;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void rb_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+			       struct buffer_page **page)
+{
+	struct list_head *p = (*page)->list.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct buffer_page, list);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	cpu_buffer->tail_page->time_stamp = *ts;
+	cpu_buffer->write_stamp = *ts;
+}
+
+static void rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->read_stamp = cpu_buffer->head_page->time_stamp;
+	cpu_buffer->head = 0;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	iter->read_stamp = iter->head_page->time_stamp;
+	iter->head = 0;
+}
+
+/**
+ * rb_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+rb_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+		  unsigned type, unsigned long length, u64 *ts)
+{
+	struct buffer_page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct buffer_page *next_page = tail_page;
+
+		rb_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			rb_update_overflow(cpu_buffer);
+
+			rb_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail_page->size = tail;
+		tail_page = next_page;
+		tail_page->size = 0;
+		tail = 0;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	rb_update_event(event, type, length);
+
+	return event;
+}
+
+static struct ring_buffer_event *
+rb_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned type, unsigned long length)
+{
+	unsigned long long ts, delta;
+	struct ring_buffer_event *event;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __rb_reserve_next(cpu_buffer,
+						  RB_TYPE_TIME_EXTENT,
+						  RB_LEN_TIME_EXTENT,
+						  &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (cpu_buffer->tail) {
+				/* Still on same page, update timestamp */
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+				/* commit the time event */
+				cpu_buffer->tail +=
+					ring_buffer_event_length(event);
+				cpu_buffer->write_stamp = ts;
+			}
+			delta = 0;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __rb_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	event->time_delta = delta;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = rb_reserve_next_event(cpu_buffer, RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(*flags);
+	return NULL;
+}
+
+static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
+		      struct ring_buffer_event *event)
+{
+	cpu_buffer->tail += ring_buffer_event_length(event);
+	cpu_buffer->tail_page->size = cpu_buffer->tail;
+	cpu_buffer->write_stamp += event->time_delta;
+	cpu_buffer->entries++;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	__raw_assert_spin_is_locked(&cpu_buffer->lock);
+
+	rb_commit(cpu_buffer, event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = -EBUSY;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = rb_reserve_next_event(cpu_buffer,
+				      RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+
+	rb_commit(cpu_buffer, event);
+
+	ret = 0;
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+		if (!cpu_isset(cpu, buffer->cpumask))
+			continue;
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void rb_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (cpu_buffer->head >= cpu_buffer->head_page->size) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	event = rb_head_event(cpu_buffer);
+
+	if (event->type == RB_TYPE_DATA)
+		cpu_buffer->entries--;
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page */
+	if ((cpu_buffer->head >= cpu_buffer->head_page->size) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		rb_advance_head(cpu_buffer);
+}
+
+static void rb_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (iter->head >= iter->head_page->size) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	event = rb_iter_head_event(iter);
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	if ((iter->head >= iter->head_page->size) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		rb_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	rb_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	rb_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	rb_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset a ring buffer
+ * @buffer: The ring buffer to reset all cpu buffers
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!rb_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 1;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return rb_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and has another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	if (!cpu_isset(cpu, buffer_a->cpumask) ||
+	    !cpu_isset(cpu, buffer_b->cpumask))
+		return -EINVAL;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	/*
+	 * We can't do a synchronize_sched here because this
+	 * function can be called in atomic context.
+	 * Normally this will be called from the same CPU as cpu.
+	 * If not it's up to the caller to protect this.
+	 */
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-26 14:16:45.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-26 14:16:54.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-26 14:16:45.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-26 14:16:54.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v9] Unified trace buffer
  2008-09-27  2:02             ` [PATCH v8] " Steven Rostedt
@ 2008-09-27  6:06               ` Steven Rostedt
  2008-09-27 18:39                 ` Ingo Molnar
  2008-09-29 16:10                 ` [PATCH v10 Golden] " Steven Rostedt
  0 siblings, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27  6:06 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


[
  Changes since version 8:

  Two major bug fixes!

   - Had a mix of referencing the pages linked list with both
     page->lru and buffer_page->list.  Perhaps they luckily
     lined up. But I have no idea why this didn't totally
     crash my box.

   - Missed a write stamp update that would cause funny times
]

From: Steven Rostedt <srostedt@redhat.com>
Subject: Unified trace buffer

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 event types (all internal to the ring buffer; only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data payload is greater than 28
bytes, the first array field holds the full length of the
payload and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
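
The size calculation that both examples follow can be sketched roughly
like this (illustration only, not code from the patch; rb_stored_size
is a made-up name):

static unsigned rb_stored_size(unsigned data_size)
{
	unsigned aligned = (data_size + 3) & ~3u;	/* 4 byte alignment */

	if (data_size > 28)		/* too big for the 3 bit len field */
		return 4 + 4 + aligned;	/* header + array[0] + padded data */

	return 4 + aligned;		/* header + padded data */
}

For 7 bytes this gives 4 + 8 = 12, and for 82 bytes 4 + 4 + 84 = 92,
matching the two examples above.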

Do not reference the above event struct directly. Use the following
functions to gain access to the event, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data payload.

ring_buffer_event_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, where as consumer producer will
	throw away new data if the consumer catches up with the
	producer.  The consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at the next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables recording to the cpu buffer until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed because the writer wrapped the buffer.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.

I still need to implement the GTOD feature, but that needs support from
the cpu frequency infrastructure.  It can be done at a later
time without affecting the ring buffer interface.
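
To show how the calls described above fit together, here is a rough
usage sketch (not part of this patch; error handling is left out and
the int payload and example function are made up for illustration):

static void ring_buffer_example(void)
{
	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	unsigned long flags;
	u64 ts;
	int cpu;

	/* two pages per CPU, producer/consumer mode (no RB_FL_OVERWRITE) */
	buffer = ring_buffer_alloc(2 * PAGE_SIZE, 0);

	/* write side: reserve space, fill in the body, commit */
	event = ring_buffer_lock_reserve(buffer, sizeof(int), &flags);
	if (event) {
		*(int *)ring_buffer_event_data(event) = 42;
		ring_buffer_unlock_commit(buffer, event, flags);
	}

	/* read side: consume events from each per CPU buffer */
	for_each_possible_cpu(cpu) {
		while ((event = ring_buffer_consume(buffer, cpu, &ts)))
			printk(KERN_INFO "ts=%llu val=%d\n",
			       (unsigned long long)ts,
			       *(int *)ring_buffer_event_data(event));
	}

	ring_buffer_free(buffer);
}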

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  179 ++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1594 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1778 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-27 01:59:06.000000000 -0400
@@ -0,0 +1,179 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+};
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 * how much padding is needed
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extend the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4-1] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	(28)
+
+enum {
+	RB_LEN_TIME_EXTENT = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+static inline unsigned
+ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RB_TYPE_TIME_EXTENT:
+		return RB_LEN_TIME_EXTENT;
+
+	case RB_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RB_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+static inline void *
+ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RB_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-27 02:02:11.000000000 -0400
@@ -0,0 +1,1594 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>	/* used for sched_clock() (for now) */
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+/* This needs to be somewhere else */
+#ifdef CONFIG_SMP
+# define __raw_assert_spin_is_locked(lock) \
+	BUG_ON(!__raw_spin_is_locked(lock))
+#else
+# define __raw_assert_spin_is_locked(lock) do { } while (0)
+#endif
+
+/* Up this if you want to test the TIME_EXTENTS and normalization */
+#define DEBUG_SHIFT 0
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	/* shift to debug/test normalization and TIME_EXTENTS */
+	return sched_clock() << DEBUG_SHIFT;
+}
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	/* Just stupid testing the normalize function and deltas */
+	*ts >>= DEBUG_SHIFT;
+}
+
+#define for_each_buffer_cpu(buffer, cpu)		\
+	for_each_cpu_mask(cpu, buffer->cpumask)
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	(~TS_MASK)
+
+/*
+ * This hack stolen from mm/slob.c.
+ * We can store per page timing information in the page frame of the page.
+ * Thanks to Peter Zijlstra for suggesting this idea.
+ */
+struct buffer_page {
+	union {
+		struct {
+			unsigned long flags;	/* mandatory */
+			atomic_t _count;	/* mandatory */
+			u64	time_stamp;	/* page time stamp */
+			unsigned size;		/* size of page data */
+			struct list_head list;	/* linked list of free pages */
+		};
+		struct page page;
+	};
+};
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int test_time_stamp(u64 delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+#define BUF_PAGE_SIZE PAGE_SIZE
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int			cpu;
+	struct ring_buffer	*buffer;
+	raw_spinlock_t		lock;
+	struct lock_class_key	lock_key;
+	struct list_head	pages;
+	unsigned long		head;	/* read from head */
+	unsigned long		tail;	/* write to tail */
+	struct buffer_page	*head_page;
+	struct buffer_page	*tail_page;
+	unsigned long		overrun;
+	unsigned long		entries;
+	u64			write_stamp;
+	u64			read_stamp;
+	atomic_t		record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long		size;
+	unsigned		pages;
+	unsigned		flags;
+	int			cpus;
+	cpumask_t		cpumask;
+	atomic_t		record_disabled;
+
+	struct mutex		mutex;
+
+	struct ring_buffer_per_cpu **buffers;
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct buffer_page		*head_page;
+	u64				read_stamp;
+};
+
+#define CHECK_COND(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct buffer_page *page, *tmp;
+
+	CHECK_COND(cpu_buffer, head->next->prev != head);
+	CHECK_COND(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, list) {
+		CHECK_COND(cpu_buffer, page->list.next->prev != &page->list);
+		CHECK_COND(cpu_buffer, page->list.prev->next != &page->list);
+	}
+
+	return 0;
+}
+
+static unsigned rb_head_size(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page->size;
+}
+
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	LIST_HEAD(pages);
+	struct buffer_page *page, *tmp;
+	unsigned long addr;
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = (struct buffer_page *)virt_to_page(addr);
+		list_add(&page->list, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	rb_check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	cpu_buffer->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct buffer_page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int bsize;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	buffer->cpumask = cpu_possible_map;
+	buffer->cpus = nr_cpu_ids;
+
+	bsize = sizeof(void *) * nr_cpu_ids;
+	buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
+				  GFP_KERNEL);
+	if (!buffer->buffers)
+		goto fail_free_buffer;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		buffer->buffers[cpu] =
+			rb_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_buffer_cpu(buffer, cpu) {
+		if (buffer->buffers[cpu])
+			rb_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+	kfree(buffer->buffers);
+
+ fail_free_buffer:
+	kfree(buffer);
+	return NULL;
+}
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct buffer_page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct buffer_page, list);
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct buffer_page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct buffer_page, list);
+		list_del_init(&page->list);
+		list_add_tail(&page->list, &cpu_buffer->pages);
+	}
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns -ENOMEM on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long buffer_size;
+	LIST_HEAD(pages);
+	unsigned long addr;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct buffer_page *page, *tmp;
+	int i, cpu;
+
+	size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for_each_buffer_cpu(buffer, cpu) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * linked list. If we succeed in our allocation, then we
+	 * link list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM;
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = (struct buffer_page *)virt_to_page(addr);
+			list_add(&page->list, &pages);
+		}
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	return -ENOMEM;
+}
+
+static inline int rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int rb_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RB_TYPE_PADDING;
+}
+
+static inline void *rb_page_index(struct buffer_page *page, unsigned index)
+{
+	void *addr;
+
+	addr = page_address(&page->page);
+	return addr + index;
+}
+
+static inline struct ring_buffer_event *
+rb_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+rb_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void rb_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < rb_head_size(cpu_buffer);
+	     head += ring_buffer_event_length(event)) {
+		event = rb_page_index(cpu_buffer->head_page, head);
+		BUG_ON(rb_null_event(event));
+		/* Only count data entries */
+		if (event->type != RB_TYPE_DATA)
+			continue;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void rb_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+			       struct buffer_page **page)
+{
+	struct list_head *p = (*page)->list.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct buffer_page, list);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	cpu_buffer->tail_page->time_stamp = *ts;
+	cpu_buffer->write_stamp = *ts;
+}
+
+static void rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->read_stamp = cpu_buffer->head_page->time_stamp;
+	cpu_buffer->head = 0;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	iter->read_stamp = iter->head_page->time_stamp;
+	iter->head = 0;
+}
+
+/**
+ * rb_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+rb_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RB_TYPE_PADDING:
+		break;
+
+	case RB_TYPE_TIME_EXTENT:
+		event->len =
+			(RB_LEN_TIME_EXTENT + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RB_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
+
+static struct ring_buffer_event *
+__rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+		  unsigned type, unsigned long length, u64 *ts)
+{
+	struct buffer_page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct buffer_page *next_page = tail_page;
+
+		rb_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			rb_update_overflow(cpu_buffer);
+
+			rb_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RB_TYPE_PADDING;
+		}
+
+		tail_page->size = tail;
+		tail_page = next_page;
+		tail_page->size = 0;
+		tail = 0;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	rb_update_event(event, type, length);
+
+	return event;
+}
+
+static struct ring_buffer_event *
+rb_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned type, unsigned long length)
+{
+	u64 ts, delta;
+	struct ring_buffer_event *event;
+	static int once;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			if (unlikely(delta > (1ULL << 59) && !once++)) {
+				printk(KERN_WARNING "Delta way too big! %llu"
+				       " ts=%llu write stamp = %llu\n",
+				       delta, ts, cpu_buffer->write_stamp);
+				WARN_ON(1);
+			}
+			/*
+			 * The delta is too big, we need to add a
+			 * new timestamp.
+			 */
+			event = __rb_reserve_next(cpu_buffer,
+						  RB_TYPE_TIME_EXTENT,
+						  RB_LEN_TIME_EXTENT,
+						  &ts);
+			if (!event)
+				return NULL;
+
+			/* check to see if we went to the next page */
+			if (cpu_buffer->tail) {
+				/* Still on same page, update timestamp */
+				event->time_delta = delta & TS_MASK;
+				event->array[0] = delta >> TS_SHIFT;
+				/* commit the time event */
+				cpu_buffer->tail +=
+					ring_buffer_event_length(event);
+				cpu_buffer->write_stamp = ts;
+				delta = 0;
+			}
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __rb_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	/* If the reserve went to the next page, our delta is zero */
+	if (!cpu_buffer->tail)
+		delta = 0;
+
+	event->time_delta = delta;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = rb_reserve_next_event(cpu_buffer, RB_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	__raw_spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(*flags);
+	return NULL;
+}
+
+static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
+		      struct ring_buffer_event *event)
+{
+	cpu_buffer->tail += ring_buffer_event_length(event);
+	cpu_buffer->tail_page->size = cpu_buffer->tail;
+	cpu_buffer->write_stamp += event->time_delta;
+	cpu_buffer->entries++;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	__raw_assert_spin_is_locked(&cpu_buffer->lock);
+
+	rb_commit(cpu_buffer, event);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = -EBUSY;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = rb_reserve_next_event(cpu_buffer,
+				      RB_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = ring_buffer_event_data(event);
+
+	memcpy(body, data, length);
+
+	rb_commit(cpu_buffer, event);
+
+	ret = 0;
+ out:
+	__raw_spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+		if (!cpu_isset(cpu, buffer->cpumask))
+			continue;
+		cpu_buffer = buffer->buffers[cpu];
+		__raw_spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		return;
+
+	case RB_TYPE_TIME_EXTENT:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RB_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void rb_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (cpu_buffer->head >= cpu_buffer->head_page->size) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	event = rb_head_event(cpu_buffer);
+
+	if (event->type == RB_TYPE_DATA)
+		cpu_buffer->entries--;
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page */
+	if ((cpu_buffer->head >= cpu_buffer->head_page->size) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		rb_advance_head(cpu_buffer);
+}
+
+static void rb_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (iter->head >= iter->head_page->size) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	event = rb_iter_head_event(iter);
+
+	length = ring_buffer_event_length(event);
+
+	/*
+	 * This should not be called to advance the head if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	if ((iter->head >= iter->head_page->size) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		rb_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_iter_head_event(iter);
+
+	switch (event->type) {
+	case RB_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RB_TYPE_TIME_EXTENT:
+		/* Internal data, OK to advance */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RB_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning, that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	rb_advance_head(cpu_buffer);
+
+	return event;
+}
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	__raw_spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	__raw_spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	rb_advance_iter(iter);
+
+	return event;
+}
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	raw_local_irq_save(flags);
+	__raw_spin_lock(&cpu_buffer->lock);
+
+	rb_reset_cpu(cpu_buffer);
+
+	__raw_spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset a ring buffer
+ * @buffer: The ring buffer to reset all cpu buffers
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!rb_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 1;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return rb_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	if (!cpu_isset(cpu, buffer_a->cpumask) ||
+	    !cpu_isset(cpu, buffer_b->cpumask))
+		return -EINVAL;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	/*
+	 * We can't do a synchronize_sched here because this
+	 * function can be called in atomic context.
+	 * Normally this will be called from the same CPU as cpu.
+	 * If not it's up to the caller to protect this.
+	 */
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-27 01:58:49.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-27 01:59:06.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-27 01:58:49.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-27 01:59:06.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5] Unified trace buffer
  2008-09-26 17:46             ` Steven Rostedt
@ 2008-09-27 17:02               ` Ingo Molnar
  2008-09-27 17:18                 ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-09-27 17:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Arnaldo Carvalho de Melo, Masami Hiramatsu, LKML,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Mathieu Desnoyers, Frank Ch. Eigler, David Wilder, hch,
	Martin Bligh, Christoph Hellwig, Steven Rostedt


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > Indeed. And on some architectures 'packed' will actually mean that 
> > the compiler may think that it's unaligned, and then generate much 
> > worse code to access the fields. So if you align things anyway (and 
> > you do), then 'packed' is the wrong thing to do.
> 
> OK, I'm making v6 now with various cleanups. I'll nuke it on that one.

btw., now that it's getting into shape, could you please fix the ftrace 
portion:

> Subject: [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer
>
> Note: This patch is a proof of concept, and breaks a lot of 
> functionality of ftrace.
>
> This patch simply makes ftrace work with the developmental ring 
> buffer.

... to not have known bugs, so that we could try it in tip/ftrace and 
make sure it works well in practice?

it's a ton of changes already, it would be nice to get to some stable 
known-working state and do delta patches from that point on, and keep 
its 'works well' quality.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v5] Unified trace buffer
  2008-09-27 17:02               ` Ingo Molnar
@ 2008-09-27 17:18                 ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27 17:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Arnaldo Carvalho de Melo, Masami Hiramatsu, LKML,
	Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Mathieu Desnoyers, Frank Ch. Eigler, David Wilder, hch,
	Martin Bligh, Christoph Hellwig, Steven Rostedt


On Sat, 27 Sep 2008, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Subject: [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer
> >
> > Note: This patch is a proof of concept, and breaks a lot of 
> > functionality of ftrace.
> >
> > This patch simply makes ftrace work with the developmental ring 
> > buffer.
> 
> ... to not have known bugs, so that we could try it in tip/ftrace and 
> make sure it works well in practice?
> 
> it's a ton of changes already, it would be nice to get to some stable 
> known-working state and do delta patches from that point on, and keep 
> its 'works well' quality.

OK, the patch that I was using was against Linus's tree. I'll port it over 
to linux-tip on Monday and get it past the "proof of concept" stage. 
Actually, the version I have on my desk works pretty well. The main issue
to solve is that some other tracers and the self test stick their noses
into the buffering system, which would need to be fixed.

There are also some bugs in the status numbers printed in the latency_trace 
header. But I have not hit any bugs with the buffering itself.

I'll clean all this up and send out a patch on Monday. My wife is 
mandating that I do not do any more work over the weekend ;-)

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27  6:06               ` [PATCH v9] " Steven Rostedt
@ 2008-09-27 18:39                 ` Ingo Molnar
  2008-09-27 19:24                   ` Steven Rostedt
  2008-09-29 16:10                 ` [PATCH v10 Golden] " Steven Rostedt
  1 sibling, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-09-27 18:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


small nitpicking review, nothing structural yet:

* Steven Rostedt <rostedt@goodmis.org> wrote:

> Index: linux-trace.git/include/linux/ring_buffer.h
> +enum {
> +	RB_TYPE_PADDING,	/* Left over page padding

RB_ clashes with red-black tree namespace. (on the thought level)

> +#define RB_ALIGNMENT_SHIFT	2
> +#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
> +#define RB_MAX_SMALL_DATA	(28)

no need to put numeric literals into parenthesis.

> +static inline unsigned
> +ring_buffer_event_length(struct ring_buffer_event *event)
> +{
> +	unsigned length;
> +
> +	switch (event->type) {
> +	case RB_TYPE_PADDING:
> +		/* undefined */
> +		return -1;
> +
> +	case RB_TYPE_TIME_EXTENT:
> +		return RB_LEN_TIME_EXTENT;
> +
> +	case RB_TYPE_TIME_STAMP:
> +		return RB_LEN_TIME_STAMP;
> +
> +	case RB_TYPE_DATA:
> +		if (event->len)
> +			length = event->len << RB_ALIGNMENT_SHIFT;
> +		else
> +			length = event->array[0];
> +		return length + RB_EVNT_HDR_SIZE;
> +	default:
> +		BUG();
> +	}
> +	/* not hit */
> +	return 0;

too large, please uninline.

> +static inline void *
> +ring_buffer_event_data(struct ring_buffer_event *event)
> +{
> +	BUG_ON(event->type != RB_TYPE_DATA);
> +	/* If length is in len field, then array[0] has the data */
> +	if (event->len)
> +		return (void *)&event->array[0];
> +	/* Otherwise length is in array[0] and array[1] has the data */
> +	return (void *)&event->array[1];
> +}

ditto.

> +/* FIXME!!! */
> +u64 ring_buffer_time_stamp(int cpu)
> +{
> +	/* shift to debug/test normalization and TIME_EXTENTS */
> +	return sched_clock() << DEBUG_SHIFT;

[ duly noted ;-) ]

> +}
> +void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)

needs extra newline above.

> +/*
> + * head_page == tail_page && head == tail then buffer is empty.
> + */
> +struct ring_buffer_per_cpu {
> +	int			cpu;
> +	struct ring_buffer	*buffer;
> +	raw_spinlock_t		lock;

hm, should not be raw, at least initially. I am 95% sure we'll see 
lockups, we always did when we iterated ftrace's buffer implementation 
;-)

> +struct ring_buffer {
> +	unsigned long		size;
> +	unsigned		pages;
> +	unsigned		flags;
> +	int			cpus;
> +	cpumask_t		cpumask;
> +	atomic_t		record_disabled;
> +
> +	struct mutex		mutex;
> +
> +	struct ring_buffer_per_cpu **buffers;
> +};
> +
> +struct ring_buffer_iter {
> +	struct ring_buffer_per_cpu	*cpu_buffer;
> +	unsigned long			head;
> +	struct buffer_page		*head_page;
> +	u64				read_stamp;

please use consistent vertical whitespaces.  Above, in the struct 
ring_buffer definition, you can add another tab to most of the vars - 
that will also make the '**buffers' line look nice.

same for all structs across this file. In my experience, a 50% vertical 
break works best - the one you used here in 'struct ring_buffer_iter'.

> +};
> +
> +#define CHECK_COND(buffer, cond)			\
> +	if (unlikely(cond)) {				\
> +		atomic_inc(&buffer->record_disabled);	\
> +		WARN_ON(1);				\
> +		return -1;				\
> +	}

please name it RINGBUFFER_BUG_ON() / RINGBUFFER_WARN_ON(), so that we 
dont have to memorize another set of debug names. [ See 
DEBUG_LOCKS_WARN_ON() in include/linux/debug_locks.h ]

you can change it to:

> +static int
> +rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
> +{
> +	struct list_head *head = &cpu_buffer->pages;
> +	LIST_HEAD(pages);
> +	struct buffer_page *page, *tmp;
> +	unsigned long addr;
> +	unsigned i;

please apply ftrace's standard reverse christmas tree style and move the 
'pages' line down two lines.

> +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	unsigned long buffer_size;
> +	LIST_HEAD(pages);
> +	unsigned long addr;
> +	unsigned nr_pages, rm_pages, new_pages;
> +	struct buffer_page *page, *tmp;
> +	int i, cpu;

ditto.

> +static inline void *rb_page_index(struct buffer_page *page, unsigned index)
> +{
> +	void *addr;
> +
> +	addr = page_address(&page->page);

'addr' initialization can move to the definition line - you save two 
lines.

> +	return addr + index;
> +}
> +
> +static inline struct ring_buffer_event *
> +rb_head_event(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	return rb_page_index(cpu_buffer->head_page,
> +			     cpu_buffer->head);

can all move to the same return line.

> +}
> +
> +static inline struct ring_buffer_event *
> +rb_iter_head_event(struct ring_buffer_iter *iter)
> +{
> +	return rb_page_index(iter->head_page,
> +			     iter->head);

ditto.

> +	for (head = 0; head < rb_head_size(cpu_buffer);
> +	     head += ring_buffer_event_length(event)) {
> +		event = rb_page_index(cpu_buffer->head_page, head);
> +		BUG_ON(rb_null_event(event));

( optional:when there's a multi-line loop then i generally try to insert 
  an extra newline when starting the body - to make sure the iterator 
  and the body stands apart visually. Matter of taste. )

> +static struct ring_buffer_event *
> +rb_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
> +		      unsigned type, unsigned long length)
> +{
> +	u64 ts, delta;
> +	struct ring_buffer_event *event;
> +	static int once;
> +
> +	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
> +
> +	if (cpu_buffer->tail) {
> +		delta = ts - cpu_buffer->write_stamp;
> +
> +		if (test_time_stamp(delta)) {
> +			if (unlikely(delta > (1ULL << 59) && !once++)) {
> +				printk(KERN_WARNING "Delta way too big! %llu"
> +				       " ts=%llu write stamp = %llu\n",
> +				       delta, ts, cpu_buffer->write_stamp);
> +				WARN_ON(1);
> +			}
> +			/*
> +			 * The delta is too big, we to add a
> +			 * new timestamp.
> +			 */
> +			event = __rb_reserve_next(cpu_buffer,
> +						  RB_TYPE_TIME_EXTENT,
> +						  RB_LEN_TIME_EXTENT,
> +						  &ts);
> +			if (!event)
> +				return NULL;
> +
> +			/* check to see if we went to the next page */
> +			if (cpu_buffer->tail) {
> +				/* Still on same page, update timestamp */
> +				event->time_delta = delta & TS_MASK;
> +				event->array[0] = delta >> TS_SHIFT;
> +				/* commit the time event */
> +				cpu_buffer->tail +=
> +					ring_buffer_event_length(event);
> +				cpu_buffer->write_stamp = ts;
> +				delta = 0;
> +			}
> +		}
> +	} else {
> +		rb_add_stamp(cpu_buffer, &ts);
> +		delta = 0;
> +	}
> +
> +	event = __rb_reserve_next(cpu_buffer, type, length, &ts);
> +	if (!event)
> +		return NULL;
> +
> +	/* If the reserve went to the next page, our delta is zero */
> +	if (!cpu_buffer->tail)
> +		delta = 0;
> +
> +	event->time_delta = delta;
> +
> +	return event;
> +}

this function is too long, please split it up. The first condition's 
body could go into a separate function i guess.

> +	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
> +				 * array[0] = time delta (28 .. 59)
> +				 * size = 8 bytes
> +				 */

please use standard comment style:

 /*
  * Comment
  */

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 18:39                 ` Ingo Molnar
@ 2008-09-27 19:24                   ` Steven Rostedt
  2008-09-27 19:41                     ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


Hi Ingo,

Thanks for the review!

On Sat, 27 Sep 2008, Ingo Molnar wrote:

> 
> small nitpicking review, nothing structural yet:
> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Index: linux-trace.git/include/linux/ring_buffer.h
> > +enum {
> > +	RB_TYPE_PADDING,	/* Left over page padding
> 
> RB_ clashes with red-black tree namespace. (on the thought level)

Yeah, Linus pointed this out with the rb_ static function names. But since 
the functions are static I kept them as is. But here we have global names.

Would RNGBF_ be OK, or do you have any other ideas?

> 
> > +#define RB_ALIGNMENT_SHIFT	2
> > +#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
> > +#define RB_MAX_SMALL_DATA	(28)
> 
> no need to put numeric literals into parenthesis.

Ah, I think I had it more complex and changed it to a literal without 
removing the parenthesis.

> 
> > +static inline unsigned
> > +ring_buffer_event_length(struct ring_buffer_event *event)
> > +{
> > +	unsigned length;
> > +
> > +	switch (event->type) {
> > +	case RB_TYPE_PADDING:
> > +		/* undefined */
> > +		return -1;
> > +
> > +	case RB_TYPE_TIME_EXTENT:
> > +		return RB_LEN_TIME_EXTENT;
> > +
> > +	case RB_TYPE_TIME_STAMP:
> > +		return RB_LEN_TIME_STAMP;
> > +
> > +	case RB_TYPE_DATA:
> > +		if (event->len)
> > +			length = event->len << RB_ALIGNMENT_SHIFT;
> > +		else
> > +			length = event->array[0];
> > +		return length + RB_EVNT_HDR_SIZE;
> > +	default:
> > +		BUG();
> > +	}
> > +	/* not hit */
> > +	return 0;
> 
> too large, please uninline.

I calculated this on x86_64 to add 78 bytes. Is that still too big?

> 
> > +static inline void *
> > +ring_buffer_event_data(struct ring_buffer_event *event)
> > +{
> > +	BUG_ON(event->type != RB_TYPE_DATA);
> > +	/* If length is in len field, then array[0] has the data */
> > +	if (event->len)
> > +		return (void *)&event->array[0];
> > +	/* Otherwise length is in array[0] and array[1] has the data */
> > +	return (void *)&event->array[1];
> > +}
> 
> ditto.

No biggy. I thought this would be nicer as inline. But I have no problem
changing this.

> 
> > +/* FIXME!!! */
> > +u64 ring_buffer_time_stamp(int cpu)
> > +{
> > +	/* shift to debug/test normalization and TIME_EXTENTS */
> > +	return sched_clock() << DEBUG_SHIFT;
> 
> [ duly noted ;-) ]
> 
> > +}
> > +void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
> 
> needs extra newline above.

Yeah, I kept them bounded just to stress the "FIXME" part ;-)

> 
> > +/*
> > + * head_page == tail_page && head == tail then buffer is empty.
> > + */
> > +struct ring_buffer_per_cpu {
> > +	int			cpu;
> > +	struct ring_buffer	*buffer;
> > +	raw_spinlock_t		lock;
> 
> hm, should not be raw, at least initially. I am 95% sure we'll see 
> lockups, we always did when we iterated ftrace's buffer implementation 
> ;-)

It was to prevent lockdep from checking the locks from inside. We had 
issues with ftrace and lockdep in the past, because ftrace would trace the
internals of lockdep, and lockdep would then recurse back into itself to 
trace.  If lockdep itself can get away with not using raw_spinlocks, then
this will be OK to make back to spinlock.

> 
> > +struct ring_buffer {
> > +	unsigned long		size;
> > +	unsigned		pages;
> > +	unsigned		flags;
> > +	int			cpus;
> > +	cpumask_t		cpumask;
> > +	atomic_t		record_disabled;
> > +
> > +	struct mutex		mutex;
> > +
> > +	struct ring_buffer_per_cpu **buffers;
> > +};
> > +
> > +struct ring_buffer_iter {
> > +	struct ring_buffer_per_cpu	*cpu_buffer;
> > +	unsigned long			head;
> > +	struct buffer_page		*head_page;
> > +	u64				read_stamp;
> 
> please use consistent vertical whitespaces.  Above, in the struct 
> ring_buffer definition, you can add another tab to most of the vars - 
> that will also make the '**buffers' line look nice.

OK, will fix.

> 
> same for all structs across this file. In my experience, a 50% vertical 
> break works best - the one you used here in 'struct ring_buffer_iter'.
> 
> > +};
> > +
> > +#define CHECK_COND(buffer, cond)			\
> > +	if (unlikely(cond)) {				\
> > +		atomic_inc(&buffer->record_disabled);	\
> > +		WARN_ON(1);				\
> > +		return -1;				\
> > +	}
> 
> please name it RINGBUFFER_BUG_ON() / RINGBUFFER_WARN_ON(), so that we 
> dont have to memorize another set of debug names. [ See 
> DEBUG_LOCKS_WARN_ON() in include/linux/debug_locks.h ]

OK, this was a direct copy from what was used in ftrace.

> 
> you can change it to:
> 
> > +static int
> > +rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
> > +{
> > +	struct list_head *head = &cpu_buffer->pages;
> > +	LIST_HEAD(pages);
> > +	struct buffer_page *page, *tmp;
> > +	unsigned long addr;
> > +	unsigned i;
> 
> please apply ftrace's standard reverse christmas tree style and move the 
> 'pages' line down two lines.

Heh, this was directly from a bug I had and laziness ;-)
I originally just had struct list_head pages (and no *tmp), which kept the 
christmas tree format. But later found that you need to initialize list 
heads (duh!), and never moved it.


> 
> > +int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
> > +{
> > +	struct ring_buffer_per_cpu *cpu_buffer;
> > +	unsigned long buffer_size;
> > +	LIST_HEAD(pages);
> > +	unsigned long addr;
> > +	unsigned nr_pages, rm_pages, new_pages;
> > +	struct buffer_page *page, *tmp;
> > +	int i, cpu;
> 
> ditto.

Same reason.

> 
> > +static inline void *rb_page_index(struct buffer_page *page, unsigned index)
> > +{
> > +	void *addr;
> > +
> > +	addr = page_address(&page->page);
> 
> 'addr' initialization can move to the definition line - you save two 
> lines.

Will fix.

> 
> > +	return addr + index;
> > +}
> > +
> > +static inline struct ring_buffer_event *
> > +rb_head_event(struct ring_buffer_per_cpu *cpu_buffer)
> > +{
> > +	return rb_page_index(cpu_buffer->head_page,
> > +			     cpu_buffer->head);
> 
> can all move to the same return line.

Ah, this was caused by my s/ring_buffer_page_index/rb_page_index/ run.

> 
> > +}
> > +
> > +static inline struct ring_buffer_event *
> > +rb_iter_head_event(struct ring_buffer_iter *iter)
> > +{
> > +	return rb_page_index(iter->head_page,
> > +			     iter->head);
> 
> ditto.

Will fix.

> 
> > +	for (head = 0; head < rb_head_size(cpu_buffer);
> > +	     head += ring_buffer_event_length(event)) {
> > +		event = rb_page_index(cpu_buffer->head_page, head);
> > +		BUG_ON(rb_null_event(event));
> 
> ( optional:when there's a multi-line loop then i generally try to insert 
>   an extra newline when starting the body - to make sure the iterator 
>   and the body stands apart visually. Matter of taste. )

Will fix, I have no preference.

> 
> > +static struct ring_buffer_event *
> > +rb_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
> > +		      unsigned type, unsigned long length)
> > +{
> > +	u64 ts, delta;
> > +	struct ring_buffer_event *event;
> > +	static int once;
> > +
> > +	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
> > +
> > +	if (cpu_buffer->tail) {
> > +		delta = ts - cpu_buffer->write_stamp;
> > +
> > +		if (test_time_stamp(delta)) {
> > +			if (unlikely(delta > (1ULL << 59) && !once++)) {
> > +				printk(KERN_WARNING "Delta way too big! %llu"
> > +				       " ts=%llu write stamp = %llu\n",
> > +				       delta, ts, cpu_buffer->write_stamp);
> > +				WARN_ON(1);
> > +			}
> > +			/*
> > +			 * The delta is too big, we to add a
> > +			 * new timestamp.
> > +			 */
> > +			event = __rb_reserve_next(cpu_buffer,
> > +						  RB_TYPE_TIME_EXTENT,
> > +						  RB_LEN_TIME_EXTENT,
> > +						  &ts);
> > +			if (!event)
> > +				return NULL;
> > +
> > +			/* check to see if we went to the next page */
> > +			if (cpu_buffer->tail) {
> > +				/* Still on same page, update timestamp */
> > +				event->time_delta = delta & TS_MASK;
> > +				event->array[0] = delta >> TS_SHIFT;
> > +				/* commit the time event */
> > +				cpu_buffer->tail +=
> > +					ring_buffer_event_length(event);
> > +				cpu_buffer->write_stamp = ts;
> > +				delta = 0;
> > +			}
> > +		}
> > +	} else {
> > +		rb_add_stamp(cpu_buffer, &ts);
> > +		delta = 0;
> > +	}
> > +
> > +	event = __rb_reserve_next(cpu_buffer, type, length, &ts);
> > +	if (!event)
> > +		return NULL;
> > +
> > +	/* If the reserve went to the next page, our delta is zero */
> > +	if (!cpu_buffer->tail)
> > +		delta = 0;
> > +
> > +	event->time_delta = delta;
> > +
> > +	return event;
> > +}
> 
> this function is too long, please split it up. The first condition's 
> body could go into a separate function i guess.

Will fix.

> 
> > +	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
> > +				 * array[0] = time delta (28 .. 59)
> > +				 * size = 8 bytes
> > +				 */
> 
> please use standard comment style:
> 
>  /*
>   * Comment
>   */

Hmm, this is interesting. I kind of like this because it is not really a 
standard comment. It is a comment about the definitions of the enum. I 
believe if they are above:

  /*
   * Comment
   */
   RB_ENUM_TYPE,

It is not as readable. But if we do:

   RB_ENUM_TYPE,	/*
			 * Comment
			 */

The comment is not at the same line as the enum, which also looks 
unpleasing.

We can't do:

			/*
   RB_ENUM_TYPE,	 * Comment
			 */
			/*
   RB_ENUM_TYPE2,	 * Comment
			 */

Because the ENUM is also in the comment :-p


I chose this way because we have:

  RB_ENUM_TYPE,		/* Comment
			 * More comment
			 */
  RB_ENUM_TYPE2,	/* Comment
			 */

Since I find this the nicest way to describe enums. That last */ is 
good to space the comments apart, otherwise we have:

  RB_ENUM_TYPE,		/* Comment
			 * More comment */
  RB_ENUM_TYPE2,	/* Comment */

That makes it harder to see the separation between one enum's description 
and the next.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 19:24                   ` Steven Rostedt
@ 2008-09-27 19:41                     ` Ingo Molnar
  2008-09-27 19:54                       ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-09-27 19:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > Index: linux-trace.git/include/linux/ring_buffer.h
> > > +enum {
> > > +	RB_TYPE_PADDING,	/* Left over page padding
> > 
> > RB_ clashes with red-black tree namespace. (on the thought level)
> 
> Yeah, Linus pointed this out with the rb_ static function names. But since 
> the functions are static I kept them as is. But here we have global names.
> 
> Would RNGBF_ be OK, or do you have any other ideas?

that's even worse i think :-/ And this isnt bikeshed-painting really, 
the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read. 
(as the rbtree constants are in capitals as well and similarly named)

 RING_TYPE_PADDING

or:

 RINGBUF_TYPE_PADDING

yes, it's longer, but still, saner.

> > too large, please uninline.
> 
> I calculated this on x86_64 to add 78 bytes. Is that still too big?

yes, way too big. Sometimes we make savings from a 10 bytes function 
already. (but it's always case dependent - if a function has a lot of 
parameters then uninlining can hurt)

the only exception would be if there's normally only a single 
instantiation per tracer, and if it's in the absolute tracing hotpath. 

> > hm, should not be raw, at least initially. I am 95% sure we'll see 
> > lockups, we always did when we iterated ftrace's buffer 
> > implementation ;-)
> 
> It was to prevent lockdep from checking the locks from inside. We had 
> issues with ftrace and lockdep in the past, because ftrace would trace 
> the internals of lockdep, and lockdep would then recurse back into 
> itself to trace.  If lockdep itself can get away with not using 
> raw_spinlocks, then this will be OK to make back to spinlock.

would be nice to make sure that ftrace's recursion checks work as 
intended - and the same goes for lockdep's recursion checks. Yes, we had 
problems in this area, and it would be nice to make sure it all works 
fine. (or fix it if it doesn't)


> > > +	for (head = 0; head < rb_head_size(cpu_buffer);
> > > +	     head += ring_buffer_event_length(event)) {
> > > +		event = rb_page_index(cpu_buffer->head_page, head);
> > > +		BUG_ON(rb_null_event(event));
> > 
> > ( optional:when there's a multi-line loop then i generally try to insert 
> >   an extra newline when starting the body - to make sure the iterator 
> >   and the body stands apart visually. Matter of taste. )
> 
> Will fix, I have no preference.

clarification: multi-line loop _condition_. It's pretty rare (this is 
such a case) but sometimes unavoidable - and then the newline helps 
visually.

> > > +	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
> > > +				 * array[0] = time delta (28 .. 59)
> > > +				 * size = 8 bytes
> > > +				 */
> > 
> > please use standard comment style:
> > 
> >  /*
> >   * Comment
> >   */
> 
> Hmm, this is interesting. I kind of like this because it is not really a 
> standard comment. It is a comment about the definitions of the enum. I 
> believe if they are above:
> 
>   /*
>    * Comment
>    */
>    RB_ENUM_TYPE,
> 
> It is not as readable. But if we do:

no, it is not readable. My point was that you should do:
> 
>    RB_ENUM_TYPE,	/*
> 			 * Comment
> 			 */
> 
> The comment is not at the same line as the enum, which also looks 
> unpleasing.

but you did:

>    RB_ENUM_TYPE,	/* Comment
> 			 */

So i suggested to fix it to:

 +	RB_TYPE_TIME_EXTENT,	/*
 +				 * Extent the time delta
 +				 * array[0] = time delta (28 .. 59)
 +				 * size = 8 bytes
 +				 */

ok? I.e. "comment" should have the same visual properties as other 
comments.

I fully agree with moving it next to the enum, i sometimes use that 
style too, it's a nice touch and more readable in this case than 
comment-ahead. (which we use for statements)

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 19:41                     ` Ingo Molnar
@ 2008-09-27 19:54                       ` Steven Rostedt
  2008-09-27 20:00                         ` Ingo Molnar
  2008-09-27 20:07                         ` Martin Bligh
  0 siblings, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-27 19:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Sat, 27 Sep 2008, Ingo Molnar wrote:
> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> that's even worse i think :-/ And this isnt bikeshed-painting really, 
> the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read. 
> (as the rbtree constants are in capitals as well and similarly named)
> 
>  RING_TYPE_PADDING
> 
> or:
> 
>  RINGBUF_TYPE_PADDING
> 
> yes, it's longer, but still, saner.

I don't mind the extra typing, it is just a bit more difficult to keep in
the 80 character line limit.

> 
> > > too large, please uninline.
> > 
> > I calculated this on x86_64 to add 78 bytes. Is that still too big?
> 
> yes, way too big. Sometimes we make savings from a 10 bytes function 
> already. (but it's always case dependent - if a function has a lot of 
> parameters then uninlining can hurt)
> 
> the only exception would be if there's normally only a single 
> instantiation per tracer, and if it's in the absolute tracing hotpath. 

It is a hot path in the internals. Perhaps I'll make an inline function
in the internal code "rb_event_length" and have the other users call it:

unsigned ring_buffer_event_length(struct ring_buffer_event *event)
{
	return rb_event_length(event);
}

> no, it is not readable. My point was that you should do:
> > 
> >    RB_ENUM_TYPE,	/*
> > 			 * Comment
> > 			 */
> > 
> > The comment is not at the same line as the enum, which also looks 
> > unpleasing.
> 
> but you did:
> 
> >    RB_ENUM_TYPE,	/* Comment
> > 			 */
> 
> So i suggested to fix it to:
> 
>  +	RB_TYPE_TIME_EXTENT,	/*
>  +				 * Extent the time delta
>  +				 * array[0] = time delta (28 .. 59)
>  +				 * size = 8 bytes
>  +				 */
> 
> ok? I.e. "comment" should have the same visual properties as other 
> comments.
> 
> I fully agree with moving it next to the enum, i sometimes use that 
> style too, it's a nice touch and more readable in this case than 
> comment-ahead. (which we use for statements)

But then we have:

        RB_TYPE_PADDING,        /*
				 * Left over page padding
                                 * array is ignored
                                 * size is variable depending on
                                 * how much padding is needed
                                 */
        RB_TYPE_TIME_EXTENT,    /*
				 * Extent the time delta
                                 * array[0] = time delta (28 .. 59)
                                 * size = 8 bytes
                                 */

Where it is not as easy to see which comment goes with which enum, 
especially when you have many enums. That's why I like the method I used:

        RB_TYPE_PADDING,        /* Left over page padding
                                 * array is ignored
                                 * size is variable depending on
                                 * how much padding is needed
                                 */
        RB_TYPE_TIME_EXTENT,    /* Extent the time delta
                                 * array[0] = time delta (28 .. 59)
                                 * size = 8 bytes
                                 */

Where it is very easy to notice which comment goes with which enum.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 19:54                       ` Steven Rostedt
@ 2008-09-27 20:00                         ` Ingo Molnar
  2008-09-29 15:05                           ` Steven Rostedt
  2008-09-27 20:07                         ` Martin Bligh
  1 sibling, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-09-27 20:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> >  RINGBUF_TYPE_PADDING
> > 
> > yes, it's longer, but still, saner.
> 
> I don't mind the extra typing, it is just a bit more difficult to keep 
> in the 80 character line limit.

that's really not a hard limit, but yeah.

generally, with clean and simple functions it's easy to keep it.

> > yes, way too big. Sometimes we make savings from a 10 bytes function 
> > already. (but it's always case dependent - if a function has a lot 
> > of parameters then uninlining can hurt)
> > 
> > the only exception would be if there's normally only a single 
> > instantiation per tracer, and if it's in the absolute tracing 
> > hotpath.
> 
> It is a hot path in the internals. Perhaps I'll make an inline 
> function in the internal code "rb_event_length" and have the other 
> users call it:
> 
> unsigned ring_buffer_event_length(struct ring_buffer_event *event)
> {
> 	return rb_event_length(event);
> }

yeah, sounds sane.

> > no, it is not readable. My point was that you should do:
> > > 
> > >    RB_ENUM_TYPE,	/*
> > > 			 * Comment
> > > 			 */
> > > 
> > > The comment is not at the same line as the enum, which also looks 
> > > unpleasing.
> > 
> > but you did:
> > 
> > >    RB_ENUM_TYPE,	/* Comment
> > > 			 */
> > 
> > So i suggested to fix it to:
> > 
> >  +	RB_TYPE_TIME_EXTENT,	/*
> >  +				 * Extent the time delta
> >  +				 * array[0] = time delta (28 .. 59)
> >  +				 * size = 8 bytes
> >  +				 */
> > 
> > ok? I.e. "comment" should have the same visual properties as other 
> > comments.
> > 
> > I fully agree with moving it next to the enum, i sometimes use that 
> > style too, it's a nice touch and more readable in this case than 
> > comment-ahead. (which we use for statements)
> 
> But then we have:
> 
>         RB_TYPE_PADDING,        /*
> 				   * Left over page padding
>                                  * array is ignored
>                                  * size is variable depending on
>                                  * how much padding is needed
>                                  */
>         RB_TYPE_TIME_EXTENT,    /*
> 				   * Extent the time delta
>                                  * array[0] = time delta (28 .. 59)
>                                  * size = 8 bytes
>                                  */
> 
> Where it is not as easy to see which comment is with which enum. 
> Especially when you have many enums. That's why I like the method I 
> used with:

> 
>         RB_TYPE_PADDING,        /* Left over page padding
>                                  * array is ignored
>                                  * size is variable depending on
>                                  * how much padding is needed
>                                  */
>         RB_TYPE_TIME_EXTENT,    /* Extent the time delta
>                                  * array[0] = time delta (28 .. 59)
>                                  * size = 8 bytes
>                                  */
> 
> Where it is very easy to notice which comment goes with which enum.

this:

>         RB_TYPE_PADDING,        /*
> 				   * Left over page padding
>                                  * array is ignored
>                                  * size is variable depending on
>                                  * how much padding is needed
>                                  */
>
>         RB_TYPE_TIME_EXTENT,    /*
> 				   * Extent the time delta
>                                  * array[0] = time delta (28 .. 59)
>                                  * size = 8 bytes
>                                  */

or:

>	/*
>	 * Left over page padding. 'array' is ignored,
>	 * 'size' is variable depending on how much padding is needed.
>	 */
>	RB_TYPE_PADDING,
>
>	/*
>	 * Extent the time delta,
>	 * array[0] = time delta (28 .. 59), size = 8 bytes
>	 */
>	RB_TYPE_TIME_EXTENT,

oh, btw., that's a spelling mistake: s/extent/extend ?

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 19:54                       ` Steven Rostedt
  2008-09-27 20:00                         ` Ingo Molnar
@ 2008-09-27 20:07                         ` Martin Bligh
  2008-09-27 20:34                           ` Ingo Molnar
  1 sibling, 1 reply; 102+ messages in thread
From: Martin Bligh @ 2008-09-27 20:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

>> that's even worse i think :-/ And this isnt bikeshed-painting really,
>> the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read.
>> (as the rbtree constants are in capitals as well and similarly named)
>>
>>  RING_TYPE_PADDING
>>
>> or:
>>
>>  RINGBUF_TYPE_PADDING
>>
>> yes, it's longer, but still, saner.
>
> I don't mind the extra typing, it is just a bit more difficult to keep in
> the 80 character line limit.

Would using tb_ (trace buffer) rather than rb_ help ?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 20:07                         ` Martin Bligh
@ 2008-09-27 20:34                           ` Ingo Molnar
  0 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-09-27 20:34 UTC (permalink / raw)
  To: Martin Bligh
  Cc: Steven Rostedt, LKML, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Mathieu Desnoyers,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Martin Bligh <mbligh@mbligh.org> wrote:

> >> that's even worse i think :-/ And this isnt bikeshed-painting really,
> >> the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read.
> >> (as the rbtree constants are in capitals as well and similarly named)
> >>
> >>  RING_TYPE_PADDING
> >>
> >> or:
> >>
> >>  RINGBUF_TYPE_PADDING
> >>
> >> yes, it's longer, but still, saner.
> >
> > I don't mind the extra typing, it is just a bit more difficult to keep in
> > the 80 character line limit.
> 
> Would using tb_ (trace buffer) rather than rb_ help ?

excellent idea ...

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v9] Unified trace buffer
  2008-09-27 20:00                         ` Ingo Molnar
@ 2008-09-29 15:05                           ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-29 15:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Thomas Gleixner, Peter Zijlstra, Andrew Morton, prasad,
	Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Sat, 27 Sep 2008, Ingo Molnar wrote:
> 
> > > no, it is not readable. My point was that you should do:
> > > > 
> > > >    RB_ENUM_TYPE,	/*
> > > > 			 * Comment
> > > > 			 */
> > > > 
> > > > The comment is not at the same line as the enum, which also looks 
> > > > unpleasing.
> > > 
> > > but you did:
> > > 
> > > >    RB_ENUM_TYPE,	/* Comment
> > > > 			 */
> > > 

OK, I did a quick survey of what others did in include/linux to handle 
multi line comments for enums. I ignored the single line comments since 
that is pretty standard. Here's what I found:

Those that do:

enum myenum {
	ENUM_PING_PONG,		/* Bounce a ball back and forth
				   till you have a winner. */
	ENUM_HONEY_CONE,	/* Soft and sweet a yummy for
				   the tummy. */
};

include/linux/atmdev.h
include/linux/fd.h
include/linux/hil.h
include/linux/if_pppol2tp.h
include/linux/ivtv.h
include/linux/libata.h
include/linux/mmzone.h
include/linux/reiserfs_fs.h
include/linux/reiserfs_fs_sb.h
include/linux/rtnetlink.h
include/linux/scc.h
include/linux/videodev2.h

Those that do:

enum myenum {
        ENUM_PING_PONG,         /* Bounce a ball back and forth */
                                /* till you have a winner.      */
        ENUM_HONEY_CONE,        /* Soft and sweet a yummy for   */
                                /* the tummy.                   */
};

include/linux/atmsvc.h
include/linux/pktcdvd.h


Those that do (what I did):

enum myenum {
        ENUM_PING_PONG,         /* Bounce a ball back and forth
                                 * till you have a winner.
				 */
        ENUM_HONEY_CONE,        /* Soft and sweet a yummy for
                                 * the tummy.
				 */
};

include/linux/buffer_head.h (with space between the two enums)
include/linux/personality.h


Those that do:

enum myenum {
	/*
	 * Bounce a ball back and forth
	 * till you have a winner.
	 */
        ENUM_PING_PONG,
	/*
	 * Soft and sweet a yummy for
	 * the tummy.
	 */
        ENUM_HONEY_CONE,
};

include/linux/cgroup.h
include/linux/cn_proc.h
include/linux/exportfs.h
include/linux/fb.h
include/linux/hil_mlc.h
include/linux/pci.h
include/linux/reiserfs_fs_i.h


And finally Doc book style:

/**
 * enum myenum
 * @ENUM_PING_PONG: Bounce a ball back and forth
 *                  till you have a winner.
 * @ENUM_HONEY_CONE: Soft and sweet a yummy for
 *                   the tummy.
 */
enum myenum {
	ENUM_PING_PONG,
	ENUM_HONEY_CONE,
};

Note I did not see any enum users that did what you asked:

enum myenum {
        ENUM_PING_PONG,		/*
				 * Bounce a ball back and forth
				 * till you have a winner.
				 */
        ENUM_HONEY_CONE,	/*
				 * Soft and sweet a yummy for
				 * the tummy.
				 */
};

So by adding that, I will be adding yet another format.

Actually I think the docbook style is the most appropriate for me. I'll go 
with that one.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH v10 Golden] Unified trace buffer
  2008-09-27  6:06               ` [PATCH v9] " Steven Rostedt
  2008-09-27 18:39                 ` Ingo Molnar
@ 2008-09-29 16:10                 ` Steven Rostedt
  2008-09-29 16:11                   ` Steven Rostedt
  2008-09-29 23:35                   ` Mathieu Desnoyers
  1 sibling, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-29 16:10 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


[
  This is the final version of this patch. From now on, I will be sending
  changes on top of this patch.

  Changes since v9:

  All suggestions from Ingo Molnar.

  - Changed comment of enum to DocBook style.

  - Replaced the RB_TYPE_ enums with RINGBUF_TYPE_ prefixes to avoid
    name collision with rbtree. Note, I did not use the TB_ extension
    because I envision a "trace_buffer" layer on top of this layer
    in the future.

  - Moved ring_buffer_event_{length,data} into the .c file and added
    internal inlines. External users will need to call the functions.

  - Broke out rb_add_time_stamp function from rb_reserve_next_event.

  - Made the cpu_buffer->lock a normal spinlock again.

  - The rest are style changes.

]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RINGBUF_TYPE_TIME_EXTEND: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).
	(A short sketch of how the full delta is put back together
	follows these type descriptions.)

RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RINGBUF_TYPE_DATA: The event actually holds user data.
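
For the TIME_EXTEND case, here is a minimal sketch of the arithmetic the
ring buffer's reader side uses internally to rebuild the full delta.
Illustration only, not part of this patch; the helper name is made up,
and TS_SHIFT is the 27-bit shift defined in ring_buffer.c:

/* hypothetical helper, for illustration only */
static u64 example_extended_delta(struct ring_buffer_event *event)
{
	/* bits 0..26 come from time_delta, bits 27..58 from array[0] */
	return ((u64)event->array[0] << TS_SHIFT) | event->time_delta;
}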

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RINGBUF_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RINGBUF_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..21]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
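
For illustration only (not part of this patch, and the helper name is
made up), the size arithmetic of the two examples above can be written
out as:

/* hypothetical helper, for illustration only */
static unsigned example_stored_size(unsigned data_len)
{
	unsigned aligned = ALIGN(data_len, 4);	/* events are 4 byte aligned */

	if (data_len <= 28)		/* length fits in the 3 bit len field */
		return 4 + aligned;	/* 7 bytes  -> 4 + 8 = 12 bytes */

	return 4 + 4 + aligned;		/* 82 bytes -> 4 + 4 + 84 = 92 bytes */
}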

Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.

ring_buffer_event_length(event): get the length of the event.
	This is the size of the memory used to record this
	event, and not the size of the data pay load.

ring_buffer_time_delta(event): get the time delta of the event
	This returns the delta time stamp since the last event.
	Note: Even though this is in the header, there should
		be no reason to access this directly, except
		for debugging.

ring_buffer_event_data(event): get the data from the event
	This is the function to use to get the actual data
	from the event. Note, it is only a pointer to the
	data inside the buffer. This data must be copied to
	another location otherwise you risk it being written
	over in the buffer.

ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.

ring_buffer_alloc: create a new ring buffer. Can choose between
	overwrite or consumer/producer mode. Overwrite will
	overwrite old data, where as consumer producer will
	throw away new data if the consumer catches up with the
	producer.  The consumer/producer is the default.

ring_buffer_free: free the ring buffer.

ring_buffer_resize: resize the buffer. Changes the size of each cpu
	buffer. Note, it is up to the caller to ensure that
	the buffer is not being used while this is happening.
	This requirement may go away but do not count on it.

ring_buffer_lock_reserve: locks the ring buffer and allocates an
	entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
	the buffer.

ring_buffer_write: writes some data into the ring buffer.

ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
	consume it. That is, this function increments the head
	pointer.

ring_buffer_read_start: Start an iterator of a cpu buffer.
	For now, this disables the cpu buffer, until you issue
	a finish. This is just because we do not want the iterator
	to be overwritten. This restriction may change in the future.
	But note, this is used for static reading of a buffer which
	is usually done "after" a trace. Live readings would want
	to use the ring_buffer_consume above, which will not
	disable the ring buffer.

ring_buffer_read_finish: Finishes the read iterator and reenables
	the ring buffer.

ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
	of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
	of the cpu buffer.

ring_buffer_size: returns the size in bytes of each cpu buffer.
	Note, the real size is this times the number of CPUs.

ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty

ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
	cpu buffer of another buffer. This is handy when you
	want to take a snapshot of a running trace on just one
	cpu. Having a backup buffer to swap with facilitates this.
	Ftrace max latencies use this.

ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.

ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enable_cpu: enable a single cpu buffer.

ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to the writer wrapping the buffer.

ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
	into nanosecs.
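
To make the intended call sequence concrete, here is a minimal usage
sketch (illustration only, not part of this patch; the function name and
payload are made up) that records one event and reads it back with the
interfaces listed above:

static void example_usage(void)
{
	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	unsigned long flags;
	int payload = 1234;
	u64 ts;

	/* at least two pages per cpu, consumer/producer (non-overwrite) mode */
	buffer = ring_buffer_alloc(2 * PAGE_SIZE, 0);
	if (!buffer)
		return;

	/* writer side: reserve space, fill in the data, commit it */
	event = ring_buffer_lock_reserve(buffer, sizeof(payload), &flags);
	if (event) {
		memcpy(ring_buffer_event_data(event), &payload, sizeof(payload));
		ring_buffer_unlock_commit(buffer, event, flags);
	}

	/* reader side: consume it (assumes we read on the same cpu we wrote on) */
	event = ring_buffer_consume(buffer, smp_processor_id(), &ts);
	if (event)
		payload = *(int *)ring_buffer_event_data(event);

	ring_buffer_free(buffer);
}

The single-call ring_buffer_write() above can replace the reserve, copy
and commit steps when the data already sits in its own buffer.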

I still need to implement the GTOD feature, but that needs support from
the cpu frequency infrastructure.  This can be done at a later
time without affecting the ring buffer interface.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  130 +++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1672 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1807 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-29 11:12:32.000000000 -0400
@@ -0,0 +1,130 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use functions below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+};
+
+/**
+ * enum ring_buffer_type - internal ring buffer types
+ *
+ * @RINGBUF_TYPE_PADDING:	Left over page padding
+ *				 array is ignored
+ *				 size is variable depending on how much
+ *				  padding is needed
+ *
+ * @RINGBUF_TYPE_TIME_EXTEND:	Extend the time delta
+ *				 array[0] = time delta (28 .. 59)
+ *				 size = 8 bytes
+ *
+ * @RINGBUF_TYPE_TIME_STAMP:	Sync time stamp with external clock
+ *				 array[0] = tv_nsec
+ *				 array[1] = tv_sec
+ *				 size = 16 bytes
+ *
+ * @RINGBUF_TYPE_DATA:		Data record
+ *				 If len is zero:
+ *				  array[0] holds the actual length
+ *				  array[1..(length+3)/4] holds data
+ *				 else
+ *				  length = len << 2
+ *				  array[0..(length+3)/4-1] holds data
+ */
+enum ring_buffer_type {
+	RINGBUF_TYPE_PADDING,
+	RINGBUF_TYPE_TIME_EXTEND,
+	/* FIXME: RINGBUF_TYPE_TIME_STAMP not implemented */
+	RINGBUF_TYPE_TIME_STAMP,
+	RINGBUF_TYPE_DATA,
+};
+
+unsigned ring_buffer_event_length(struct ring_buffer_event *event);
+void *ring_buffer_event_data(struct ring_buffer_event *event);
+
+/**
+ * ring_buffer_event_time_delta - return the delta timestamp of the event
+ * @event: the event to get the delta timestamp of
+ *
+ * The delta timestamp is the 27 bit timestamp since the last event.
+ */
+static inline unsigned
+ring_buffer_event_time_delta(struct ring_buffer_event *event)
+{
+	return event->time_delta;
+}
+
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags);
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags);
+
+/*
+ * size is in bytes for each per CPU buffer.
+ */
+struct ring_buffer *
+ring_buffer_alloc(unsigned long size, unsigned flags);
+void ring_buffer_free(struct ring_buffer *buffer);
+
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags);
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags);
+int ring_buffer_write(struct ring_buffer *buffer,
+		      unsigned long length, void *data);
+
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts);
+
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu);
+void ring_buffer_read_finish(struct ring_buffer_iter *iter);
+
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+
+unsigned long ring_buffer_size(struct ring_buffer *buffer);
+
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_reset(struct ring_buffer *buffer);
+
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu);
+
+int ring_buffer_empty(struct ring_buffer *buffer);
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu);
+
+void ring_buffer_record_disable(struct ring_buffer *buffer);
+void ring_buffer_record_enable(struct ring_buffer *buffer);
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
+
+unsigned long ring_buffer_entries(struct ring_buffer *buffer);
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
+
+u64 ring_buffer_time_stamp(int cpu);
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
+
+enum ring_buffer_flags {
+	RB_FL_OVERWRITE		= 1 << 0,
+};
+
+#endif /* _LINUX_RING_BUFFER_H */
Index: linux-trace.git/kernel/trace/ring_buffer.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/kernel/trace/ring_buffer.c	2008-09-29 11:37:43.000000000 -0400
@@ -0,0 +1,1672 @@
+/*
+ * Generic ring buffer
+ *
+ * Copyright (C) 2008 Steven Rostedt <srostedt@redhat.com>
+ */
+#include <linux/ring_buffer.h>
+#include <linux/spinlock.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>	/* used for sched_clock() (for now) */
+#include <linux/init.h>
+#include <linux/hash.h>
+#include <linux/list.h>
+#include <linux/fs.h>
+
+/* Up this if you want to test the TIME_EXTENTS and normalization */
+#define DEBUG_SHIFT 0
+
+/* FIXME!!! */
+u64 ring_buffer_time_stamp(int cpu)
+{
+	/* shift to debug/test normalization and TIME_EXTENTS */
+	return sched_clock() << DEBUG_SHIFT;
+}
+
+void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
+{
+	/* Just stupid testing the normalize function and deltas */
+	*ts >>= DEBUG_SHIFT;
+}
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_ALIGNMENT_SHIFT	2
+#define RB_ALIGNMENT		(1 << RB_ALIGNMENT_SHIFT)
+#define RB_MAX_SMALL_DATA	28
+
+enum {
+	RB_LEN_TIME_EXTEND = 8,
+	RB_LEN_TIME_STAMP = 16,
+};
+
+/* inline for ring buffer fast paths */
+static inline unsigned
+rb_event_length(struct ring_buffer_event *event)
+{
+	unsigned length;
+
+	switch (event->type) {
+	case RINGBUF_TYPE_PADDING:
+		/* undefined */
+		return -1;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		return RB_LEN_TIME_EXTEND;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		return RB_LEN_TIME_STAMP;
+
+	case RINGBUF_TYPE_DATA:
+		if (event->len)
+			length = event->len << RB_ALIGNMENT_SHIFT;
+		else
+			length = event->array[0];
+		return length + RB_EVNT_HDR_SIZE;
+	default:
+		BUG();
+	}
+	/* not hit */
+	return 0;
+}
+
+/**
+ * ring_buffer_event_length - return the length of the event
+ * @event: the event to get the length of
+ */
+unsigned ring_buffer_event_length(struct ring_buffer_event *event)
+{
+	return rb_event_length(event);
+}
+
+/* inline for ring buffer fast paths */
+static inline void *
+rb_event_data(struct ring_buffer_event *event)
+{
+	BUG_ON(event->type != RINGBUF_TYPE_DATA);
+	/* If length is in len field, then array[0] has the data */
+	if (event->len)
+		return (void *)&event->array[0];
+	/* Otherwise length is in array[0] and array[1] has the data */
+	return (void *)&event->array[1];
+}
+
+/**
+ * ring_buffer_event_data - return the data of the event
+ * @event: the event to get the data from
+ */
+void *ring_buffer_event_data(struct ring_buffer_event *event)
+{
+	return rb_event_data(event);
+}
+
+#define for_each_buffer_cpu(buffer, cpu)		\
+	for_each_cpu_mask(cpu, buffer->cpumask)
+
+#define TS_SHIFT	27
+#define TS_MASK		((1ULL << TS_SHIFT) - 1)
+#define TS_DELTA_TEST	(~TS_MASK)
+
+/*
+ * This hack stolen from mm/slob.c.
+ * We can store per page timing information in the page frame of the page.
+ * Thanks to Peter Zijlstra for suggesting this idea.
+ */
+struct buffer_page {
+	union {
+		struct {
+			unsigned long	 flags;		/* mandatory */
+			atomic_t	 _count;	/* mandatory */
+			u64		 time_stamp;	/* page time stamp */
+			unsigned	 size;		/* size of page data */
+			struct list_head list;		/* list of free pages */
+		};
+		struct page page;
+	};
+};
+
+/*
+ * We need to fit the time_stamp delta into 27 bits.
+ */
+static inline int test_time_stamp(u64 delta)
+{
+	if (delta & TS_DELTA_TEST)
+		return 1;
+	return 0;
+}
+
+#define BUF_PAGE_SIZE PAGE_SIZE
+
+/*
+ * head_page == tail_page && head == tail then buffer is empty.
+ */
+struct ring_buffer_per_cpu {
+	int				cpu;
+	struct ring_buffer		*buffer;
+	spinlock_t			lock;
+	struct lock_class_key		lock_key;
+	struct list_head		pages;
+	unsigned long			head;	/* read from head */
+	unsigned long			tail;	/* write to tail */
+	struct buffer_page		*head_page;
+	struct buffer_page		*tail_page;
+	unsigned long			overrun;
+	unsigned long			entries;
+	u64				write_stamp;
+	u64				read_stamp;
+	atomic_t			record_disabled;
+};
+
+struct ring_buffer {
+	unsigned long			size;
+	unsigned			pages;
+	unsigned			flags;
+	int				cpus;
+	cpumask_t			cpumask;
+	atomic_t			record_disabled;
+
+	struct mutex			mutex;
+
+	struct ring_buffer_per_cpu	**buffers;
+};
+
+struct ring_buffer_iter {
+	struct ring_buffer_per_cpu	*cpu_buffer;
+	unsigned long			head;
+	struct buffer_page		*head_page;
+	u64				read_stamp;
+};
+
+#define RB_WARN_ON(buffer, cond)			\
+	if (unlikely(cond)) {				\
+		atomic_inc(&buffer->record_disabled);	\
+		WARN_ON(1);				\
+		return -1;				\
+	}
+
+/**
+ * rb_check_pages - integrity check of buffer pages
+ * @cpu_buffer: CPU buffer with pages to test
+ *
+ * As a safety measure we check to make sure the data pages have not
+ * been corrupted.
+ */
+static int rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct buffer_page *page, *tmp;
+
+	RB_WARN_ON(cpu_buffer, head->next->prev != head);
+	RB_WARN_ON(cpu_buffer, head->prev->next != head);
+
+	list_for_each_entry_safe(page, tmp, head, list) {
+		RB_WARN_ON(cpu_buffer, page->list.next->prev != &page->list);
+		RB_WARN_ON(cpu_buffer, page->list.prev->next != &page->list);
+	}
+
+	return 0;
+}
+
+static unsigned rb_head_size(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page->size;
+}
+
+static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
+			     unsigned nr_pages)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct buffer_page *page, *tmp;
+	unsigned long addr;
+	LIST_HEAD(pages);
+	unsigned i;
+
+	for (i = 0; i < nr_pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			goto free_pages;
+		page = (struct buffer_page *)virt_to_page(addr);
+		list_add(&page->list, &pages);
+	}
+
+	list_splice(&pages, head);
+
+	rb_check_pages(cpu_buffer);
+
+	return 0;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	return -ENOMEM;
+}
+
+static struct ring_buffer_per_cpu *
+rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int ret;
+
+	cpu_buffer = kzalloc_node(ALIGN(sizeof(*cpu_buffer), cache_line_size()),
+				  GFP_KERNEL, cpu_to_node(cpu));
+	if (!cpu_buffer)
+		return NULL;
+
+	cpu_buffer->cpu = cpu;
+	cpu_buffer->buffer = buffer;
+	spin_lock_init(&cpu_buffer->lock);
+	INIT_LIST_HEAD(&cpu_buffer->pages);
+
+	ret = rb_allocate_pages(cpu_buffer, buffer->pages);
+	if (ret < 0)
+		goto fail_free_buffer;
+
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	return cpu_buffer;
+
+ fail_free_buffer:
+	kfree(cpu_buffer);
+	return NULL;
+}
+
+static void rb_free_cpu_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct list_head *head = &cpu_buffer->pages;
+	struct buffer_page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, head, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	kfree(cpu_buffer);
+}
+
+/**
+ * ring_buffer_alloc - allocate a new ring_buffer
+ * @size: the size in bytes that is needed.
+ * @flags: attributes to set for the ring buffer.
+ *
+ * Currently the only flag that is available is the RB_FL_OVERWRITE
+ * flag. This flag means that the buffer will overwrite old data
+ * when the buffer wraps. If this flag is not set, the buffer will
+ * drop data when the tail hits the head.
+ */
+struct ring_buffer *ring_buffer_alloc(unsigned long size, unsigned flags)
+{
+	struct ring_buffer *buffer;
+	int bsize;
+	int cpu;
+
+	/* keep it in its own cache line */
+	buffer = kzalloc(ALIGN(sizeof(*buffer), cache_line_size()),
+			 GFP_KERNEL);
+	if (!buffer)
+		return NULL;
+
+	buffer->pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	buffer->flags = flags;
+
+	/* need at least two pages */
+	if (buffer->pages == 1)
+		buffer->pages++;
+
+	buffer->cpumask = cpu_possible_map;
+	buffer->cpus = nr_cpu_ids;
+
+	bsize = sizeof(void *) * nr_cpu_ids;
+	buffer->buffers = kzalloc(ALIGN(bsize, cache_line_size()),
+				  GFP_KERNEL);
+	if (!buffer->buffers)
+		goto fail_free_buffer;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		buffer->buffers[cpu] =
+			rb_allocate_cpu_buffer(buffer, cpu);
+		if (!buffer->buffers[cpu])
+			goto fail_free_buffers;
+	}
+
+	mutex_init(&buffer->mutex);
+
+	return buffer;
+
+ fail_free_buffers:
+	for_each_buffer_cpu(buffer, cpu) {
+		if (buffer->buffers[cpu])
+			rb_free_cpu_buffer(buffer->buffers[cpu]);
+	}
+	kfree(buffer->buffers);
+
+ fail_free_buffer:
+	kfree(buffer);
+	return NULL;
+}
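
A rough sketch of the intended allocation pattern (a hypothetical caller, for
illustration only; it assumes the declarations from the ring_buffer.h header
added by this patch):

	static struct ring_buffer *trace_buf;

	static int my_tracer_init(void)
	{
		/* ask for ~1MB; the size is rounded up to whole pages */
		trace_buf = ring_buffer_alloc(1024 * 1024, RB_FL_OVERWRITE);
		if (!trace_buf)
			return -ENOMEM;
		return 0;
	}

	static void my_tracer_exit(void)
	{
		ring_buffer_free(trace_buf);
	}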
+
+/**
+ * ring_buffer_free - free a ring buffer.
+ * @buffer: the buffer to free.
+ */
+void
+ring_buffer_free(struct ring_buffer *buffer)
+{
+	int cpu;
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_free_cpu_buffer(buffer->buffers[cpu]);
+
+	kfree(buffer);
+}
+
+static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
+
+static void
+rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, unsigned nr_pages)
+{
+	struct buffer_page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(&cpu_buffer->pages));
+		p = cpu_buffer->pages.next;
+		page = list_entry(p, struct buffer_page, list);
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	BUG_ON(list_empty(&cpu_buffer->pages));
+
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+
+}
+
+static void
+rb_insert_pages(struct ring_buffer_per_cpu *cpu_buffer,
+		struct list_head *pages, unsigned nr_pages)
+{
+	struct buffer_page *page;
+	struct list_head *p;
+	unsigned i;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	for (i = 0; i < nr_pages; i++) {
+		BUG_ON(list_empty(pages));
+		p = pages->next;
+		page = list_entry(p, struct buffer_page, list);
+		list_del_init(&page->list);
+		list_add_tail(&page->list, &cpu_buffer->pages);
+	}
+	rb_reset_cpu(cpu_buffer);
+
+	rb_check_pages(cpu_buffer);
+
+	atomic_dec(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_resize - resize the ring buffer
+ * @buffer: the buffer to resize.
+ * @size: the new size.
+ *
+ * The tracer is responsible for making sure that the buffer is
+ * not being used while changing the size.
+ * Note: We may be able to change the above requirement by using
+ *  RCU synchronizations.
+ *
+ * Minimum size is 2 * BUF_PAGE_SIZE.
+ *
+ * Returns -ENOMEM on failure.
+ */
+int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned nr_pages, rm_pages, new_pages;
+	struct buffer_page *page, *tmp;
+	unsigned long buffer_size;
+	unsigned long addr;
+	LIST_HEAD(pages);
+	int i, cpu;
+
+	size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+	size *= BUF_PAGE_SIZE;
+	buffer_size = buffer->pages * BUF_PAGE_SIZE;
+
+	/* we need a minimum of two pages */
+	if (size < BUF_PAGE_SIZE * 2)
+		size = BUF_PAGE_SIZE * 2;
+
+	if (size == buffer_size)
+		return size;
+
+	mutex_lock(&buffer->mutex);
+
+	nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+
+	if (size < buffer_size) {
+
+		/* easy case, just free pages */
+		BUG_ON(nr_pages >= buffer->pages);
+
+		rm_pages = buffer->pages - nr_pages;
+
+		for_each_buffer_cpu(buffer, cpu) {
+			cpu_buffer = buffer->buffers[cpu];
+			rb_remove_pages(cpu_buffer, rm_pages);
+		}
+		goto out;
+	}
+
+	/*
+	 * This is a bit more difficult. We only want to add pages
+	 * when we can allocate enough for all CPUs. We do this
+	 * by allocating all the pages and storing them on a local
+	 * linked list. If we succeed in our allocation, then we
+	 * add these pages to the cpu_buffers. Otherwise we just free
+	 * them all and return -ENOMEM;
+	 */
+	BUG_ON(nr_pages <= buffer->pages);
+	new_pages = nr_pages - buffer->pages;
+
+	for_each_buffer_cpu(buffer, cpu) {
+		for (i = 0; i < new_pages; i++) {
+			addr = __get_free_page(GFP_KERNEL);
+			if (!addr)
+				goto free_pages;
+			page = (struct buffer_page *)virt_to_page(addr);
+			list_add(&page->list, &pages);
+		}
+	}
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		rb_insert_pages(cpu_buffer, &pages, new_pages);
+	}
+
+	BUG_ON(!list_empty(&pages));
+
+ out:
+	buffer->pages = nr_pages;
+	mutex_unlock(&buffer->mutex);
+
+	return size;
+
+ free_pages:
+	list_for_each_entry_safe(page, tmp, &pages, list) {
+		list_del_init(&page->list);
+		__free_page(&page->page);
+	}
+	mutex_unlock(&buffer->mutex);
+	return -ENOMEM;
+}
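
Resizing follows the same page rounding. A sketch of a hypothetical caller
(the return value is the size actually in effect, or a negative errno):

	static void grow_buffer(struct ring_buffer *buffer)
	{
		int ret;

		/* grow to ~4MB; rounded up to whole BUF_PAGE_SIZE pages */
		ret = ring_buffer_resize(buffer, 4 * 1024 * 1024);
		if (ret < 0)
			printk(KERN_WARNING "ring buffer resize failed, old size kept\n");
	}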
+
+static inline int rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return cpu_buffer->head_page == cpu_buffer->tail_page &&
+		cpu_buffer->head == cpu_buffer->tail;
+}
+
+static inline int rb_null_event(struct ring_buffer_event *event)
+{
+	return event->type == RINGBUF_TYPE_PADDING;
+}
+
+static inline void *rb_page_index(struct buffer_page *page, unsigned index)
+{
+	void *addr = page_address(&page->page);
+
+	return addr + index;
+}
+
+static inline struct ring_buffer_event *
+rb_head_event(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	return rb_page_index(cpu_buffer->head_page,
+			     cpu_buffer->head);
+}
+
+static inline struct ring_buffer_event *
+rb_iter_head_event(struct ring_buffer_iter *iter)
+{
+	return rb_page_index(iter->head_page,
+			     iter->head);
+}
+
+/*
+ * When the tail hits the head and the buffer is in overwrite mode,
+ * the head jumps to the next page and all content on the previous
+ * page is discarded. But before doing so, we update the overrun
+ * variable of the buffer.
+ */
+static void rb_update_overflow(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned long head;
+
+	for (head = 0; head < rb_head_size(cpu_buffer);
+	     head += rb_event_length(event)) {
+
+		event = rb_page_index(cpu_buffer->head_page, head);
+		BUG_ON(rb_null_event(event));
+		/* Only count data entries */
+		if (event->type != RINGBUF_TYPE_DATA)
+			continue;
+		cpu_buffer->overrun++;
+		cpu_buffer->entries--;
+	}
+}
+
+static inline void rb_inc_page(struct ring_buffer_per_cpu *cpu_buffer,
+			       struct buffer_page **page)
+{
+	struct list_head *p = (*page)->list.next;
+
+	if (p == &cpu_buffer->pages)
+		p = p->next;
+
+	*page = list_entry(p, struct buffer_page, list);
+}
+
+static inline void
+rb_add_stamp(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts)
+{
+	cpu_buffer->tail_page->time_stamp = *ts;
+	cpu_buffer->write_stamp = *ts;
+}
+
+static void rb_reset_read_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->read_stamp = cpu_buffer->head_page->time_stamp;
+	cpu_buffer->head = 0;
+}
+
+static void
+rb_reset_iter_read_page(struct ring_buffer_iter *iter)
+{
+	iter->read_stamp = iter->head_page->time_stamp;
+	iter->head = 0;
+}
+
+/**
+ * rb_update_event - update event type and data
+ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+ * Update the type and data fields of the event. The length
+ * is the actual size that is written to the ring buffer,
+ * and with this, we can determine what to place into the
+ * data field.
+ */
+static inline void
+rb_update_event(struct ring_buffer_event *event,
+			 unsigned type, unsigned length)
+{
+	event->type = type;
+
+	switch (type) {
+
+	case RINGBUF_TYPE_PADDING:
+		break;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		event->len =
+			(RB_LEN_TIME_EXTEND + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		event->len =
+			(RB_LEN_TIME_STAMP + (RB_ALIGNMENT-1))
+			>> RB_ALIGNMENT_SHIFT;
+		break;
+
+	case RINGBUF_TYPE_DATA:
+		length -= RB_EVNT_HDR_SIZE;
+		if (length > RB_MAX_SMALL_DATA) {
+			event->len = 0;
+			event->array[0] = length;
+		} else
+			event->len =
+				(length + (RB_ALIGNMENT-1))
+				>> RB_ALIGNMENT_SHIFT;
+		break;
+	default:
+		BUG();
+	}
+}
+
+static inline unsigned rb_calculate_event_length(unsigned length)
+{
+	struct ring_buffer_event event; /* Used only for sizeof array */
+
+	/* zero length can cause confusions */
+	if (!length)
+		length = 1;
+
+	if (length > RB_MAX_SMALL_DATA)
+		length += sizeof(event.array[0]);
+
+	length += RB_EVNT_HDR_SIZE;
+	length = ALIGN(length, RB_ALIGNMENT);
+
+	return length;
+}
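
To make the length accounting concrete (a worked example with assumed
constants; the real values of RB_EVNT_HDR_SIZE, RB_ALIGNMENT and
RB_MAX_SMALL_DATA are defined earlier in the patch): with a 4-byte header and
4-byte alignment, a 13-byte payload reserves ALIGN(13 + 4, 4) = 20 bytes,
while a payload above RB_MAX_SMALL_DATA additionally reserves
sizeof(event.array[0]) bytes so the exact length can live in array[0],
e.g. ALIGN(40 + 4 + 4, 4) = 48 bytes for a 40-byte payload.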
+
+static struct ring_buffer_event *
+__rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
+		  unsigned type, unsigned long length, u64 *ts)
+{
+	struct buffer_page *head_page, *tail_page;
+	unsigned long tail;
+	struct ring_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_event *event;
+
+	tail_page = cpu_buffer->tail_page;
+	head_page = cpu_buffer->head_page;
+	tail = cpu_buffer->tail;
+
+	if (tail + length > BUF_PAGE_SIZE) {
+		struct buffer_page *next_page = tail_page;
+
+		rb_inc_page(cpu_buffer, &next_page);
+
+		if (next_page == head_page) {
+			if (!(buffer->flags & RB_FL_OVERWRITE))
+				return NULL;
+
+			/* count overflows */
+			rb_update_overflow(cpu_buffer);
+
+			rb_inc_page(cpu_buffer, &head_page);
+			cpu_buffer->head_page = head_page;
+			rb_reset_read_page(cpu_buffer);
+		}
+
+		if (tail != BUF_PAGE_SIZE) {
+			event = rb_page_index(tail_page, tail);
+			/* page padding */
+			event->type = RINGBUF_TYPE_PADDING;
+		}
+
+		tail_page->size = tail;
+		tail_page = next_page;
+		tail_page->size = 0;
+		tail = 0;
+		cpu_buffer->tail_page = tail_page;
+		cpu_buffer->tail = tail;
+		rb_add_stamp(cpu_buffer, ts);
+	}
+
+	BUG_ON(tail + length > BUF_PAGE_SIZE);
+
+	event = rb_page_index(tail_page, tail);
+	rb_update_event(event, type, length);
+
+	return event;
+}
+
+static int
+rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		  u64 *ts, u64 *delta)
+{
+	struct ring_buffer_event *event;
+	static int once;
+
+	if (unlikely(*delta > (1ULL << 59) && !once++)) {
+		printk(KERN_WARNING "Delta way too big! %llu"
+		       " ts=%llu write stamp = %llu\n",
+		       *delta, *ts, cpu_buffer->write_stamp);
+		WARN_ON(1);
+	}
+
+	/*
+	 * The delta is too big; we need to add a
+	 * new timestamp.
+	 */
+	event = __rb_reserve_next(cpu_buffer,
+				  RINGBUF_TYPE_TIME_EXTEND,
+				  RB_LEN_TIME_EXTEND,
+				  ts);
+	if (!event)
+		return -1;
+
+	/* check to see if we went to the next page */
+	if (cpu_buffer->tail) {
+		/* Still on same page, update timestamp */
+		event->time_delta = *delta & TS_MASK;
+		event->array[0] = *delta >> TS_SHIFT;
+		/* commit the time event */
+		cpu_buffer->tail +=
+			rb_event_length(event);
+		cpu_buffer->write_stamp = *ts;
+		*delta = 0;
+	}
+
+	return 0;
+}
+
+static struct ring_buffer_event *
+rb_reserve_next_event(struct ring_buffer_per_cpu *cpu_buffer,
+		      unsigned type, unsigned long length)
+{
+	struct ring_buffer_event *event;
+	u64 ts, delta;
+
+	ts = ring_buffer_time_stamp(cpu_buffer->cpu);
+
+	if (cpu_buffer->tail) {
+		delta = ts - cpu_buffer->write_stamp;
+
+		if (test_time_stamp(delta)) {
+			int ret;
+
+			ret = rb_add_time_stamp(cpu_buffer, &ts, &delta);
+			if (ret < 0)
+				return NULL;
+		}
+	} else {
+		rb_add_stamp(cpu_buffer, &ts);
+		delta = 0;
+	}
+
+	event = __rb_reserve_next(cpu_buffer, type, length, &ts);
+	if (!event)
+		return NULL;
+
+	/* If the reserve went to the next page, our delta is zero */
+	if (!cpu_buffer->tail)
+		delta = 0;
+
+	event->time_delta = delta;
+
+	return event;
+}
+
+/**
+ * ring_buffer_lock_reserve - reserve a part of the buffer
+ * @buffer: the ring buffer to reserve from
+ * @length: the length of the data to reserve (excluding event header)
+ * @flags: a pointer to save the interrupt flags
+ *
+ * Returns a reserved event on the ring buffer to copy directly to.
+ * The user of this interface will need to get the body to write into
+ * and can use the ring_buffer_event_data() interface.
+ *
+ * The length is the length of the data needed, not the event length
+ * which also includes the event header.
+ *
+ * Must be paired with ring_buffer_unlock_commit, unless NULL is returned.
+ * If NULL is returned, then nothing has been allocated or locked.
+ */
+struct ring_buffer_event *
+ring_buffer_lock_reserve(struct ring_buffer *buffer,
+			 unsigned long length,
+			 unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return NULL;
+
+	raw_local_irq_save(*flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto no_record;
+
+	length = rb_calculate_event_length(length);
+	if (length > BUF_PAGE_SIZE)
+		goto no_record;
+
+	event = rb_reserve_next_event(cpu_buffer, RINGBUF_TYPE_DATA, length);
+	if (!event)
+		goto no_record;
+
+	return event;
+
+ no_record:
+	spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(*flags);
+	return NULL;
+}
+
+static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
+		      struct ring_buffer_event *event)
+{
+	cpu_buffer->tail += rb_event_length(event);
+	cpu_buffer->tail_page->size = cpu_buffer->tail;
+	cpu_buffer->write_stamp += event->time_delta;
+	cpu_buffer->entries++;
+}
+
+/**
+ * ring_buffer_unlock_commit - commit a reserved event
+ * @buffer: The buffer to commit to
+ * @event: The event pointer to commit.
+ * @flags: the interrupt flags received from ring_buffer_lock_reserve.
+ *
+ * This commits the data to the ring buffer, and releases any locks held.
+ *
+ * Must be paired with ring_buffer_lock_reserve.
+ */
+int ring_buffer_unlock_commit(struct ring_buffer *buffer,
+			      struct ring_buffer_event *event,
+			      unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu = raw_smp_processor_id();
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	assert_spin_locked(&cpu_buffer->lock);
+
+	rb_commit(cpu_buffer, event);
+
+	spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+
+	return 0;
+}
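
A minimal reserve/commit round trip, using only the interfaces documented
above (the payload layout is a hypothetical example):

	struct my_entry {
		unsigned long ip;
		unsigned long parent_ip;
	};

	static int log_call(struct ring_buffer *buffer,
			    unsigned long ip, unsigned long parent_ip)
	{
		struct ring_buffer_event *event;
		struct my_entry *entry;
		unsigned long flags;

		event = ring_buffer_lock_reserve(buffer, sizeof(*entry), &flags);
		if (!event)
			return -EBUSY;

		entry = ring_buffer_event_data(event);
		entry->ip = ip;
		entry->parent_ip = parent_ip;

		return ring_buffer_unlock_commit(buffer, event, flags);
	}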
+
+/**
+ * ring_buffer_write - write data to the buffer without reserving
+ * @buffer: The ring buffer to write to.
+ * @length: The length of the data being written (excluding the event header)
+ * @data: The data to write to the buffer.
+ *
+ * This is like ring_buffer_lock_reserve and ring_buffer_unlock_commit as
+ * one function. If you already have the data to write to the buffer, it
+ * may be easier to simply call this function.
+ *
+ * Note, like ring_buffer_lock_reserve, the length is the length of the data
+ * and not the length of the event which would hold the header.
+ */
+int ring_buffer_write(struct ring_buffer *buffer,
+			unsigned long length,
+			void *data)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned long event_length, flags;
+	void *body;
+	int ret = -EBUSY;
+	int cpu;
+
+	if (atomic_read(&buffer->record_disabled))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	cpu = raw_smp_processor_id();
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		goto out_irq;
+
+	cpu_buffer = buffer->buffers[cpu];
+	spin_lock(&cpu_buffer->lock);
+
+	if (atomic_read(&cpu_buffer->record_disabled))
+		goto out;
+
+	event_length = rb_calculate_event_length(length);
+	event = rb_reserve_next_event(cpu_buffer,
+				      RINGBUF_TYPE_DATA, event_length);
+	if (!event)
+		goto out;
+
+	body = rb_event_data(event);
+
+	memcpy(body, data, length);
+
+	rb_commit(cpu_buffer, event);
+
+	ret = 0;
+ out:
+	spin_unlock(&cpu_buffer->lock);
+ out_irq:
+	local_irq_restore(flags);
+
+	return ret;
+}
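
When the payload already sits in a local buffer, the one-shot write is
simpler (again a hypothetical caller, sketch only):

	static int log_string(struct ring_buffer *buffer, const char *msg)
	{
		/* copies the bytes and commits in one call; returns -EBUSY on failure */
		return ring_buffer_write(buffer, strlen(msg) + 1, (void *)msg);
	}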
+
+/**
+ * ring_buffer_lock - lock the ring buffer
+ * @buffer: The ring buffer to lock
+ * @flags: The place to store the interrupt flags
+ *
+ * This locks all the per CPU buffers.
+ *
+ * Must be unlocked by ring_buffer_unlock.
+ */
+void ring_buffer_lock(struct ring_buffer *buffer, unsigned long *flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	local_irq_save(*flags);
+
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		spin_lock(&cpu_buffer->lock);
+	}
+}
+
+/**
+ * ring_buffer_unlock - unlock a locked buffer
+ * @buffer: The locked buffer to unlock
+ * @flags: The interrupt flags received by ring_buffer_lock
+ */
+void ring_buffer_unlock(struct ring_buffer *buffer, unsigned long flags)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	for (cpu = buffer->cpus - 1; cpu >= 0; cpu--) {
+		if (!cpu_isset(cpu, buffer->cpumask))
+			continue;
+		cpu_buffer = buffer->buffers[cpu];
+		spin_unlock(&cpu_buffer->lock);
+	}
+
+	local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_record_disable - stop all writes into the buffer
+ * @buffer: The ring buffer to stop writes to.
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable(struct ring_buffer *buffer)
+{
+	atomic_inc(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable(struct ring_buffer *buffer)
+{
+	atomic_dec(&buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_disable_cpu - stop all writes into the cpu_buffer
+ * @buffer: The ring buffer to stop writes to.
+ * @cpu: The CPU buffer to stop
+ *
+ * This prevents all writes to the buffer. Any attempt to write
+ * to the buffer after this will fail and return NULL.
+ *
+ * The caller should call synchronize_sched() after this.
+ */
+void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_inc(&cpu_buffer->record_disabled);
+}
+
+/**
+ * ring_buffer_record_enable_cpu - enable writes to the buffer
+ * @buffer: The ring buffer to enable writes
+ * @cpu: The CPU to enable.
+ *
+ * Note, multiple disables will need the same number of enables
+ * to truly enable the writing (much like preempt_disable).
+ */
+void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	cpu_buffer = buffer->buffers[cpu];
+	atomic_dec(&cpu_buffer->record_disabled);
+}
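
The disable/enable calls are meant to bracket any operation that needs a
quiescent buffer, along these lines (a sketch, not taken from a tracer):

	static void with_quiescent_buffer(struct ring_buffer *buffer)
	{
		ring_buffer_record_disable(buffer);
		synchronize_sched();	/* let writers already inside the buffer finish */

		/* ... read, reset or resize the buffer here ... */

		ring_buffer_record_enable(buffer);
	}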
+
+/**
+ * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the entries from.
+ */
+unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->entries;
+}
+
+/**
+ * ring_buffer_overrun_cpu - get the number of overruns in a cpu_buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to get the number of overruns from
+ */
+unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return cpu_buffer->overrun;
+}
+
+/**
+ * ring_buffer_entries - get the number of entries in a buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of entries in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_entries(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long entries = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		entries += cpu_buffer->entries;
+	}
+
+	return entries;
+}
+
+/**
+ * ring_buffer_overruns - get the number of overruns in the buffer
+ * @buffer: The ring buffer
+ *
+ * Returns the total number of overruns in the ring buffer
+ * (all CPU entries)
+ */
+unsigned long ring_buffer_overruns(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long overruns = 0;
+	int cpu;
+
+	/* if you care about this being correct, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		overruns += cpu_buffer->overrun;
+	}
+
+	return overruns;
+}
+
+/**
+ * ring_buffer_iter_reset - reset an iterator
+ * @iter: The iterator to reset
+ *
+ * Resets the iterator, so that it will start from the beginning
+ * again.
+ */
+void ring_buffer_iter_reset(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	iter->head_page = cpu_buffer->head_page;
+	iter->head = cpu_buffer->head;
+	rb_reset_iter_read_page(iter);
+}
+
+/**
+ * ring_buffer_iter_empty - check if an iterator has no more to read
+ * @iter: The iterator to check
+ */
+int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	cpu_buffer = iter->cpu_buffer;
+
+	return iter->head_page == cpu_buffer->tail_page &&
+		iter->head == cpu_buffer->tail;
+}
+
+static void
+rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+		     struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RINGBUF_TYPE_PADDING:
+		return;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		cpu_buffer->read_stamp += delta;
+		return;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RINGBUF_TYPE_DATA:
+		cpu_buffer->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void
+rb_update_iter_read_stamp(struct ring_buffer_iter *iter,
+			  struct ring_buffer_event *event)
+{
+	u64 delta;
+
+	switch (event->type) {
+	case RINGBUF_TYPE_PADDING:
+		return;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		delta = event->array[0];
+		delta <<= TS_SHIFT;
+		delta += event->time_delta;
+		iter->read_stamp += delta;
+		return;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		return;
+
+	case RINGBUF_TYPE_DATA:
+		iter->read_stamp += event->time_delta;
+		return;
+
+	default:
+		BUG();
+	}
+	return;
+}
+
+static void rb_advance_head(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (cpu_buffer->head >= cpu_buffer->head_page->size) {
+		BUG_ON(cpu_buffer->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		return;
+	}
+
+	event = rb_head_event(cpu_buffer);
+
+	if (event->type == RINGBUF_TYPE_DATA)
+		cpu_buffer->entries--;
+
+	length = rb_event_length(event);
+
+	/*
+	 * This should not be called to advance the header if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((cpu_buffer->head_page == cpu_buffer->tail_page) &&
+	       (cpu_buffer->head + length > cpu_buffer->tail));
+
+	rb_update_read_stamp(cpu_buffer, event);
+
+	cpu_buffer->head += length;
+
+	/* check for end of page */
+	if ((cpu_buffer->head >= cpu_buffer->head_page->size) &&
+	    (cpu_buffer->head_page != cpu_buffer->tail_page))
+		rb_advance_head(cpu_buffer);
+}
+
+static void rb_advance_iter(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+	unsigned length;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+	/*
+	 * Check if we are at the end of the buffer.
+	 */
+	if (iter->head >= iter->head_page->size) {
+		BUG_ON(iter->head_page == cpu_buffer->tail_page);
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		return;
+	}
+
+	event = rb_iter_head_event(iter);
+
+	length = rb_event_length(event);
+
+	/*
+	 * This should not be called to advance the header if we are
+	 * at the tail of the buffer.
+	 */
+	BUG_ON((iter->head_page == cpu_buffer->tail_page) &&
+	       (iter->head + length > cpu_buffer->tail));
+
+	rb_update_iter_read_stamp(iter, event);
+
+	iter->head += length;
+
+	/* check for end of page padding */
+	if ((iter->head >= iter->head_page->size) &&
+	    (iter->head_page != cpu_buffer->tail_page))
+		rb_advance_iter(iter);
+}
+
+/**
+ * ring_buffer_peek - peek at the next event to be read
+ * @buffer: The ring buffer to read
+ * @cpu: The cpu to peek at
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not consume the data.
+ */
+struct ring_buffer_event *
+ring_buffer_peek(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_head_event(cpu_buffer);
+
+	switch (event->type) {
+	case RINGBUF_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &cpu_buffer->head_page);
+		rb_reset_read_page(cpu_buffer);
+		goto again;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		/* Internal data, OK to advance */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_head(cpu_buffer);
+		goto again;
+
+	case RINGBUF_TYPE_DATA:
+		if (ts) {
+			*ts = cpu_buffer->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_iter_peek - peek at the next event to be read
+ * @iter: The ring buffer iterator
+ * @ts: The timestamp counter of this event.
+ *
+ * This will return the event that will be read next, but does
+ * not increment the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer *buffer;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (ring_buffer_iter_empty(iter))
+		return NULL;
+
+	cpu_buffer = iter->cpu_buffer;
+	buffer = cpu_buffer->buffer;
+
+ again:
+	if (rb_per_cpu_empty(cpu_buffer))
+		return NULL;
+
+	event = rb_iter_head_event(iter);
+
+	switch (event->type) {
+	case RINGBUF_TYPE_PADDING:
+		rb_inc_page(cpu_buffer, &iter->head_page);
+		rb_reset_iter_read_page(iter);
+		goto again;
+
+	case RINGBUF_TYPE_TIME_EXTEND:
+		/* Internal data, OK to advance */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RINGBUF_TYPE_TIME_STAMP:
+		/* FIXME: not implemented */
+		rb_advance_iter(iter);
+		goto again;
+
+	case RINGBUF_TYPE_DATA:
+		if (ts) {
+			*ts = iter->read_stamp + event->time_delta;
+			ring_buffer_normalize_time_stamp(cpu_buffer->cpu, ts);
+		}
+		return event;
+
+	default:
+		BUG();
+	}
+
+	return NULL;
+}
+
+/**
+ * ring_buffer_consume - return an event and consume it
+ * @buffer: The ring buffer to get the next event from
+ * @cpu: The per CPU buffer to read from
+ * @ts: The timestamp of the event read (may be NULL)
+ *
+ * Returns the next event in the ring buffer, and that event is consumed.
+ * Meaning, that sequential reads will keep returning a different event,
+ * and eventually empty the ring buffer if the producer is slower.
+ */
+struct ring_buffer_event *
+ring_buffer_consume(struct ring_buffer *buffer, int cpu, u64 *ts)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_event *event;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	event = ring_buffer_peek(buffer, cpu, ts);
+	if (!event)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+	rb_advance_head(cpu_buffer);
+
+	return event;
+}
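
A consuming reader then becomes a simple per-CPU loop (hypothetical drain
function, sketch only):

	static void drain_cpu(struct ring_buffer *buffer, int cpu)
	{
		struct ring_buffer_event *event;
		void *data;
		u64 ts;

		while ((event = ring_buffer_consume(buffer, cpu, &ts))) {
			data = ring_buffer_event_data(event);
			/* hand "data" and the normalized timestamp "ts" to the output path */
		}
	}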
+
+/**
+ * ring_buffer_read_start - start a non consuming read of the buffer
+ * @buffer: The ring buffer to read from
+ * @cpu: The cpu buffer to iterate over
+ *
+ * This starts up an iteration through the buffer. It also disables
+ * the recording to the buffer until the reading is finished.
+ * This prevents the reading from being corrupted. This is not
+ * a consuming read, so a producer is not expected.
+ *
+ * Must be paired with ring_buffer_read_finish.
+ */
+struct ring_buffer_iter *
+ring_buffer_read_start(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_iter *iter;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return NULL;
+
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return NULL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	iter->cpu_buffer = cpu_buffer;
+
+	atomic_inc(&cpu_buffer->record_disabled);
+	synchronize_sched();
+
+	spin_lock(&cpu_buffer->lock);
+	iter->head = cpu_buffer->head;
+	iter->head_page = cpu_buffer->head_page;
+	rb_reset_iter_read_page(iter);
+	spin_unlock(&cpu_buffer->lock);
+
+	return iter;
+}
+
+/**
+ * ring_buffer_read_finish - finish reading the iterator of the buffer
+ * @iter: The iterator retrieved by ring_buffer_read_start
+ *
+ * This re-enables the recording to the buffer, and frees the
+ * iterator.
+ */
+void
+ring_buffer_read_finish(struct ring_buffer_iter *iter)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+	atomic_dec(&cpu_buffer->record_disabled);
+	kfree(iter);
+}
+
+/**
+ * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * @iter: The ring buffer iterator
+ * @ts: The time stamp of the event read.
+ *
+ * This reads the next event in the ring buffer and increments the iterator.
+ */
+struct ring_buffer_event *
+ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+{
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_iter_peek(iter, ts);
+	if (!event)
+		return NULL;
+
+	rb_advance_iter(iter);
+
+	return event;
+}
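
The non-consuming path wraps an iterator between read_start and read_finish
(sketch of a hypothetical dump function):

	static void dump_cpu(struct ring_buffer *buffer, int cpu)
	{
		struct ring_buffer_iter *iter;
		struct ring_buffer_event *event;
		u64 ts;

		iter = ring_buffer_read_start(buffer, cpu);
		if (!iter)
			return;

		while ((event = ring_buffer_read(iter, &ts))) {
			/* inspect ring_buffer_event_data(event) here */
		}

		ring_buffer_read_finish(iter);
	}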
+
+/**
+ * ring_buffer_size - return the size of the ring buffer (in bytes)
+ * @buffer: The ring buffer.
+ */
+unsigned long ring_buffer_size(struct ring_buffer *buffer)
+{
+	return BUF_PAGE_SIZE * buffer->pages;
+}
+
+static void
+rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	cpu_buffer->head_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+	cpu_buffer->tail_page
+		= list_entry(cpu_buffer->pages.next, struct buffer_page, list);
+
+	cpu_buffer->head = cpu_buffer->tail = 0;
+	cpu_buffer->overrun = 0;
+	cpu_buffer->entries = 0;
+}
+
+/**
+ * ring_buffer_reset_cpu - reset a ring buffer per CPU buffer
+ * @buffer: The ring buffer to reset a per cpu buffer of
+ * @cpu: The CPU buffer to be reset
+ */
+void ring_buffer_reset_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
+	unsigned long flags;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return;
+
+	raw_local_irq_save(flags);
+	spin_lock(&cpu_buffer->lock);
+
+	rb_reset_cpu(cpu_buffer);
+
+	spin_unlock(&cpu_buffer->lock);
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * ring_buffer_reset - reset a ring buffer
+ * @buffer: The ring buffer to reset all cpu buffers
+ */
+void ring_buffer_reset(struct ring_buffer *buffer)
+{
+	unsigned long flags;
+	int cpu;
+
+	ring_buffer_lock(buffer, &flags);
+
+	for_each_buffer_cpu(buffer, cpu)
+		rb_reset_cpu(buffer->buffers[cpu]);
+
+	ring_buffer_unlock(buffer, flags);
+}
+
+/**
+ * ring_buffer_empty - is the ring buffer empty?
+ * @buffer: The ring buffer to test
+ */
+int ring_buffer_empty(struct ring_buffer *buffer)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	int cpu;
+
+	/* yes this is racy, but if you don't like the race, lock the buffer */
+	for_each_buffer_cpu(buffer, cpu) {
+		cpu_buffer = buffer->buffers[cpu];
+		if (!rb_per_cpu_empty(cpu_buffer))
+			return 0;
+	}
+	return 1;
+}
+
+/**
+ * ring_buffer_empty_cpu - is a cpu buffer of a ring buffer empty?
+ * @buffer: The ring buffer
+ * @cpu: The CPU buffer to test
+ */
+int ring_buffer_empty_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpu_isset(cpu, buffer->cpumask))
+		return 1;
+
+	cpu_buffer = buffer->buffers[cpu];
+	return rb_per_cpu_empty(cpu_buffer);
+}
+
+/**
+ * ring_buffer_swap_cpu - swap a CPU buffer between two ring buffers
+ * @buffer_a: One buffer to swap with
+ * @buffer_b: The other buffer to swap with
+ * @cpu: The per CPU buffer to swap
+ *
+ * This function is useful for tracers that want to take a "snapshot"
+ * of a CPU buffer and have another backup buffer lying around.
+ * It is expected that the tracer handles the cpu buffer not being
+ * used at the moment.
+ */
+int ring_buffer_swap_cpu(struct ring_buffer *buffer_a,
+			 struct ring_buffer *buffer_b, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer_a;
+	struct ring_buffer_per_cpu *cpu_buffer_b;
+
+	if (!cpu_isset(cpu, buffer_a->cpumask) ||
+	    !cpu_isset(cpu, buffer_b->cpumask))
+		return -EINVAL;
+
+	/* At least make sure the two buffers are somewhat the same */
+	if (buffer_a->size != buffer_b->size ||
+	    buffer_a->pages != buffer_b->pages)
+		return -EINVAL;
+
+	cpu_buffer_a = buffer_a->buffers[cpu];
+	cpu_buffer_b = buffer_b->buffers[cpu];
+
+	/*
+	 * We can't do a synchronize_sched here because this
+	 * function can be called in atomic context.
+	 * Normally this will be called from the same CPU as cpu.
+	 * If not it's up to the caller to protect this.
+	 */
+	atomic_inc(&cpu_buffer_a->record_disabled);
+	atomic_inc(&cpu_buffer_b->record_disabled);
+
+	buffer_a->buffers[cpu] = cpu_buffer_b;
+	buffer_b->buffers[cpu] = cpu_buffer_a;
+
+	cpu_buffer_b->buffer = buffer_a;
+	cpu_buffer_a->buffer = buffer_b;
+
+	atomic_dec(&cpu_buffer_a->record_disabled);
+	atomic_dec(&cpu_buffer_b->record_disabled);
+
+	return 0;
+}
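
The swap is aimed at snapshot-style tracers that keep a spare buffer around.
A sketch (both buffers are assumed to be owned by the tracer):

	static void snapshot_this_cpu(struct ring_buffer *live,
				      struct ring_buffer *snap)
	{
		int cpu = raw_smp_processor_id();

		/* preserve the live CPU buffer in "snap", then recycle the old spare */
		if (ring_buffer_swap_cpu(snap, live, cpu) == 0)
			ring_buffer_reset_cpu(live, cpu);
	}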
+
Index: linux-trace.git/kernel/trace/Kconfig
===================================================================
--- linux-trace.git.orig/kernel/trace/Kconfig	2008-09-27 01:58:49.000000000 -0400
+++ linux-trace.git/kernel/trace/Kconfig	2008-09-27 01:59:06.000000000 -0400
@@ -10,10 +10,14 @@ config HAVE_DYNAMIC_FTRACE
 config TRACER_MAX_TRACE
 	bool
 
+config RING_BUFFER
+	bool
+
 config TRACING
 	bool
 	select DEBUG_FS
 	select STACKTRACE
+	select RING_BUFFER
 
 config FTRACE
 	bool "Kernel Function Tracer"
Index: linux-trace.git/kernel/trace/Makefile
===================================================================
--- linux-trace.git.orig/kernel/trace/Makefile	2008-09-27 01:58:49.000000000 -0400
+++ linux-trace.git/kernel/trace/Makefile	2008-09-27 01:59:06.000000000 -0400
@@ -11,6 +11,7 @@ obj-y += trace_selftest_dynamic.o
 endif
 
 obj-$(CONFIG_FTRACE) += libftrace.o
+obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
 
 obj-$(CONFIG_TRACING) += trace.o
 obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-29 16:10                 ` [PATCH v10 Golden] " Steven Rostedt
@ 2008-09-29 16:11                   ` Steven Rostedt
  2008-09-29 23:35                   ` Mathieu Desnoyers
  1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-29 16:11 UTC (permalink / raw)
  To: LKML
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra, Andrew Morton,
	prasad, Linus Torvalds, Mathieu Desnoyers, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


Ingo,

I will add this patch to my linux-tip and then I will start porting ftrace 
over to it in an incremental fashion.

-- Steve




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-29 16:10                 ` [PATCH v10 Golden] " Steven Rostedt
  2008-09-29 16:11                   ` Steven Rostedt
@ 2008-09-29 23:35                   ` Mathieu Desnoyers
  2008-09-30  0:01                     ` Steven Rostedt
  1 sibling, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-09-29 23:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
[...]
> +/*
> + * This hack stolen from mm/slob.c.
> + * We can store per page timing information in the page frame of the page.
> + * Thanks to Peter Zijlstra for suggesting this idea.
> + */
> +struct buffer_page {
> +	union {
> +		struct {
> +			unsigned long	 flags;		/* mandatory */
> +			atomic_t	 _count;	/* mandatory */
> +			u64		 time_stamp;	/* page time stamp */
> +			unsigned	 size;		/* size of page data */
> +			struct list_head list;		/* list of free pages */
> +		};
> +		struct page page;
> +	};
> +};
> +

Hi Steven,

You should have a look at mm/slob.c free_slob_page(). I think your page
free will generate a "bad_page" call due to mapping != NULL and mapcount
!= 0. I just ran into this in my own code. :)

Regards,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-29 23:35                   ` Mathieu Desnoyers
@ 2008-09-30  0:01                     ` Steven Rostedt
  2008-09-30  0:03                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30  0:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> You should have a look at mm/slob.c free_slob_page(). I think your page
> free will generate a "bad_page" call due to mapping != NULL and mapcount
> != 0. I just ran into this in my own code. :)


Hi Mathieu!

Thanks! I must have been lucky somehow not to trigger this :-/

I'll add an update patch for this.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30  0:01                     ` Steven Rostedt
@ 2008-09-30  0:03                       ` Mathieu Desnoyers
  2008-09-30  0:12                         ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30  0:03 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> 
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > You should have a look at mm/slob.c free_slob_page(). I think your page
> > free will generate a "bad_page" call due to mapping != NULL and mapcount
> > != 0. I just ran into this in my own code. :)
> 
> 
> Hi Mathieu!
> 
> Thanks! I must have been lucky some how not to trigger this :-/
> 

My guess is that you never free your buffers in your test cases. I don't
know if it was expected; probably not if your code is built into the
kernel.

Mathieu

> I'll add an update patch for this.
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30  0:03                       ` Mathieu Desnoyers
@ 2008-09-30  0:12                         ` Steven Rostedt
  2008-09-30  3:46                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30  0:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> > 
> > Thanks! I must have been lucky some how not to trigger this :-/
> > 
> 
> My guess is that you never free your buffers in your test cases. I don't
> know if it was expected; probably not if your code is built into the
> kernel.

Actually my resize does free the buffers and I did test this. I probably
never ran the trace when testing the freeing, which means those pointers
might luckily not have been changed.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30  0:12                         ` Steven Rostedt
@ 2008-09-30  3:46                           ` Mathieu Desnoyers
  2008-09-30  4:00                             ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-09-30  3:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> > > 
> > > Thanks! I must have been lucky some how not to trigger this :-/
> > > 
> > 
> > My guess is that you never free your buffers in your test cases. I don't
> > know if it was expected; probably not if your code is built into the
> > kernel.
> 
> Actually my resize does free the buffers and I did test this. I probably 
> never ran the trace when testing the freeing which means those pointers 
> could have luckily not have been changed.
> 
> -- Steve
> 

I also got some corruption of the offset field in the struct page I use.
I think it might be related to the fact that I don't set the PG_private
bit (slob does set it when the page is in its free pages list). However,
given I'd like to pass the buffer pages to disk I/O and to network
sockets and still keep the ability to re-use them once the I/O has been
performed, I wonder where I should put my

                       struct list_head list;  /* linked list of buf pages */
                       size_t offset;          /* page offset in the buffer */

fields ? Any ideas ?

They are currently in :

struct buf_page {
        union {
                struct {
                        unsigned long flags;    /* mandatory */
                        atomic_t _count;        /* mandatory */
                        union {                 /* mandatory */
                                atomic_t _mapcount;
                                struct {
                                        u16 inuse;
                                        u16 objects;
                                };
                        };
                        struct list_head list;  /* linked list of buf pages */
                        size_t offset;          /* page offset in the buffer */
                };
                struct page page;
        };
};

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30  3:46                           ` Mathieu Desnoyers
@ 2008-09-30  4:00                             ` Steven Rostedt
  2008-09-30 15:20                               ` Jonathan Corbet
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30  4:00 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	Andrew Morton, prasad, Linus Torvalds, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Mon, 29 Sep 2008, Mathieu Desnoyers wrote:
> 
> I also got some corruption of the offset field in the struct page I use.
> I think it might be related to the fact that I don't set the PG_private
> bit (slob does set it when the page is in its free pages list). However,
> given I'd like to pass the buffer pages to disk I/O and for network

Ah, I believe the disk IO uses the page frame. That might be a bit more 
difficult to pass the data to disk and still keep information on the
page frame.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30  4:00                             ` Steven Rostedt
@ 2008-09-30 15:20                               ` Jonathan Corbet
  2008-09-30 15:54                                 ` Peter Zijlstra
  0 siblings, 1 reply; 102+ messages in thread
From: Jonathan Corbet @ 2008-09-30 15:20 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, LKML, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Tue, 30 Sep 2008 00:00:11 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> Ah, I believe the disk IO uses the page frame. That might be a bit more 
> difficult to pass the data to disk and still keep information on the
> page frame.

Perhaps I'm speaking out of turn, but I have to wonder: am I the only one
who gets uncomfortable looking at these hacks to overload struct page?  It
seems fragile as all hell; woe to he who tries to make a change to struct
page someday and has to track all of this stuff down.

Are the savings gained by using struct page this way really worth the
added complexity?

jon

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 15:20                               ` Jonathan Corbet
@ 2008-09-30 15:54                                 ` Peter Zijlstra
  2008-09-30 16:38                                   ` Linus Torvalds
  0 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-30 15:54 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Steven Rostedt, Mathieu Desnoyers, LKML, Ingo Molnar,
	Thomas Gleixner, Andrew Morton, prasad, Linus Torvalds,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Tue, 2008-09-30 at 09:20 -0600, Jonathan Corbet wrote:
> On Tue, 30 Sep 2008 00:00:11 -0400 (EDT)
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Ah, I believe the disk IO uses the page frame. That might be a bit more 
> > difficult to pass the data to disk and still keep information on the
> > page frame.
> 
> Perhaps I'm speaking out of turn, but I have to wonder: am I the only one
> who gets uncomfortable looking at these hacks to overload struct page?  It
> seems fragile as all hell; woe to he who tries to make a change to struct
> page someday and has to track all of this stuff down.
> 
> Are the savings gained by using struct page this way really worth the
> added complexity?

It's not that complex IMHO; the thing that is ugly is those struct page
overloads. What we could do is try to sanitize the regular struct page
and pull all these things in.

Because the only reason people are doing these overloads is that
struct page in mm_types.h is becoming an unreadable mess.

Trouble is, looking at it I see no easy way out,


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 15:54                                 ` Peter Zijlstra
@ 2008-09-30 16:38                                   ` Linus Torvalds
  2008-09-30 16:48                                     ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2008-09-30 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jonathan Corbet, Steven Rostedt, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo



On Tue, 30 Sep 2008, Peter Zijlstra wrote:
> 
> Its not that complex IMHO, the thing that is ugly are those struct page
> overloads, what we could do is try and sanitize the regular struct page
> and pull all these things in.

That's not the scary part. The scary part is that somebody may well want 
to access the trace buffer pages in complex ways.

If you mmap them, for example, you can use VM_PFNMAP to make sure that 
nobody should ever look at the "struct page", but if you want to do things 
like direct-to-disk IO on the trace pages (either with splice() or with 
some kind of in-kernel IO logic), then you're officially screwed.

> Because the only reason people are doing these overloads is because
> struct page in mm_types.h is becomming an unreadable mess.

The "unreadable mess" has exactly the same issues, though: people need to 
realize that when you overload fields in the page structure, you can then 
NEVER EVER use those pages for any other thing. 

For the internal VM code, that's ok. The VM knows that a page is either an 
anonymous page or a file mapping etc, and the overloading wrt mm_types.h 
is explicit. The same goes for SL*B, although it does the overloading 
differently.

Trace buffers are different, though. Do people realize that doing the 
overloading means that you never EVER can use those buffers for anything 
else? Do people realize that it means that splice() and friends are out of 
the question?

> Trouble is, looking at it I see no easy way out,

Quite frankly, we could just put it at the head of the page itself. Having 
a "whole page" for the trace data is not possible anyway, since the trace 
header itself will always eat 8 bytes.

And I do think it would potentially be a better model. Or at least safer.

			Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 16:38                                   ` Linus Torvalds
@ 2008-09-30 16:48                                     ` Steven Rostedt
  2008-09-30 17:00                                       ` Peter Zijlstra
  2008-09-30 17:01                                       ` Linus Torvalds
  0 siblings, 2 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Tue, 30 Sep 2008, Linus Torvalds wrote:
> 
> Trace buffers are different, though. Do people realize that doing the 
> overloading means that you never EVER can use those buffers for anything 
> else? Do people realize that it means that splice() and friends are out of 
> the question?
> 
> > Trouble is, looking at it I see no easy way out,
> 
> Quite frankly, we could just put it at the head of the page itself. Having 
> a "whole page" for the trace data is not possible anyway, since the trace 
> header itself will always eat 8 bytes.
> 
> And I do think it would potentially be a better model. Or at least safer.

Actually, looking at the code, there is no reason I need to keep this in 
the frame buffer itself. I've also encapsulated the accesses to the 
incrementing of the pointers so it would be trivial to try other 
approaches.

The problem we had with the big array struct is that we can want large 
buffers and to do that with pointers means we would need to either come up 
with a large allocator or use vmap.

But I just realized that I could also just make a linked list of page
pointers and do the exact same thing without having to worry about page 
frames.  Again, the way I coded this up, it is quite trivial to replace 
the handling of the pages with other schemes.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 16:48                                     ` Steven Rostedt
@ 2008-09-30 17:00                                       ` Peter Zijlstra
  2008-09-30 17:41                                         ` Steven Rostedt
  2008-09-30 17:01                                       ` Linus Torvalds
  1 sibling, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-30 17:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Tue, 2008-09-30 at 12:48 -0400, Steven Rostedt wrote:
> On Tue, 30 Sep 2008, Linus Torvalds wrote:
> > 
> > Trace buffers are different, though. Do people realize that doing the 
> > overloading means that you never EVER can use those buffers for anything 
> > else? Do people realize that it means that splice() and friends are out of 
> > the question?
> > 
> > > Trouble is, looking at it I see no easy way out,
> > 
> > Quite frankly, we could just put it at the head of the page itself. Having 
> > a "whole page" for the trace data is not possible anyway, since the trace 
> > header itself will always eat 8 bytes.
> > 
> > And I do think it would potentially be a better model. Or at least safer.
> 
> Actually, looking at the code, there is no reason I need to keep this in 
> the frame buffer itself. I've also encapsulated the accesses to the 
> incrementing of the pointers so it would be trivial to try other 
> approaches.
> 
> The problem we had with the big array struct is that we can want large 
> buffers and to do that with pointers means we would need to either come up 
> with a large allocator or use vmap.
> 
> But I just realized that I could also just make a link list of page 
> pointers and do the exact same thing without having to worry about page 
> frames.  Again, the way I coded this up, it is quite trivial to replace 
> the handling of the pages with other schemes.

The list_head in the page frame should be available regardless of
splice() stuffs.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 16:48                                     ` Steven Rostedt
  2008-09-30 17:00                                       ` Peter Zijlstra
@ 2008-09-30 17:01                                       ` Linus Torvalds
  2008-10-01 15:14                                         ` [PATCH] ring_buffer: allocate buffer page pointer Steven Rostedt
  1 sibling, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2008-09-30 17:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo



On Tue, 30 Sep 2008, Steven Rostedt wrote:
> 
> But I just realized that I could also just make a link list of page 
> pointers and do the exact same thing without having to worry about page 
> frames.  Again, the way I coded this up, it is quite trivial to replace 
> the handling of the pages with other schemes.

That might be the best option.

Yes, doing it in the 'struct page' itself is obviously going to save us 
some memory over having specially allocated page headers, but it's not 
like we'd expect to have _that_ many of these, and having a separate 
structure is actually good in that it also would make it simpler/clearer 
when/if you want to add larger pages (or other non-page allocations) into 
the mix.

For example, if somebody really wants bigger areas, they can allocate them 
with vmalloc and/or multi-page allocations, and then add them as easily to 
the list of pages as if it was a normal page. Doing the same with playing 
tricks on 'struct page' would be pretty damn painful.
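
Something like this ought to be all it takes once the page bookkeeping is a
separately allocated structure (rough sketch, not real code - BIG_AREA_SIZE
and cpu_buffer are just placeholders here):

	struct buffer_page *bpage;

	bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
	if (!bpage)
		return -ENOMEM;
	/* any allocator works, as long as it returns a kernel pointer */
	bpage->page = vmalloc(BIG_AREA_SIZE);
	if (!bpage->page) {
		kfree(bpage);
		return -ENOMEM;
	}
	list_add(&bpage->list, &cpu_buffer->pages);
	/* the free path would have to know which allocator was used, of course */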

		Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 17:00                                       ` Peter Zijlstra
@ 2008-09-30 17:41                                         ` Steven Rostedt
  2008-09-30 17:49                                           ` Peter Zijlstra
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Tue, 30 Sep 2008, Peter Zijlstra wrote:
> > 
> > Actually, looking at the code, there is no reason I need to keep this in 
> > the frame buffer itself. I've also encapsulated the accesses to the 
> > incrementing of the pointers so it would be trivial to try other 
> > approaches.
> > 
> > The problem we had with the big array struct is that we can want large 
> > buffers and to do that with pointers means we would need to either come up 
> > with a large allocator or use vmap.
> > 
> > But I just realized that I could also just make a link list of page 
> > pointers and do the exact same thing without having to worry about page 
> > frames.  Again, the way I coded this up, it is quite trivial to replace 
> > the handling of the pages with other schemes.
> 
> The list_head in the page frame should be available regardless of
> splice() stuffs.

Regardless, there's more info we want to store for each page than the list 
head, especially when we start converting this to lockless. I'd rather get 
out of the overlaying of the page frames; it's nice to save the space, but 
it really scares the hell out of me. I can just imagine this blowing up if we 
redo the paging, and I dislike this transparent coupling between the 
tracer buffer and the pages.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 17:41                                         ` Steven Rostedt
@ 2008-09-30 17:49                                           ` Peter Zijlstra
  2008-09-30 17:56                                             ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Peter Zijlstra @ 2008-09-30 17:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Tue, 2008-09-30 at 13:41 -0400, Steven Rostedt wrote:
> On Tue, 30 Sep 2008, Peter Zijlstra wrote:
> > > 
> > > Actually, looking at the code, there is no reason I need to keep this in 
> > > the frame buffer itself. I've also encapsulated the accesses to the 
> > > incrementing of the pointers so it would be trivial to try other 
> > > approaches.
> > > 
> > > The problem we had with the big array struct is that we can want large 
> > > buffers and to do that with pointers means we would need to either come up 
> > > with a large allocator or use vmap.
> > > 
> > > But I just realized that I could also just make a link list of page 
> > > pointers and do the exact same thing without having to worry about page 
> > > frames.  Again, the way I coded this up, it is quite trivial to replace 
> > > the handling of the pages with other schemes.
> > 
> > The list_head in the page frame should be available regardless of
> > splice() stuffs.
> 
> Regardless, there's more info we want to store for each page than the list 
> head, especially when we start converting this to lockless. I'd rather get 
> out of the overlaying of the page frames; it's nice to save the space, but 
> it really scares the hell out of me. I can just imagine this blowing up if we 
> redo the paging, and I dislike this transparent coupling between the 
> tracer buffer and the pages.

The problem with storing the page link information inside the page is
that it doesn't transfer to another address space, so if you do indeed
mmap these pages, then the link information is bogus.

Of course, in such a situation you could ignore these headers, but
somehow that doesn't sound too appealing.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 17:49                                           ` Peter Zijlstra
@ 2008-09-30 17:56                                             ` Steven Rostedt
  2008-09-30 18:02                                               ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Tue, 30 Sep 2008, Peter Zijlstra wrote:
> 
> The problem with storing the page link information inside the page is
> that it doesn't transfer to another address space, so if you do indeed
> mmap these pages, then the link information is bogus.
> 
> Of course, in such a situation you could ignore these headers, but
> somehow that doesn't sound too appealing.


No, that's not what I'm proposing. I'm proposing to allocate a page_header 
structure for every page we allocate, and make a linked list of them.
In other words:


struct ring_buffer_per_cpu {
	[...]
	struct list_head pages;
	[...]
};

struct buffer_page {
	[...];
	void *page;
	struct list_head list;
	[...];
};

In ring_buffer_allocate_cpu:

	struct buffer_page *bpage;
	struct unsigned long addr;

	[...]

	for every page() {
		bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
		addr = get_free_page();
		bpage->page = (void *)addr;
		list_add(&bpage->list, &cpu_buffer->pages);
	}


Obviously I need to add the error checking, but you get the idea. Here I do 
not need to change any of the later logic, because we are still dealing 
with the buffer_page. I only need to update the way we index into the page, 
which is already encapsulated in its own function.
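
The error handling would look something like this (sketch only; nr_pages, 
cpu_buffer and the free_pages label are assumed from the surrounding 
function, and "tmp" is just for the example):

	struct buffer_page *bpage, *tmp;
	unsigned long addr;
	unsigned i;

	for (i = 0; i < nr_pages; i++) {
		bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
		if (!bpage)
			goto free_pages;
		addr = __get_free_page(GFP_KERNEL);
		if (!addr) {
			kfree(bpage);
			goto free_pages;
		}
		bpage->page = (void *)addr;
		list_add(&bpage->list, &cpu_buffer->pages);
	}
	return 0;

 free_pages:
	/* unwind whatever was already linked in */
	list_for_each_entry_safe(bpage, tmp, &cpu_buffer->pages, list) {
		list_del_init(&bpage->list);
		free_page((unsigned long)bpage->page);
		kfree(bpage);
	}
	return -ENOMEM;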

-- Steve



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH v10 Golden] Unified trace buffer
  2008-09-30 17:56                                             ` Steven Rostedt
@ 2008-09-30 18:02                                               ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-09-30 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Tue, 30 Sep 2008, Steven Rostedt wrote:
> 
> In ring_buffer_allocate_cpu:
> 
> 	struct buffer_page *bpage;
> 	struct unsigned long addr;

Of course we would not be declaring a "struct unsigned long" ;-)

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] ring_buffer: allocate buffer page pointer
  2008-09-30 17:01                                       ` Linus Torvalds
@ 2008-10-01 15:14                                         ` Steven Rostedt
  2008-10-01 17:36                                           ` Mathieu Desnoyers
                                                             ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-01 15:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Jonathan Corbet, Mathieu Desnoyers, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


The current method of overlaying the page frame as the buffer page pointer
can be very dangerous and limits our ability to do other things with
a page from the buffer, like send it off to disk.

This patch allocates the buffer_page instead of overlaying the page's
page frame. The use of the buffer_page has hardly changed due to this.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-01 09:37:23.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-01 11:03:16.000000000 -0400
@@ -115,16 +115,10 @@ void *ring_buffer_event_data(struct ring
  * Thanks to Peter Zijlstra for suggesting this idea.
  */
 struct buffer_page {
-	union {
-		struct {
-			unsigned long	 flags;		/* mandatory */
-			atomic_t	 _count;	/* mandatory */
-			u64		 time_stamp;	/* page time stamp */
-			unsigned	 size;		/* size of page data */
-			struct list_head list;		/* list of free pages */
-		};
-		struct page page;
-	};
+	u64		 time_stamp;	/* page time stamp */
+	unsigned	 size;		/* size of page data */
+	struct list_head list;		/* list of free pages */
+	void *page;			/* Actual data page */
 };
 
 /*
@@ -133,9 +127,9 @@ struct buffer_page {
  */
 static inline void free_buffer_page(struct buffer_page *bpage)
 {
-	reset_page_mapcount(&bpage->page);
-	bpage->page.mapping = NULL;
-	__free_page(&bpage->page);
+	if (bpage->page)
+		__free_page(bpage->page);
+	kfree(bpage);
 }
 
 /*
@@ -237,11 +231,16 @@ static int rb_allocate_pages(struct ring
 	unsigned i;
 
 	for (i = 0; i < nr_pages; i++) {
+		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
+				    GFP_KERNEL, cpu_to_node(cpu));
+		if (!page)
+			goto free_pages;
+		list_add(&page->list, &pages);
+
 		addr = __get_free_page(GFP_KERNEL);
 		if (!addr)
 			goto free_pages;
-		page = (struct buffer_page *)virt_to_page(addr);
-		list_add(&page->list, &pages);
+		page->page = (void *)addr;
 	}
 
 	list_splice(&pages, head);
@@ -262,6 +261,7 @@ static struct ring_buffer_per_cpu *
 rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
+	struct buffer_page *page;
 	unsigned long addr;
 	int ret;
 
@@ -275,10 +275,17 @@ rb_allocate_cpu_buffer(struct ring_buffe
 	spin_lock_init(&cpu_buffer->lock);
 	INIT_LIST_HEAD(&cpu_buffer->pages);
 
+	page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
+			    GFP_KERNEL, cpu_to_node(cpu));
+	if (!page)
+		goto fail_free_buffer;
+
+	cpu_buffer->reader_page = page;
 	addr = __get_free_page(GFP_KERNEL);
 	if (!addr)
-		goto fail_free_buffer;
-	cpu_buffer->reader_page = (struct buffer_page *)virt_to_page(addr);
+		goto fail_free_reader;
+	page->page = (void *)addr;
+
 	INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
 	cpu_buffer->reader_page->size = 0;
 
@@ -523,11 +530,16 @@ int ring_buffer_resize(struct ring_buffe
 
 	for_each_buffer_cpu(buffer, cpu) {
 		for (i = 0; i < new_pages; i++) {
+			page = kzalloc_node(ALIGN(sizeof(*page),
+						  cache_line_size()),
+					    GFP_KERNEL, cpu_to_node(cpu));
+			if (!page)
+				goto free_pages;
+			list_add(&page->list, &pages);
 			addr = __get_free_page(GFP_KERNEL);
 			if (!addr)
 				goto free_pages;
-			page = (struct buffer_page *)virt_to_page(addr);
-			list_add(&page->list, &pages);
+			page->page = (void *)addr;
 		}
 	}
 
@@ -567,9 +579,7 @@ static inline int rb_null_event(struct r
 
 static inline void *rb_page_index(struct buffer_page *page, unsigned index)
 {
-	void *addr = page_address(&page->page);
-
-	return addr + index;
+	return page->page + index;
 }
 
 static inline struct ring_buffer_event *



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-01 15:14                                         ` [PATCH] ring_buffer: allocate buffer page pointer Steven Rostedt
@ 2008-10-01 17:36                                           ` Mathieu Desnoyers
  2008-10-01 17:49                                             ` Steven Rostedt
  2008-10-01 18:21                                           ` Mathieu Desnoyers
  2008-10-02  8:50                                           ` Ingo Molnar
  2 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-01 17:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> The current method of overlaying the page frame as the buffer page pointer
> can be very dangerous and limits our ability to do other things with
> a page from the buffer, like send it off to disk.
> 
> This patch allocates the buffer_page instead of overlaying the page's
> page frame. The use of the buffer_page has hardly changed due to this.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 
> Index: linux-tip.git/kernel/trace/ring_buffer.c
> ===================================================================
> --- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-01 09:37:23.000000000 -0400
> +++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-01 11:03:16.000000000 -0400
> @@ -115,16 +115,10 @@ void *ring_buffer_event_data(struct ring
>   * Thanks to Peter Zijlstra for suggesting this idea.
>   */
>  struct buffer_page {
> -	union {
> -		struct {
> -			unsigned long	 flags;		/* mandatory */
> -			atomic_t	 _count;	/* mandatory */
> -			u64		 time_stamp;	/* page time stamp */
> -			unsigned	 size;		/* size of page data */
> -			struct list_head list;		/* list of free pages */
> -		};
> -		struct page page;
> -	};
> +	u64		 time_stamp;	/* page time stamp */
> +	unsigned	 size;		/* size of page data */
> +	struct list_head list;		/* list of free pages */
> +	void *page;			/* Actual data page */
>  };
>  
>  /*
> @@ -133,9 +127,9 @@ struct buffer_page {
>   */
>  static inline void free_buffer_page(struct buffer_page *bpage)
>  {
> -	reset_page_mapcount(&bpage->page);
> -	bpage->page.mapping = NULL;
> -	__free_page(&bpage->page);
> +	if (bpage->page)
> +		__free_page(bpage->page);
> +	kfree(bpage);
>  }
>  
>  /*
> @@ -237,11 +231,16 @@ static int rb_allocate_pages(struct ring
>  	unsigned i;
>  
>  	for (i = 0; i < nr_pages; i++) {
> +		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
> +				    GFP_KERNEL, cpu_to_node(cpu));
> +		if (!page)
> +			goto free_pages;
> +		list_add(&page->list, &pages);
> +
>  		addr = __get_free_page(GFP_KERNEL);
>  		if (!addr)
>  			goto free_pages;
> -		page = (struct buffer_page *)virt_to_page(addr);
> -		list_add(&page->list, &pages);
> +		page->page = (void *)addr;
>  	}
>  
>  	list_splice(&pages, head);
> @@ -262,6 +261,7 @@ static struct ring_buffer_per_cpu *
>  rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
>  {
>  	struct ring_buffer_per_cpu *cpu_buffer;
> +	struct buffer_page *page;
>  	unsigned long addr;
>  	int ret;
>  
> @@ -275,10 +275,17 @@ rb_allocate_cpu_buffer(struct ring_buffe
>  	spin_lock_init(&cpu_buffer->lock);
>  	INIT_LIST_HEAD(&cpu_buffer->pages);
>  
> +	page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
> +			    GFP_KERNEL, cpu_to_node(cpu));

Hi Steven,

I understand that you want to allocate these struct buffer_page in
memory local to a given cpu node, which is great, but why do you feel
you need to align them on cache_line_size() ?

Hrm.. you put the timestamp in there, so I guess you're concerned about
having a writer on one CPU, a reader on another, and the fact that you
will have cache line bouncing because of that.

Note that if you put the timestamp and the unused bytes in a tiny header
at the beginning of the page, you

1 - make this information directly accessible for disk, network I/O
without any other abstraction layer.
2 - won't have to do such alignment on the struct buffer_page, because
it will only be read once it's been allocated.

My 2 cents ;)

Mathieu

> +	if (!page)
> +		goto fail_free_buffer;
> +
> +	cpu_buffer->reader_page = page;
>  	addr = __get_free_page(GFP_KERNEL);
>  	if (!addr)
> -		goto fail_free_buffer;
> -	cpu_buffer->reader_page = (struct buffer_page *)virt_to_page(addr);
> +		goto fail_free_reader;
> +	page->page = (void *)addr;
> +
>  	INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
>  	cpu_buffer->reader_page->size = 0;
>  
> @@ -523,11 +530,16 @@ int ring_buffer_resize(struct ring_buffe
>  
>  	for_each_buffer_cpu(buffer, cpu) {
>  		for (i = 0; i < new_pages; i++) {
> +			page = kzalloc_node(ALIGN(sizeof(*page),
> +						  cache_line_size()),
> +					    GFP_KERNEL, cpu_to_node(cpu));
> +			if (!page)
> +				goto free_pages;
> +			list_add(&page->list, &pages);
>  			addr = __get_free_page(GFP_KERNEL);
>  			if (!addr)
>  				goto free_pages;
> -			page = (struct buffer_page *)virt_to_page(addr);
> -			list_add(&page->list, &pages);
> +			page->page = (void *)addr;
>  		}
>  	}
>  
> @@ -567,9 +579,7 @@ static inline int rb_null_event(struct r
>  
>  static inline void *rb_page_index(struct buffer_page *page, unsigned index)
>  {
> -	void *addr = page_address(&page->page);
> -
> -	return addr + index;
> +	return page->page + index;
>  }
>  
>  static inline struct ring_buffer_event *
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-01 17:36                                           ` Mathieu Desnoyers
@ 2008-10-01 17:49                                             ` Steven Rostedt
  0 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-01 17:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Wed, 1 Oct 2008, Mathieu Desnoyers wrote:
> 
> I understand that you want to allocate these struct buffer_page in
> memory local to a given cpu node, which is great, but why do you feel
> you need to align them on cache_line_size() ?
> 
> Hrm.. you put the timestamp in there, so I guess you're concerned about
> having a writer on one CPU, a reader on another, and the fact that you
> will have cache line bouncing because of that.
> 
> Note that if you put the timestamp and the unused bytes in a tiny header
> at the beginning of the page, you
> 
> 1 - make this information directly accessible for disk, network I/O
> without any other abstraction layer.
> 2 - won't have to do such alignment on the struct buffer_page, because
> it will only be read once it's been allocated.
> 

That was the approach I actually started with. But someone (I think
Peter) asked me to remove it.

Who knows, perhaps I can put it back. It's not that hard to do. This is
why I used BUF_PAGE_SIZE to determine the size of the buffer page.
Right now BUF_PAGE_SIZE == PAGE_SIZE, but if we do add a header then
it will be BUF_PAGE_SIZE == PAGE_SIZE - sizeof(header).
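
i.e. something along these lines (illustration only, the names are made up):

	struct buffer_data_page {
		u64		time_stamp;	/* page time stamp */
		unsigned	commit;		/* bytes used on this page */
		unsigned char	data[];		/* the actual trace data */
	};

	#define BUF_PAGE_SIZE	(PAGE_SIZE - offsetof(struct buffer_data_page, data))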

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-01 15:14                                         ` [PATCH] ring_buffer: allocate buffer page pointer Steven Rostedt
  2008-10-01 17:36                                           ` Mathieu Desnoyers
@ 2008-10-01 18:21                                           ` Mathieu Desnoyers
  2008-10-02  8:50                                           ` Ingo Molnar
  2 siblings, 0 replies; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-01 18:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Ingo Molnar, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> The current method of overlaying the page frame as the buffer page pointer
> can be very dangerous and limits our ability to do other things with
> a page from the buffer, like send it off to disk.
> 
> This patch allocates the buffer_page instead of overlaying the page's
> page frame. The use of the buffer_page has hardly changed due to this.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 
> Index: linux-tip.git/kernel/trace/ring_buffer.c
> ===================================================================
> --- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-01 09:37:23.000000000 -0400
> +++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-01 11:03:16.000000000 -0400
> @@ -115,16 +115,10 @@ void *ring_buffer_event_data(struct ring
>   * Thanks to Peter Zijlstra for suggesting this idea.
>   */
>  struct buffer_page {
> -	union {
> -		struct {
> -			unsigned long	 flags;		/* mandatory */
> -			atomic_t	 _count;	/* mandatory */
> -			u64		 time_stamp;	/* page time stamp */
> -			unsigned	 size;		/* size of page data */
> -			struct list_head list;		/* list of free pages */
> -		};
> -		struct page page;
> -	};
> +	u64		 time_stamp;	/* page time stamp */
> +	unsigned	 size;		/* size of page data */
> +	struct list_head list;		/* list of free pages */
> +	void *page;			/* Actual data page */
>  };
>  
>  /*
> @@ -133,9 +127,9 @@ struct buffer_page {
>   */
>  static inline void free_buffer_page(struct buffer_page *bpage)
>  {
> -	reset_page_mapcount(&bpage->page);
> -	bpage->page.mapping = NULL;
> -	__free_page(&bpage->page);
> +	if (bpage->page)
> +		__free_page(bpage->page);
> +	kfree(bpage);
>  }
>  
>  /*
> @@ -237,11 +231,16 @@ static int rb_allocate_pages(struct ring
>  	unsigned i;
>  
>  	for (i = 0; i < nr_pages; i++) {
> +		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
> +				    GFP_KERNEL, cpu_to_node(cpu));
> +		if (!page)
> +			goto free_pages;
> +		list_add(&page->list, &pages);
> +
>  		addr = __get_free_page(GFP_KERNEL);

You could probably use alloc_pages_node instead here...
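
i.e. something like this (untested sketch; "p" is just a local for the
example):

	struct page *p;

	p = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
	if (!p)
		goto free_pages;
	page->page = page_address(p);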

Mathieu

>  		if (!addr)
>  			goto free_pages;
> -		page = (struct buffer_page *)virt_to_page(addr);
> -		list_add(&page->list, &pages);
> +		page->page = (void *)addr;
>  	}
>  
>  	list_splice(&pages, head);
> @@ -262,6 +261,7 @@ static struct ring_buffer_per_cpu *
>  rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
>  {
>  	struct ring_buffer_per_cpu *cpu_buffer;
> +	struct buffer_page *page;
>  	unsigned long addr;
>  	int ret;
>  
> @@ -275,10 +275,17 @@ rb_allocate_cpu_buffer(struct ring_buffe
>  	spin_lock_init(&cpu_buffer->lock);
>  	INIT_LIST_HEAD(&cpu_buffer->pages);
>  
> +	page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
> +			    GFP_KERNEL, cpu_to_node(cpu));
> +	if (!page)
> +		goto fail_free_buffer;
> +
> +	cpu_buffer->reader_page = page;
>  	addr = __get_free_page(GFP_KERNEL);
>  	if (!addr)
> -		goto fail_free_buffer;
> -	cpu_buffer->reader_page = (struct buffer_page *)virt_to_page(addr);
> +		goto fail_free_reader;
> +	page->page = (void *)addr;
> +
>  	INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
>  	cpu_buffer->reader_page->size = 0;
>  
> @@ -523,11 +530,16 @@ int ring_buffer_resize(struct ring_buffe
>  
>  	for_each_buffer_cpu(buffer, cpu) {
>  		for (i = 0; i < new_pages; i++) {
> +			page = kzalloc_node(ALIGN(sizeof(*page),
> +						  cache_line_size()),
> +					    GFP_KERNEL, cpu_to_node(cpu));
> +			if (!page)
> +				goto free_pages;
> +			list_add(&page->list, &pages);
>  			addr = __get_free_page(GFP_KERNEL);
>  			if (!addr)
>  				goto free_pages;
> -			page = (struct buffer_page *)virt_to_page(addr);
> -			list_add(&page->list, &pages);
> +			page->page = (void *)addr;
>  		}
>  	}
>  
> @@ -567,9 +579,7 @@ static inline int rb_null_event(struct r
>  
>  static inline void *rb_page_index(struct buffer_page *page, unsigned index)
>  {
> -	void *addr = page_address(&page->page);
> -
> -	return addr + index;
> +	return page->page + index;
>  }
>  
>  static inline struct ring_buffer_event *
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-01 15:14                                         ` [PATCH] ring_buffer: allocate buffer page pointer Steven Rostedt
  2008-10-01 17:36                                           ` Mathieu Desnoyers
  2008-10-01 18:21                                           ` Mathieu Desnoyers
@ 2008-10-02  8:50                                           ` Ingo Molnar
  2008-10-02  8:51                                             ` Ingo Molnar
  2008-10-02  9:06                                             ` [PATCH] ring_buffer: allocate buffer page pointer Andrew Morton
  2 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02  8:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> The current method of overlaying the page frame as the buffer page pointer
> can be very dangerous and limits our ability to do other things with
> a page from the buffer, like send it off to disk.
> 
> This patch allocates the buffer_page instead of overlaying the page's
> page frame. The use of the buffer_page has hardly changed due to this.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> ---
>  kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)

applied to tip/tracing/ftrace, with the extended changelog below - i 
think this commit warrants that extra mention.

	Ingo

--------------->
>From da78331b4ced2763322d732ac5ba275965853bde Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 1 Oct 2008 10:52:51 -0400
Subject: [PATCH] ftrace: type cast filter+verifier

The mmiotrace map had a bug that would typecast the entry from
the trace to the wrong type. That is a known danger of C typecasts,
there's absolutely zero checking done on them.

Help that problem a bit by using a GCC extension to implement a
type filter that restricts the types that a trace record can be
cast into, and by adding a dynamic check (in debug mode) to verify
the type of the entry.

This patch adds a macro to assign all entries of ftrace using the type
of the variable and checking the entry id. The typecasts are now done
in the macro for only those types that it knows about, which should
be all the types that are allowed to be read from the tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/trace/trace.c           |   85 ++++++++++++++++++++++++++++------------
 kernel/trace/trace.h           |   42 ++++++++++++++++++++
 kernel/trace/trace_mmiotrace.c |   14 +++++--
 3 files changed, 112 insertions(+), 29 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c163406..948f7d8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 	}
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		seq_print_ip_sym(s, field->ip, sym_flags);
 		trace_seq_puts(s, " (");
@@ -1363,8 +1365,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 	}
 	case TRACE_CTX:
 	case TRACE_WAKE: {
-		struct ctx_switch_entry *field =
-			(struct ctx_switch_entry *)entry;
+		struct ctx_switch_entry *field;
+
+		trace_assign_type(field, entry);
 
 		T = field->next_state < sizeof(state_to_char) ?
 			state_to_char[field->next_state] : 'X';
@@ -1384,7 +1387,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 		break;
 	}
 	case TRACE_SPECIAL: {
-		struct special_entry *field = (struct special_entry *)entry;
+		struct special_entry *field;
+
+		trace_assign_type(field, entry);
 
 		trace_seq_printf(s, "# %ld %ld %ld\n",
 				 field->arg1,
@@ -1393,7 +1398,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 		break;
 	}
 	case TRACE_STACK: {
-		struct stack_entry *field = (struct stack_entry *)entry;
+		struct stack_entry *field;
+
+		trace_assign_type(field, entry);
 
 		for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
 			if (i)
@@ -1404,7 +1411,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 		break;
 	}
 	case TRACE_PRINT: {
-		struct print_entry *field = (struct print_entry *)entry;
+		struct print_entry *field;
+
+		trace_assign_type(field, entry);
 
 		seq_print_ip_sym(s, field->ip, sym_flags);
 		trace_seq_printf(s, ": %s", field->buf);
@@ -1454,7 +1463,9 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		ret = seq_print_ip_sym(s, field->ip, sym_flags);
 		if (!ret)
@@ -1480,8 +1491,9 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_CTX:
 	case TRACE_WAKE: {
-		struct ctx_switch_entry *field =
-			(struct ctx_switch_entry *)entry;
+		struct ctx_switch_entry *field;
+
+		trace_assign_type(field, entry);
 
 		S = field->prev_state < sizeof(state_to_char) ?
 			state_to_char[field->prev_state] : 'X';
@@ -1501,7 +1513,9 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_SPECIAL: {
-		struct special_entry *field = (struct special_entry *)entry;
+		struct special_entry *field;
+
+		trace_assign_type(field, entry);
 
 		ret = trace_seq_printf(s, "# %ld %ld %ld\n",
 				 field->arg1,
@@ -1512,7 +1526,9 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_STACK: {
-		struct stack_entry *field = (struct stack_entry *)entry;
+		struct stack_entry *field;
+
+		trace_assign_type(field, entry);
 
 		for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
 			if (i) {
@@ -1531,7 +1547,9 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_PRINT: {
-		struct print_entry *field = (struct print_entry *)entry;
+		struct print_entry *field;
+
+		trace_assign_type(field, entry);
 
 		seq_print_ip_sym(s, field->ip, sym_flags);
 		trace_seq_printf(s, ": %s", field->buf);
@@ -1562,7 +1580,9 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
 
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		ret = trace_seq_printf(s, "%x %x\n",
 					field->ip,
@@ -1573,8 +1593,9 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_CTX:
 	case TRACE_WAKE: {
-		struct ctx_switch_entry *field =
-			(struct ctx_switch_entry *)entry;
+		struct ctx_switch_entry *field;
+
+		trace_assign_type(field, entry);
 
 		S = field->prev_state < sizeof(state_to_char) ?
 			state_to_char[field->prev_state] : 'X';
@@ -1596,7 +1617,9 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_SPECIAL:
 	case TRACE_STACK: {
-		struct special_entry *field = (struct special_entry *)entry;
+		struct special_entry *field;
+
+		trace_assign_type(field, entry);
 
 		ret = trace_seq_printf(s, "# %ld %ld %ld\n",
 				 field->arg1,
@@ -1607,7 +1630,9 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
 		break;
 	}
 	case TRACE_PRINT: {
-		struct print_entry *field = (struct print_entry *)entry;
+		struct print_entry *field;
+
+		trace_assign_type(field, entry);
 
 		trace_seq_printf(s, "# %lx %s", field->ip, field->buf);
 		if (entry->flags & TRACE_FLAG_CONT)
@@ -1648,7 +1673,9 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
 
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		SEQ_PUT_HEX_FIELD_RET(s, field->ip);
 		SEQ_PUT_HEX_FIELD_RET(s, field->parent_ip);
@@ -1656,8 +1683,9 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_CTX:
 	case TRACE_WAKE: {
-		struct ctx_switch_entry *field =
-			(struct ctx_switch_entry *)entry;
+		struct ctx_switch_entry *field;
+
+		trace_assign_type(field, entry);
 
 		S = field->prev_state < sizeof(state_to_char) ?
 			state_to_char[field->prev_state] : 'X';
@@ -1676,7 +1704,9 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_SPECIAL:
 	case TRACE_STACK: {
-		struct special_entry *field = (struct special_entry *)entry;
+		struct special_entry *field;
+
+		trace_assign_type(field, entry);
 
 		SEQ_PUT_HEX_FIELD_RET(s, field->arg1);
 		SEQ_PUT_HEX_FIELD_RET(s, field->arg2);
@@ -1705,15 +1735,18 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
 
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		SEQ_PUT_FIELD_RET(s, field->ip);
 		SEQ_PUT_FIELD_RET(s, field->parent_ip);
 		break;
 	}
 	case TRACE_CTX: {
-		struct ctx_switch_entry *field =
-			(struct ctx_switch_entry *)entry;
+		struct ctx_switch_entry *field;
+
+		trace_assign_type(field, entry);
 
 		SEQ_PUT_FIELD_RET(s, field->prev_pid);
 		SEQ_PUT_FIELD_RET(s, field->prev_prio);
@@ -1725,7 +1758,9 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
 	}
 	case TRACE_SPECIAL:
 	case TRACE_STACK: {
-		struct special_entry *field = (struct special_entry *)entry;
+		struct special_entry *field;
+
+		trace_assign_type(field, entry);
 
 		SEQ_PUT_FIELD_RET(s, field->arg1);
 		SEQ_PUT_FIELD_RET(s, field->arg2);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index a921ba5..f02042d 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -177,6 +177,48 @@ struct trace_array {
 	struct trace_array_cpu	*data[NR_CPUS];
 };
 
+#define FTRACE_CMP_TYPE(var, type) \
+	__builtin_types_compatible_p(typeof(var), type *)
+
+#undef IF_ASSIGN
+#define IF_ASSIGN(var, entry, etype, id)		\
+	if (FTRACE_CMP_TYPE(var, etype)) {		\
+		var = (typeof(var))(entry);		\
+		WARN_ON(id && (entry)->type != id);	\
+		break;					\
+	}
+
+/* Will cause compile errors if type is not found. */
+extern void __ftrace_bad_type(void);
+
+/*
+ * The trace_assign_type is a verifier that the entry type is
+ * the same as the type being assigned. To add new types simply
+ * add a line with the following format:
+ *
+ * IF_ASSIGN(var, ent, type, id);
+ *
+ *  Where "type" is the trace type that includes the trace_entry
+ *  as the "ent" item. And "id" is the trace identifier that is
+ *  used in the trace_type enum.
+ *
+ *  If the type can have more than one id, then use zero.
+ */
+#define trace_assign_type(var, ent)					\
+	do {								\
+		IF_ASSIGN(var, ent, struct ftrace_entry, TRACE_FN);	\
+		IF_ASSIGN(var, ent, struct ctx_switch_entry, 0);	\
+		IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \
+		IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK);	\
+		IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT);	\
+		IF_ASSIGN(var, ent, struct special_entry, 0);		\
+		IF_ASSIGN(var, ent, struct trace_mmiotrace_rw,		\
+			  TRACE_MMIO_RW);				\
+		IF_ASSIGN(var, ent, struct trace_mmiotrace_map,		\
+			  TRACE_MMIO_MAP);				\
+		IF_ASSIGN(var, ent, struct trace_boot, TRACE_BOOT);	\
+		__ftrace_bad_type();					\
+	} while (0)
 
 /* Return values for print_line callback */
 enum print_line_t {
diff --git a/kernel/trace/trace_mmiotrace.c b/kernel/trace/trace_mmiotrace.c
index 1a266aa..0e819f4 100644
--- a/kernel/trace/trace_mmiotrace.c
+++ b/kernel/trace/trace_mmiotrace.c
@@ -178,15 +178,17 @@ print_out:
 static enum print_line_t mmio_print_rw(struct trace_iterator *iter)
 {
 	struct trace_entry *entry = iter->ent;
-	struct trace_mmiotrace_rw *field =
-		(struct trace_mmiotrace_rw *)entry;
-	struct mmiotrace_rw *rw	= &field->rw;
+	struct trace_mmiotrace_rw *field;
+	struct mmiotrace_rw *rw;
 	struct trace_seq *s	= &iter->seq;
 	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret = 1;
 
+	trace_assign_type(field, entry);
+	rw = &field->rw;
+
 	switch (rw->opcode) {
 	case MMIO_READ:
 		ret = trace_seq_printf(s,
@@ -222,13 +224,17 @@ static enum print_line_t mmio_print_rw(struct trace_iterator *iter)
 static enum print_line_t mmio_print_map(struct trace_iterator *iter)
 {
 	struct trace_entry *entry = iter->ent;
-	struct mmiotrace_map *m	= (struct mmiotrace_map *)entry;
+	struct trace_mmiotrace_map *field;
+	struct mmiotrace_map *m;
 	struct trace_seq *s	= &iter->seq;
 	unsigned long long t	= ns2usecs(iter->ts);
 	unsigned long usec_rem	= do_div(t, 1000000ULL);
 	unsigned secs		= (unsigned long)t;
 	int ret;
 
+	trace_assign_type(field, entry);
+	m = &field->map;
+
 	switch (m->opcode) {
 	case MMIO_PROBE:
 		ret = trace_seq_printf(s,

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-02  8:50                                           ` Ingo Molnar
@ 2008-10-02  8:51                                             ` Ingo Molnar
  2008-10-02  9:05                                               ` [PATCH] ring-buffer: fix build error Ingo Molnar
  2008-10-02  9:06                                             ` [PATCH] ring_buffer: allocate buffer page pointer Andrew Morton
  1 sibling, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02  8:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Ingo Molnar <mingo@elte.hu> wrote:

> applied to tip/tracing/ftrace, with the extended changlog below - i 
> think this commit warrants that extra mention.

that was for the type filter commit. The 3 patches i've picked up into 
tip/tracing/ring-buffer are:

 b6eeea4: ftrace: preempt disable over interrupt disable
 52abc82: ring_buffer: allocate buffer page pointer
 da78331: ftrace: type cast filter+verifier

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] ring-buffer: fix build error
  2008-10-02  8:51                                             ` Ingo Molnar
@ 2008-10-02  9:05                                               ` Ingo Molnar
  2008-10-02  9:38                                                 ` [boot crash] " Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02  9:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > applied to tip/tracing/ftrace, with the extended changlog below - i 
> > think this commit warrants that extra mention.
> 
> that was for the type filter commit. The 3 patches i've picked up into 
> tip/tracing/ring-buffer are:
> 
>  b6eeea4: ftrace: preempt disable over interrupt disable
>  52abc82: ring_buffer: allocate buffer page pointer
>  da78331: ftrace: type cast filter+verifier

trivial build fix below.

	Ingo

>From 339ce9af3e6cbc02442b0b356c1ecb80a8ae92fb Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 2 Oct 2008 11:04:14 +0200
Subject: [PATCH] ring-buffer: fix build error
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

fix:

 kernel/trace/ring_buffer.c: In function ‘rb_allocate_pages’:
 kernel/trace/ring_buffer.c:235: error: ‘cpu’ undeclared (first use in this function)
 kernel/trace/ring_buffer.c:235: error: (Each undeclared identifier is reported only once
 kernel/trace/ring_buffer.c:235: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9814571..54a3098 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 
 	for (i = 0; i < nr_pages; i++) {
 		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu));
+				    GFP_KERNEL, cpu_to_node(i));
 		if (!page)
 			goto free_pages;
 		list_add(&page->list, &pages);

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-02  8:50                                           ` Ingo Molnar
  2008-10-02  8:51                                             ` Ingo Molnar
@ 2008-10-02  9:06                                             ` Andrew Morton
  2008-10-02  9:41                                               ` Ingo Molnar
  2008-10-02 13:06                                               ` Steven Rostedt
  1 sibling, 2 replies; 102+ messages in thread
From: Andrew Morton @ 2008-10-02  9:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

On Thu, 2 Oct 2008 10:50:30 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > The current method of overlaying the page frame as the buffer page pointer
> > can be very dangerous and limits our ability to do other things with
> > a page from the buffer, like send it off to disk.
> > 
> > This patch allocates the buffer_page instead of overlaying the page's
> > page frame. The use of the buffer_page has hardly changed due to this.
> > 
> > Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> > ---
> >  kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
> >  1 file changed, 32 insertions(+), 22 deletions(-)
> 
> applied to tip/tracing/ftrace, with the extended changlog below - i 
> think this commit warrants that extra mention.
> 
> 	Ingo
> 
> --------------->
> >From da78331b4ced2763322d732ac5ba275965853bde Mon Sep 17 00:00:00 2001
> From: Steven Rostedt <rostedt@goodmis.org>
> Date: Wed, 1 Oct 2008 10:52:51 -0400
> Subject: [PATCH] ftrace: type cast filter+verifier
> 
> The mmiotrace map had a bug that would typecast the entry from
> the trace to the wrong type. That is a known danger of C typecasts,
> there's absolutely zero checking done on them.
> 
> Help that problem a bit by using a GCC extension to implement a
> type filter that restricts the types that a trace record can be
> cast into, and by adding a dynamic check (in debug mode) to verify
> the type of the entry.
> 
> This patch adds a macro to assign all entries of ftrace using the type
> of the variable and checking the entry id. The typecasts are now done
> in the macro for only those types that it knows about, which should
> be all the types that are allowed to be read from the tracer.
> 

I'm somewhat at a loss here because I'm unable to find any version of
kernel/trace/trace.c which looks anything like the one which is being
patched, but...

> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
>  	}
>  	switch (entry->type) {
>  	case TRACE_FN: {
> -		struct ftrace_entry *field = (struct ftrace_entry *)entry;

Why was this code using a cast in the first place?  It should be using
entry->some_field_i_dont_have_here?  That was the whole point in using 
the anonymous union in struct trace_entry?


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02  9:05                                               ` [PATCH] ring-buffer: fix build error Ingo Molnar
@ 2008-10-02  9:38                                                 ` Ingo Molnar
  2008-10-02 13:16                                                   ` Steven Rostedt
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02  9:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Ingo Molnar <mingo@elte.hu> wrote:

> > that was for the type filter commit. The 3 patches i've picked up into 
> > tip/tracing/ring-buffer are:
> > 
> >  b6eeea4: ftrace: preempt disable over interrupt disable
> >  52abc82: ring_buffer: allocate buffer page pointer
> >  da78331: ftrace: type cast filter+verifier
> 
> trivial build fix below.

ok, these latest ring-buffer updates cause more serious trouble, i just 
got this boot crash on a testbox:

[    0.324003] calling  tracer_alloc_buffers+0x0/0x14a @ 1
[    0.328008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[    0.332001] IP: [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] PGD 0 
[    0.332001] Oops: 0000 [1] SMP 
[    0.332001] CPU 0 
[    0.332001] Modules linked in:
[    0.332001] Pid: 1, comm: swapper Not tainted 2.6.27-rc8-tip-01064-gd163d6b-dirty #1
[    0.332001] RIP: 0010:[<ffffffff8027d28b>]  [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] RSP: 0018:ffff88003f9d7de0  EFLAGS: 00010287
[    0.332001] RAX: 0000000000000000 RBX: ffffffff80b08404 RCX: 0000000000000067
[    0.332001] RDX: 0000000000000004 RSI: 00000000000080d0 RDI: ffffffffffffffc0
[    0.332001] RBP: ffff88003f9d7e80 R08: ffff88003f8010b4 R09: 000000000003db02
[    0.332001] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003f801600
[    0.332001] R13: 0000000000000004 R14: ffff88003f801580 R15: ffff88003f801618
[    0.332001] FS:  0000000000000000(0000) GS:ffffffff80a68280(0000) knlGS:0000000000000000
[    0.332001] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.332001] CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
[    0.332001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.332001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.332001] Process swapper (pid: 1, threadinfo ffff88003f9d6000, task ffff88003f9d8000)
[    0.332001] Stack:  ffff88003f9d7df0 ffff88003f9d7e40 0000000000000283 ffffffff80b08404
[    0.332001]  ffffffff80b08404 ffff88003f801598 0000000000000000 ffff88003f801598
[    0.332001]  ffff88003f801580 0000016000000000 ffff88003f801600 ffff88003f9a2a40
[    0.332001] Call Trace:
[    0.332001]  [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a
[    0.332001]  [<ffffffff80a95f67>] tracer_alloc_buffers+0x26/0x14a
[    0.332001]  [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a
[    0.332001]  [<ffffffff80209056>] do_one_initcall+0x56/0x144
[    0.332001]  [<ffffffff80a87d4a>] ? native_smp_prepare_cpus+0x2aa/0x2ef
[    0.332001]  [<ffffffff80a7c8ce>] kernel_init+0x69/0x20e
[    0.332001]  [<ffffffff8020d4e9>] child_rip+0xa/0x11
[    0.332001]  [<ffffffff80257896>] ? __atomic_notifier_call_chain+0xd/0xf
[    0.332001]  [<ffffffff80a7c865>] ? kernel_init+0x0/0x20e
[    0.332001]  [<ffffffff8020d4df>] ? child_rip+0x0/0x11
[    0.332001] Code: 48 8b 05 d9 b2 7e 00 49 63 d5 48 63 0d 1b b2 7e 00 48 8b 9d 78 ff ff ff be d0 80 00 00 48 8b 04 d0 48 89 cf 48 83 c1 27 48 f7 df <48> 8b 40 08 48 21 cf 8b 14 03 e8 4e b5 02 00 48 85 c0 48 89 c3 
[    0.332001] RIP  [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001]  RSP <ffff88003f9d7de0>
[    0.332001] CR2: 0000000000000008
[    0.332002] Kernel panic - not syncing: Fatal exception

full serial log and config attached. I'm excluding these latest commits 
from tip/master for now:

 339ce9a: ring-buffer: fix build error
 b6eeea4: ftrace: preempt disable over interrupt disable
 52abc82: ring_buffer: allocate buffer page pointer
 da78331: ftrace: type cast filter+verifier

i'm quite sure 52abc82 causes this problem.

Another 64-bit testbox crashed too meanwhile.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-02  9:06                                             ` [PATCH] ring_buffer: allocate buffer page pointer Andrew Morton
@ 2008-10-02  9:41                                               ` Ingo Molnar
  2008-10-02 13:06                                               ` Steven Rostedt
  1 sibling, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02  9:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Steven Rostedt, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 2 Oct 2008 10:50:30 +0200 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > 
> > > The current method of overlaying the page frame as the buffer page pointer
> > > can be very dangerous and limits our ability to do other things with
> > > a page from the buffer, like send it off to disk.
> > > 
> > > This patch allocates the buffer_page instead of overlaying the page's
> > > page frame. The use of the buffer_page has hardly changed due to this.
> > > 
> > > Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> > > ---
> > >  kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
> > >  1 file changed, 32 insertions(+), 22 deletions(-)
> > 
> > applied to tip/tracing/ftrace, with the extended changlog below - i 
> > think this commit warrants that extra mention.
> > 
> > 	Ingo
> > 
> > --------------->
> > >From da78331b4ced2763322d732ac5ba275965853bde Mon Sep 17 00:00:00 2001
> > From: Steven Rostedt <rostedt@goodmis.org>
> > Date: Wed, 1 Oct 2008 10:52:51 -0400
> > Subject: [PATCH] ftrace: type cast filter+verifier
> > 
> > The mmiotrace map had a bug that would typecast the entry from
> > the trace to the wrong type. That is a known danger of C typecasts,
> > there's absolutely zero checking done on them.
> > 
> > Help that problem a bit by using a GCC extension to implement a
> > type filter that restricts the types that a trace record can be
> > cast into, and by adding a dynamic check (in debug mode) to verify
> > the type of the entry.
> > 
> > This patch adds a macro to assign all entries of ftrace using the type
> > of the variable and checking the entry id. The typecasts are now done
> > in the macro for only those types that it knows about, which should
> > be all the types that are allowed to be read from the tracer.
> > 
> 
> I'm somewhat at a loss here because I'm unable to find any version of 
> kernel/trace/trace.c which looks anything like the one which is being 
> patched, but...

it's in tip/tracing/ring-buffer (also tip/master), but we are still 
working on it (i just triggered a crash with it), so i haven't pushed it 
out into the auto-ftrace-next branch yet.

> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
> >  	}
> >  	switch (entry->type) {
> >  	case TRACE_FN: {
> > -		struct ftrace_entry *field = (struct ftrace_entry *)entry;
> 
> Why was this code using a cast in the first place?  It should be using 
> entry->some_field_i_dont_have_here?  That was the whole point in using 
> the anonymous union in struct trace_entry?

this whole mega-thread was about removing that union and turning the 
tracer into a type-opaque entity. I warned about the inevitable 
fragility - but with this type filter approach the risks should be 
substantially lower.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: allocate buffer page pointer
  2008-10-02  9:06                                             ` [PATCH] ring_buffer: allocate buffer page pointer Andrew Morton
  2008-10-02  9:41                                               ` Ingo Molnar
@ 2008-10-02 13:06                                               ` Steven Rostedt
  1 sibling, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 13:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Thu, 2 Oct 2008, Andrew Morton wrote:
> > 
> > This patch adds a macro to assign all entries of ftrace using the type
> > of the variable and checking the entry id. The typecasts are now done
> > in the macro for only those types that it knows about, which should
> > be all the types that are allowed to be read from the tracer.
> > 
> 
> I'm somewhat at a loss here because I'm unable to find any version of
> kernel/trace/trace.c which looks anything like the one which is being
> patched, but...

As Ingo mentioned, you don't have this yet. And be happy that you don't 
;-)

This patch is to fix the patch that did this.

> 
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
> >  	}
> >  	switch (entry->type) {
> >  	case TRACE_FN: {
> > -		struct ftrace_entry *field = (struct ftrace_entry *)entry;
> 
> Why was this code using a cast in the first place?  It should be using
> entry->some_field_i_dont_have_here?  That was the whole point in using 
> the anonymous union in struct trace_entry?

Because the ring_buffer now allows variable-length entries, having a 
one-size-fits-all entry is not optimal.

But because C does no checking on typecasts, we have this macro to help 
solve the issue. Instead of registering everything in a single union and 
forcing every entry to be as big as the largest one, you register your 
type with the macro instead.
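
Adding a new record type then boils down to one IF_ASSIGN() line plus the 
usual declaration on the read side. Schematically (the names below are made 
up for the example):

	/* in trace_assign_type(), before __ftrace_bad_type(): */
	IF_ASSIGN(var, ent, struct my_tracer_entry, TRACE_MY_TRACER);

	/* in the print path: */
	struct my_tracer_entry *field;

	trace_assign_type(field, entry);
	/* field is now typed, and a mismatched id trips the WARN_ON() */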

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02  9:38                                                 ` [boot crash] " Ingo Molnar
@ 2008-10-02 13:16                                                   ` Steven Rostedt
  2008-10-02 13:17                                                   ` Steven Rostedt
  2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
  2 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 13:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Thu, 2 Oct 2008, Ingo Molnar wrote:
> [    0.332002] Kernel panic - not syncing: Fatal exception
> 
> full serial log and config attached. I'm excluding these latest commits 
> from tip/master for now:

Thanks I'll take a look at this.

> 
>  339ce9a: ring-buffer: fix build error
>  b6eeea4: ftrace: preempt disable over interrupt disable

The above "preempt disable" is the most likely culprit.  I'm trying to get 
towards an interrupt disabled free and lockless code path. But in doing 
so, one must be extra careful. This is why I'm taking baby steps towards 
this approach. Any little error in one of these steps, and you have race 
conditions biting you.

The above replaces interrupt disables with preempt disables, uses the
atomic data disable to protect against reentrancy. But this could also 
have opened up a bug that was not present with the interrupts disabled 
version.
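
In other words, the pattern being tested looks roughly like this. It is
only a sketch under assumed names (rb_reentry, rb_reserve_sketch), not
the code from the commit itself:

static DEFINE_PER_CPU(atomic_t, rb_reentry);

static void *rb_reserve_sketch(void)
{
	void *event = NULL;

	preempt_disable();		/* was local_irq_save() before */

	/*
	 * An interrupt can still arrive anywhere in here; the per-cpu
	 * counter keeps a nested writer on this CPU from corrupting
	 * the reserve/commit state.
	 */
	if (atomic_inc_return(&__get_cpu_var(rb_reentry)) != 1)
		goto out;

	/* ... reserve space on the per-cpu buffer page ... */

 out:
	atomic_dec(&__get_cpu_var(rb_reentry));
	preempt_enable();
	return event;
}

The difference from the interrupts-disabled version is exactly that an
interrupt may now land anywhere in this path, so every piece of state it
touches has to tolerate that.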


>  52abc82: ring_buffer: allocate buffer page pointer
>  da78331: ftrace: type cast filter+verifier

> 
> i'm quite sure 52abc82 causes this problem.

Hmm, that was a trivial patch. Perhaps the trivial ones are the most
likely to be error-prone. A developer will take much more care in
developing a complex patch than in something he sees as trivial ;-)

-- Steve

> 
> Another 64-bit testbox crashed too meanwhile.
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02  9:38                                                 ` [boot crash] " Ingo Molnar
  2008-10-02 13:16                                                   ` Steven Rostedt
@ 2008-10-02 13:17                                                   ` Steven Rostedt
  2008-10-02 15:50                                                     ` Ingo Molnar
  2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
  2 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 13:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Thu, 2 Oct 2008, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:

> full serial log and config attached. I'm excluding these latest commits 

 -ENOATTACHMENT

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02 13:17                                                   ` Steven Rostedt
@ 2008-10-02 15:50                                                     ` Ingo Molnar
  2008-10-02 18:27                                                       ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02 15:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo

[-- Attachment #1: Type: text/plain, Size: 347 bytes --]


* Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> On Thu, 2 Oct 2008, Ingo Molnar wrote:
> > * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > full serial log and config attached. I'm excluding these latest commits 
> 
>  -ENOATTACHMENT

attached.

You can get the broken tree by doing this in tip/master:

  git-merge tip/tracing/ring-buffer

	Ingo

[-- Attachment #2: crash.log --]
[-- Type: text/plain, Size: 16994 bytes --]

[    0.000000] Linux version 2.6.27-rc8-tip-01064-gd163d6b-dirty (mingo@titan) (gcc version 4.2.3) #1 SMP Thu Oct 2 11:21:04 CEST 2008
[    0.000000] Command line: root=/dev/sda1 earlyprintk=serial,ttyS0,115200 console=ttyS0,115200 console=tty 5 profile=0 debug initcall_debug apic=debug apic=verbose ignore_loglevel sysrq_always_enabled pci=nomsi
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Centaur CentaurHauls
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)
[    0.000000]  BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)
[    0.000000]  BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)
[    0.000000]  BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data)
[    0.000000]  BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)
[    0.000000] console [earlyser0] enabled
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] DMI 2.3 present.
[    0.000000] last_pfn = 0x3ff00 max_arch_pfn = 0x3ffffffff
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] Scanning 2 areas for low memory corruption
[    0.000000] modified physical RAM map:
[    0.000000]  modified: 0000000000000000 - 0000000000001000 (usable)
[    0.000000]  modified: 0000000000001000 - 0000000000006000 (reserved)
[    0.000000]  modified: 0000000000006000 - 0000000000008000 (usable)
[    0.000000]  modified: 0000000000008000 - 0000000000010000 (reserved)
[    0.000000]  modified: 0000000000010000 - 0000000000092c00 (usable)
[    0.000000]  modified: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  modified: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  modified: 0000000000100000 - 000000003ed94000 (usable)
[    0.000000]  modified: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)
[    0.000000]  modified: 000000003ee4e000 - 000000003fea2000 (usable)
[    0.000000]  modified: 000000003fea2000 - 000000003fee9000 (ACPI NVS)
[    0.000000]  modified: 000000003fee9000 - 000000003feed000 (usable)
[    0.000000]  modified: 000000003feed000 - 000000003feff000 (ACPI data)
[    0.000000]  modified: 000000003feff000 - 000000003ff00000 (usable)
[    0.000000] init_memory_mapping
[    0.000000]  0000000000 - 003fe00000 page 2M
[    0.000000]  003fe00000 - 003ff00000 page 4k
[    0.000000] kernel direct mapping tables up to 3ff00000 @ 10000-13000
[    0.000000] last_map_addr: 3ff00000 end: 3ff00000
[    0.000000] ACPI: RSDP 000FE020, 0014 (r0 INTEL )
[    0.000000] ACPI: RSDT 3FEFDE48, 0050 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: FACP 3FEFCF10, 0074 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: DSDT 3FEF8010, 3E70 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: FACS 3FEDFC40, 0040
[    0.000000] ACPI: APIC 3FEFCE10, 0078 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: WDDT 3FEF7F90, 0040 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: MCFG 3FEF7F10, 003C (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: ASF! 3FEFCD10, 00A6 (r32 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: HPET 3FEF7E90, 0038 (r1 INTEL  D975XBX       4B9 MSFT  1000013)
[    0.000000] ACPI: SSDT 3FEFDC10, 01BC (r1 INTEL     CpuPm      4B9 MSFT  1000013)
[    0.000000] ACPI: SSDT 3FEFDA10, 01B7 (r1 INTEL   Cpu0Ist      4B9 MSFT  1000013)
[    0.000000] ACPI: SSDT 3FEFD810, 01B7 (r1 INTEL   Cpu1Ist      4B9 MSFT  1000013)
[    0.000000] ACPI: SSDT 3FEFD610, 01B7 (r1 INTEL   Cpu2Ist      4B9 MSFT  1000013)
[    0.000000] ACPI: SSDT 3FEFD410, 01B7 (r1 INTEL   Cpu3Ist      4B9 MSFT  1000013)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at 0000000000000000-000000003ff00000
[    0.000000] Bootmem setup node 0 0000000000000000-000000003ff00000
[    0.000000]   NODE_DATA [0000000000013000 - 0000000000016fff]
[    0.000000]   bootmap [0000000000017000 -  000000000001efdf] pages 8
[    0.000000] (5 early reservations) ==> bootmem [0000000000 - 003ff00000]
[    0.000000]   #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
[    0.000000]   #1 [0000006000 - 0000008000]       TRAMPOLINE ==> [0000006000 - 0000008000]
[    0.000000]   #2 [0000200000 - 0000c68370]    TEXT DATA BSS ==> [0000200000 - 0000c68370]
[    0.000000]   #3 [000009fc00 - 0000100000]    BIOS reserved ==> [000009fc00 - 0000100000]
[    0.000000]   #4 [0000010000 - 0000013000]          PGTABLE ==> [0000010000 - 0000013000]
[    0.000000] Scan SMP from ffff880000000000 for 1024 bytes.
[    0.000000] Scan SMP from ffff88000009fc00 for 1024 bytes.
[    0.000000] Scan SMP from ffff8800000f0000 for 65536 bytes.
[    0.000000] found SMP MP-table at [ffff8800000fe680] 000fe680
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA      0x00000000 -> 0x00001000
[    0.000000]   DMA32    0x00001000 -> 0x00100000
[    0.000000]   Normal   0x00100000 -> 0x00100000
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[7] active PFN ranges
[    0.000000]     0: 0x00000000 -> 0x00000001
[    0.000000]     0: 0x00000006 -> 0x00000008
[    0.000000]     0: 0x00000010 -> 0x00000092
[    0.000000]     0: 0x00000100 -> 0x0003ed94
[    0.000000]     0: 0x0003ee4e -> 0x0003fea2
[    0.000000]     0: 0x0003fee9 -> 0x0003feed
[    0.000000]     0: 0x0003feff -> 0x0003ff00
[    0.000000] On node 0 totalpages: 261490
[    0.000000]   DMA zone: 1141 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 253489 pages, LIFO batch:31
[    0.000000] ACPI: PM-Timer IO Port: 0x408
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 2, version 0, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] SMP: Allowing 4 CPUs, 2 hotplug CPUs
[    0.000000] mapped APIC to ffffffffff5fc000 (fee00000)
[    0.000000] mapped IOAPIC to ffffffffff5fb000 (fec00000)
[    0.000000] Allocating PCI resources starting at 40000000 (gap: 3ff00000:c0100000)
[    0.000000] dyn_array irq_2_pin_head+0x0/0x4a8 size:0x10 nr:256 align:0x1000
[    0.000000] dyn_array irq_cfgx+0x0/0x8 size:0x28 nr:48 align:0x1000
[    0.000000] dyn_array irq_desc+0x0/0x8 size:0x100 nr:48 align:0x1000
[    0.000000] dyn_array irq_timer_state+0x0/0x10 size:0x8 nr:48 align:0x1000
[    0.000000] dyn_array total_size: 0x6000
[    0.000000] dyn_array irq_2_pin_head+0x0/0x4a8 ==> [0x100a000 - 0x100b000]
[    0.000000] dyn_array irq_cfgx+0x0/0x8 ==> [0x100b000 - 0x100b780]
[    0.000000] dyn_array irq_desc+0x0/0x8 ==> [0x100c000 - 0x100f000]
[    0.000000] kstat_irqs ==> [0x1010000 - 0x1010300]
[    0.000000] dyn_array irq_timer_state+0x0/0x10 ==> [0x100f000 - 0x100f180]
[    0.000000] PERCPU: Allocating 57344 bytes of per cpu data
[    0.000000] per cpu data for cpu0 on node0 at 0000000001011000
[    0.000000] per cpu data for cpu1 on node0 at 000000000101f000
[    0.000000] per cpu data for cpu2 on node0 at 000000000102d000
[    0.000000] per cpu data for cpu3 on node0 at 000000000103b000
[    0.000000] NR_CPUS: 8, nr_cpu_ids: 4, nr_node_ids 1
[    0.000000] Built 1 zonelists in Node order, mobility grouping on.  Total pages: 254630
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: root=/dev/sda1 earlyprintk=serial,ttyS0,115200 console=ttyS0,115200 console=tty 5 profile=0 debug initcall_debug apic=debug apic=verbose ignore_loglevel sysrq_always_enabled pci=nomsi
[    0.000000] debug: sysrq always enabled.
[    0.000000] Initializing CPU#0
[    0.000000] PID hash table entries: 4096 (order: 12, 32768 bytes)
[    0.000000] Fast TSC calibration using PIT
[    0.000000] Detected 2933.237 MHz processor.
[    0.004000] Console: colour VGA+ 80x25
[    0.004000] console handover: boot [earlyser0] -> real [tty0]
[    0.004000] console [ttyS0] enabled
[    0.004000] Scanning for low memory corruption every 60 seconds
[    0.004000] Checking aperture...
[    0.004000] No AGP bridge found
[    0.004000] Memory: 1018524k/1047552k available (5392k kernel code, 27436k reserved, 3199k data, 608k init)
[    0.004000] SLUB: Genslabs=13, HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.004000] hpet clockevent registered
[    0.004000] HPET: 3 timers in total, 0 timers will be used for per-cpu timer
[    0.004011] Calibrating delay loop (skipped), value calculated using timer frequency.. 5866.47 BogoMIPS (lpj=11732948)
[    0.012021] Security Framework initialized
[    0.016091] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.020676] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.024312] Mount-cache hash table entries: 256
[    0.028160] Initializing cgroup subsys debug
[    0.032004] Initializing cgroup subsys ns
[    0.036003] Initializing cgroup subsys memory
[    0.040021] CPU: L1 I cache: 32K, L1 D cache: 32K
[    0.045682] CPU: L2 cache: 4096K
[    0.048004] CPU 0/0x0 -> Node 0
[    0.052003] CPU: Physical Processor ID: 0
[    0.056002] CPU: Processor Core ID: 0
[    0.060007] CPU0: Thermal monitoring enabled (TM2)
[    0.064003] using mwait in idle threads.
[    0.069799] ACPI: Core revision 20080609
[    0.081336] Parsing all Control Methods:
[    0.084117] Table [DSDT](id 0001) - 529 Objects with 55 Devices 147 Methods 32 Regions
[    0.088143] Parsing all Control Methods:
[    0.092098] Table [SSDT](id 0002) - 10 Objects with 0 Devices 4 Methods 0 Regions
[    0.096087] Parsing all Control Methods:
[    0.100098] Table [SSDT](id 0003) - 5 Objects with 0 Devices 3 Methods 0 Regions
[    0.104087] Parsing all Control Methods:
[    0.108099] Table [SSDT](id 0004) - 5 Objects with 0 Devices 3 Methods 0 Regions
[    0.112086] Parsing all Control Methods:
[    0.116101] Table [SSDT](id 0005) - 5 Objects with 0 Devices 3 Methods 0 Regions
[    0.120087] Parsing all Control Methods:
[    0.124101] Table [SSDT](id 0006) - 5 Objects with 0 Devices 3 Methods 0 Regions
[    0.128003]  tbxface-0607 [02] tb_load_namespace     : ACPI Tables successfully acquired
[    0.137205] evxfevnt-0091 [02] enable                : Transition to ACPI mode successful
[    0.144052] Setting APIC routing to flat
[    0.148005] enabled ExtINT on CPU#0
[    0.152139] ENABLING IO-APIC IRQs
[    0.156003] init IO_APIC IRQs
[    0.158984]  2-0 (apicid-pin) not connected
[    0.162551] IOAPIC[0]: Set routing entry (2-1 -> 0x31 -> IRQ 1 Mode:0 Active:0)
[    0.164005] IOAPIC[0]: Set routing entry (2-2 -> 0x30 -> IRQ 0 Mode:0 Active:0)
[    0.168004] IOAPIC[0]: Set routing entry (2-3 -> 0x33 -> IRQ 3 Mode:0 Active:0)
[    0.172004] IOAPIC[0]: Set routing entry (2-4 -> 0x34 -> IRQ 4 Mode:0 Active:0)
[    0.176003] IOAPIC[0]: Set routing entry (2-5 -> 0x35 -> IRQ 5 Mode:0 Active:0)
[    0.180004] IOAPIC[0]: Set routing entry (2-6 -> 0x36 -> IRQ 6 Mode:0 Active:0)
[    0.184004] IOAPIC[0]: Set routing entry (2-7 -> 0x37 -> IRQ 7 Mode:0 Active:0)
[    0.188003] IOAPIC[0]: Set routing entry (2-8 -> 0x38 -> IRQ 8 Mode:0 Active:0)
[    0.192004] IOAPIC[0]: Set routing entry (2-9 -> 0x39 -> IRQ 9 Mode:1 Active:0)
[    0.196004] IOAPIC[0]: Set routing entry (2-10 -> 0x3a -> IRQ 10 Mode:0 Active:0)
[    0.200004] IOAPIC[0]: Set routing entry (2-11 -> 0x3b -> IRQ 11 Mode:0 Active:0)
[    0.204004] IOAPIC[0]: Set routing entry (2-12 -> 0x3c -> IRQ 12 Mode:0 Active:0)
[    0.208004] IOAPIC[0]: Set routing entry (2-13 -> 0x3d -> IRQ 13 Mode:0 Active:0)
[    0.212004] IOAPIC[0]: Set routing entry (2-14 -> 0x3e -> IRQ 14 Mode:0 Active:0)
[    0.216004] IOAPIC[0]: Set routing entry (2-15 -> 0x3f -> IRQ 15 Mode:0 Active:0)
[    0.220003]  2-16 2-17 2-18 2-19 2-20 2-21 2-22 2-23 (apicid-pin) not connected
[    0.228132] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.272252] CPU0: Intel(R) Core(TM)2 CPU         E6800  @ 2.93GHz stepping 05
[    0.277251] Using local APIC timer interrupts.
[    0.277252] calibrating APIC timer ...
[    0.284001] ... lapic delta = 1666638
[    0.284001] ... PM timer delta = 357949
[    0.284001] ... PM timer result ok
[    0.284001] ..... delta 1666638
[    0.284001] ..... mult: 71577083
[    0.284001] ..... calibration result: 1066648
[    0.284001] ..... CPU clock speed is 2933.1156 MHz.
[    0.284001] ..... host bus clock speed is 266.2648 MHz.
[    0.284012] calling  migration_init+0x0/0x5b @ 1
[    0.288026] initcall migration_init+0x0/0x5b returned 1 after 0 msecs
[    0.292003] initcall migration_init+0x0/0x5b returned with error code 1 
[    0.300003] calling  spawn_ksoftirqd+0x0/0x58 @ 1
[    0.304017] initcall spawn_ksoftirqd+0x0/0x58 returned 0 after 0 msecs
[    0.308005] calling  init_call_single_data+0x0/0x52 @ 1
[    0.312003] initcall init_call_single_data+0x0/0x52 returned 0 after 0 msecs
[    0.316003] calling  relay_init+0x0/0x14 @ 1
[    0.320003] initcall relay_init+0x0/0x14 returned 0 after 0 msecs
[    0.324003] calling  tracer_alloc_buffers+0x0/0x14a @ 1
[    0.328008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[    0.332001] IP: [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] PGD 0 
[    0.332001] Oops: 0000 [1] SMP 
[    0.332001] CPU 0 
[    0.332001] Modules linked in:
[    0.332001] Pid: 1, comm: swapper Not tainted 2.6.27-rc8-tip-01064-gd163d6b-dirty #1
[    0.332001] RIP: 0010:[<ffffffff8027d28b>]  [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] RSP: 0018:ffff88003f9d7de0  EFLAGS: 00010287
[    0.332001] RAX: 0000000000000000 RBX: ffffffff80b08404 RCX: 0000000000000067
[    0.332001] RDX: 0000000000000004 RSI: 00000000000080d0 RDI: ffffffffffffffc0
[    0.332001] RBP: ffff88003f9d7e80 R08: ffff88003f8010b4 R09: 000000000003db02
[    0.332001] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003f801600
[    0.332001] R13: 0000000000000004 R14: ffff88003f801580 R15: ffff88003f801618
[    0.332001] FS:  0000000000000000(0000) GS:ffffffff80a68280(0000) knlGS:0000000000000000
[    0.332001] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.332001] CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
[    0.332001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.332001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.332001] Process swapper (pid: 1, threadinfo ffff88003f9d6000, task ffff88003f9d8000)
[    0.332001] Stack:  ffff88003f9d7df0 ffff88003f9d7e40 0000000000000283 ffffffff80b08404
[    0.332001]  ffffffff80b08404 ffff88003f801598 0000000000000000 ffff88003f801598
[    0.332001]  ffff88003f801580 0000016000000000 ffff88003f801600 ffff88003f9a2a40
[    0.332001] Call Trace:
[    0.332001]  [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a
[    0.332001]  [<ffffffff80a95f67>] tracer_alloc_buffers+0x26/0x14a
[    0.332001]  [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a
[    0.332001]  [<ffffffff80209056>] do_one_initcall+0x56/0x144
[    0.332001]  [<ffffffff80a87d4a>] ? native_smp_prepare_cpus+0x2aa/0x2ef
[    0.332001]  [<ffffffff80a7c8ce>] kernel_init+0x69/0x20e
[    0.332001]  [<ffffffff8020d4e9>] child_rip+0xa/0x11
[    0.332001]  [<ffffffff80257896>] ? __atomic_notifier_call_chain+0xd/0xf
[    0.332001]  [<ffffffff80a7c865>] ? kernel_init+0x0/0x20e
[    0.332001]  [<ffffffff8020d4df>] ? child_rip+0x0/0x11
[    0.332001] Code: 48 8b 05 d9 b2 7e 00 49 63 d5 48 63 0d 1b b2 7e 00 48 8b 9d 78 ff ff ff be d0 80 00 00 48 8b 04 d0 48 89 cf 48 83 c1 27 48 f7 df <48> 8b 40 08 48 21 cf 8b 14 03 e8 4e b5 02 00 48 85 c0 48 89 c3 
[    0.332001] RIP  [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001]  RSP <ffff88003f9d7de0>
[    0.332001] CR2: 0000000000000008
[    0.332002] Kernel panic - not syncing: Fatal exception

[-- Attachment #3: config-Thu_Oct__2_11_00_18_CEST_2008.bad --]
[-- Type: text/plain, Size: 51332 bytes --]

# d3ff3924
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.27-rc8
# Thu Oct  2 11:00:18 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_AOUT=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_BOOTPARAM_SUPPORT_NOT_WANTED=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_BOOT_ALLOWED4=y
CONFIG_BROKEN_BOOT_ALLOWED3=y
# CONFIG_BROKEN_BOOT_ALLOWED2 is not set
# CONFIG_BROKEN_BOOT_EUROPE is not set
# CONFIG_BROKEN_BOOT_TITAN is not set
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
# CONFIG_SYSVIPC is not set
# CONFIG_POSIX_MQUEUE is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
# CONFIG_TASK_DELAY_ACCT is not set
# CONFIG_TASK_XACCT is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=21
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_MM_OWNER=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_SYSFS_DEPRECATED_V2 is not set
CONFIG_PROC_PID_CPUSET=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_FASTBOOT=y
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
# CONFIG_BUG is not set
CONFIG_ELF_CORE=y
# CONFIG_PCSPKR_PLATFORM is not set
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
# CONFIG_VM_EVENT_COUNTERS is not set
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_MARKERS is not set
CONFIG_OPROFILE=y
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_HAVE_DMA_ATTRS is not set
CONFIG_USE_GENERIC_SMP_HELPERS=y
# CONFIG_HAVE_CLK is not set
CONFIG_HAVE_DYN_ARRAY=y
# CONFIG_PROC_PAGE_MONITOR is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_MODULES is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP_SUPPORT=y
CONFIG_HAVE_SPARSE_IRQ=y
# CONFIG_X86_MPPARSE is not set
# CONFIG_UP_WANTED_1 is not set
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=128
CONFIG_X86_INTERNODE_CACHE_BYTES=128
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
# CONFIG_CPU_SUP_INTEL is not set
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
# CONFIG_X86_PTRACE_BTS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
# CONFIG_SCHED_MC is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
# CONFIG_X86_MCE_AMD is not set
CONFIG_I8K=y
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
CONFIG_NUMA_EMU=y
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y

#
# Memory hotplug is currently incompatible with Software Suspend
#
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MIGRATION=y
CONFIG_RESOURCES_64BIT=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW_64K=y
# CONFIG_MTRR is not set
CONFIG_EFI=y
# CONFIG_SECCOMP is not set
CONFIG_CC_STACKPROTECTOR_ALL=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y

#
# Power management options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
# CONFIG_SUSPEND is not set
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
# CONFIG_ACPI_SYSFS_POWER is not set
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=y
# CONFIG_ACPI_BATTERY is not set
# CONFIG_ACPI_BUTTON is not set
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_BAY=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_WMI=y
CONFIG_ACPI_ASUS=y
CONFIG_ACPI_TOSHIBA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_PCI_SLOT=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
# CONFIG_CPU_FREQ_STAT is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_P4_CLOCKMOD=y

#
# shared options
#
CONFIG_X86_ACPI_CPUFREQ_PROC_INTF=y
CONFIG_X86_SPEEDSTEP_LIB=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
# CONFIG_PCI_MMCONFIG is not set
CONFIG_PCI_DOMAINS=y
CONFIG_DMAR=y
# CONFIG_DMAR_GFX_WA is not set
CONFIG_DMAR_FLOPPY_WA=y
CONFIG_INTR_REMAP=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=y
CONFIG_HOTPLUG_PCI_CPCI=y
CONFIG_HOTPLUG_PCI_CPCI_ZT5550=y
CONFIG_HOTPLUG_PCI_CPCI_GENERIC=y
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=y
CONFIG_NET_KEY=y
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_MULTIPLE_TABLES is not set
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_VERBOSE is not set
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=y
CONFIG_NET_IPGRE=y
# CONFIG_NET_IPGRE_BROADCAST is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
# CONFIG_IP_PIMSM_V2 is not set
CONFIG_ARPD=y
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_XFRM_TUNNEL=y
CONFIG_INET_TUNNEL=y
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
# CONFIG_IPV6 is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=y
# CONFIG_SCTP_DBG_MSG is not set
CONFIG_SCTP_DBG_OBJCNT=y
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
CONFIG_TIPC=y
# CONFIG_TIPC_ADVANCED is not set
# CONFIG_TIPC_DEBUG is not set
CONFIG_ATM=y
CONFIG_ATM_CLIP=y
CONFIG_ATM_CLIP_NO_ICMP=y
CONFIG_ATM_LANE=y
CONFIG_ATM_MPOA=y
CONFIG_ATM_BR2684=y
# CONFIG_ATM_BR2684_IPFILTER is not set
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_VLAN_8021Q=y
# CONFIG_VLAN_8021Q_GVRP is not set
CONFIG_DECNET=y
CONFIG_DECNET_ROUTER=y
CONFIG_LLC=y
CONFIG_LLC2=y
CONFIG_IPX=y
CONFIG_IPX_INTERN=y
CONFIG_ATALK=y
CONFIG_DEV_APPLETALK=y
# CONFIG_IPDDP is not set
CONFIG_X25=y
CONFIG_LAPB=y
CONFIG_ECONET=y
CONFIG_ECONET_AUNUDP=y
CONFIG_ECONET_NATIVE=y
# CONFIG_WAN_ROUTER is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
CONFIG_NET_SCH_HFSC=y
# CONFIG_NET_SCH_ATM is not set
CONFIG_NET_SCH_PRIO=y
# CONFIG_NET_SCH_RED is not set
CONFIG_NET_SCH_SFQ=y
CONFIG_NET_SCH_TEQL=y
CONFIG_NET_SCH_TBF=y
CONFIG_NET_SCH_GRED=y
# CONFIG_NET_SCH_DSMARK is not set
CONFIG_NET_SCH_NETEM=y
# CONFIG_NET_SCH_INGRESS is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
CONFIG_NET_CLS_TCINDEX=y
CONFIG_NET_CLS_ROUTE4=y
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=y
CONFIG_NET_CLS_U32=y
# CONFIG_CLS_U32_PERF is not set
CONFIG_CLS_U32_MARK=y
# CONFIG_NET_CLS_RSVP is not set
CONFIG_NET_CLS_RSVP6=y
CONFIG_NET_CLS_FLOW=y
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
CONFIG_NET_EMATCH_NBYTE=y
CONFIG_NET_EMATCH_U32=y
CONFIG_NET_EMATCH_META=y
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
CONFIG_NET_ACT_GACT=y
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=y
CONFIG_NET_ACT_NAT=y
CONFIG_NET_ACT_PEDIT=y
CONFIG_NET_ACT_SIMP=y
# CONFIG_NET_CLS_IND is not set
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
CONFIG_AX25=y
CONFIG_AX25_DAMA_SLAVE=y
# CONFIG_NETROM is not set
CONFIG_ROSE=y

#
# AX.25 network device drivers
#
# CONFIG_MKISS is not set
# CONFIG_6PACK is not set
CONFIG_BPQETHER=y
CONFIG_BAYCOM_SER_FDX=y
CONFIG_BAYCOM_SER_HDX=y
# CONFIG_YAM is not set
CONFIG_CAN=y
CONFIG_CAN_RAW=y
# CONFIG_CAN_BCM is not set

#
# CAN Device Drivers
#
# CONFIG_CAN_VCAN is not set
CONFIG_CAN_DEBUG_DEVICES=y
CONFIG_IRDA=y

#
# IrDA protocols
#
CONFIG_IRLAN=y
CONFIG_IRNET=y
CONFIG_IRCOMM=y
CONFIG_IRDA_ULTRA=y

#
# IrDA options
#
# CONFIG_IRDA_CACHE_LAST_LSAP is not set
CONFIG_IRDA_FAST_RR=y
CONFIG_IRDA_DEBUG=y

#
# Infrared-port device drivers
#

#
# SIR device drivers
#
CONFIG_IRTTY_SIR=y

#
# Dongle support
#
CONFIG_DONGLE=y
CONFIG_ESI_DONGLE=y
CONFIG_ACTISYS_DONGLE=y
CONFIG_TEKRAM_DONGLE=y
CONFIG_TOIM3232_DONGLE=y
CONFIG_LITELINK_DONGLE=y
CONFIG_MA600_DONGLE=y
CONFIG_GIRBIL_DONGLE=y
CONFIG_MCP2120_DONGLE=y
CONFIG_OLD_BELKIN_DONGLE=y
CONFIG_ACT200L_DONGLE=y
CONFIG_KINGSUN_DONGLE=y
CONFIG_KSDAZZLE_DONGLE=y
CONFIG_KS959_DONGLE=y

#
# FIR device drivers
#
CONFIG_USB_IRDA=y
# CONFIG_SIGMATEL_FIR is not set
CONFIG_NSC_FIR=y
CONFIG_WINBOND_FIR=y
# CONFIG_SMC_IRCC_FIR is not set
# CONFIG_ALI_FIR is not set
CONFIG_VLSI_FIR=y
# CONFIG_VIA_FIR is not set
# CONFIG_MCS_FIR is not set
CONFIG_BT=y
# CONFIG_BT_L2CAP is not set
CONFIG_BT_SCO=y

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=y
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
CONFIG_BT_HCIBPA10X=y
# CONFIG_BT_HCIBFUSB is not set
CONFIG_BT_HCIVHCI=y
CONFIG_AF_RXRPC=y
# CONFIG_AF_RXRPC_DEBUG is not set
CONFIG_RXKAD=y
CONFIG_FIB_RULES=y

#
# Wireless
#
CONFIG_CFG80211=y
CONFIG_NL80211=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
CONFIG_MAC80211=y

#
# Rate control algorithm selection
#
CONFIG_MAC80211_RC_PID=y
CONFIG_MAC80211_RC_DEFAULT_PID=y
CONFIG_MAC80211_RC_DEFAULT="pid"
# CONFIG_MAC80211_MESH is not set
# CONFIG_MAC80211_LEDS is not set
CONFIG_MAC80211_DEBUGFS=y
CONFIG_MAC80211_DEBUG_MENU=y
# CONFIG_MAC80211_DEBUG_PACKET_ALIGNMENT is not set
# CONFIG_MAC80211_NOINLINE is not set
CONFIG_MAC80211_VERBOSE_DEBUG=y
CONFIG_MAC80211_HT_DEBUG=y
CONFIG_MAC80211_TKIP_DEBUG=y
CONFIG_MAC80211_IBSS_DEBUG=y
CONFIG_MAC80211_VERBOSE_PS_DEBUG=y
CONFIG_MAC80211_LOWTX_FRAME_DUMP=y
CONFIG_MAC80211_DEBUG_COUNTERS=y
CONFIG_MAC80211_VERBOSE_SPECT_MGMT_DEBUG=y
CONFIG_IEEE80211=y
CONFIG_IEEE80211_DEBUG=y
# CONFIG_IEEE80211_CRYPT_WEP is not set
CONFIG_IEEE80211_CRYPT_CCMP=y
CONFIG_IEEE80211_CRYPT_TKIP=y
CONFIG_RFKILL=y
CONFIG_RFKILL_INPUT=y
CONFIG_RFKILL_LEDS=y
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_DEBUG is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=y
CONFIG_BLK_CPQ_DA=y
CONFIG_BLK_CPQ_CISS_DA=y
# CONFIG_CISS_SCSI_TAPE is not set
CONFIG_BLK_DEV_DAC960=y
CONFIG_BLK_DEV_UMEM=y
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_CRYPTOLOOP=y
CONFIG_BLK_DEV_NBD=y
CONFIG_BLK_DEV_SX8=y
CONFIG_BLK_DEV_UB=y
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_XIP=y
# CONFIG_CDROM_PKTCDVD is not set
CONFIG_ATA_OVER_ETH=y
CONFIG_VIRTIO_BLK=y
CONFIG_BLK_DEV_HD=y
CONFIG_MISC_DEVICES=y
CONFIG_IBM_ASM=y
CONFIG_PHANTOM=y
CONFIG_EEPROM_93CX6=y
CONFIG_SGI_IOC4=y
CONFIG_TIFM_CORE=y
CONFIG_TIFM_7XX1=y
# CONFIG_ACER_WMI is not set
CONFIG_FUJITSU_LAPTOP=y
CONFIG_FUJITSU_LAPTOP_DEBUG=y
# CONFIG_HP_WMI is not set
CONFIG_MSI_LAPTOP=y
# CONFIG_COMPAL_LAPTOP is not set
CONFIG_SONY_LAPTOP=y
# CONFIG_SONYPI_COMPAT is not set
# CONFIG_THINKPAD_ACPI is not set
CONFIG_INTEL_MENLOW=y
CONFIG_ENCLOSURE_SERVICES=y
# CONFIG_SGI_XP is not set
CONFIG_HP_ILO=y
# CONFIG_SGI_GRU is not set
CONFIG_HAVE_IDE=y

#
# SCSI device support
#
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=y
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=y
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_ENCLOSURE=y

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=y
CONFIG_SCSI_SAS_ATTRS=y
CONFIG_SCSI_SRP_ATTRS=y
CONFIG_SCSI_SRP_TGT_ATTRS=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=y
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
CONFIG_SCSI_3W_9XXX=y
CONFIG_SCSI_ACARD=y
CONFIG_SCSI_AACRAID=y
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
CONFIG_SCSI_AIC7XXX_OLD=y
# CONFIG_SCSI_AIC79XX is not set
CONFIG_SCSI_DPT_I2O=y
CONFIG_SCSI_ADVANSYS=y
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
CONFIG_MEGARAID_LEGACY=y
CONFIG_MEGARAID_SAS=y
# CONFIG_SCSI_HPTIOP is not set
CONFIG_SCSI_BUSLOGIC=y
CONFIG_SCSI_DMX3191D=y
CONFIG_SCSI_EATA=y
CONFIG_SCSI_EATA_TAGGED_QUEUE=y
CONFIG_SCSI_EATA_LINKED_COMMANDS=y
CONFIG_SCSI_EATA_MAX_TAGS=16
CONFIG_SCSI_FUTURE_DOMAIN=y
# CONFIG_SCSI_GDTH is not set
CONFIG_SCSI_IPS=y
CONFIG_SCSI_INITIO=y
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
CONFIG_SCSI_IPR=y
# CONFIG_SCSI_IPR_TRACE is not set
CONFIG_SCSI_IPR_DUMP=y
CONFIG_SCSI_QLOGIC_1280=y
# CONFIG_SCSI_QLA_FC is not set
CONFIG_SCSI_QLA_ISCSI=y
CONFIG_SCSI_LPFC=y
CONFIG_SCSI_DC395x=y
CONFIG_SCSI_DC390T=y
# CONFIG_SCSI_SRP is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
CONFIG_SCSI_DH_HP_SW=y
# CONFIG_SCSI_DH_EMC is not set
CONFIG_SCSI_DH_ALUA=y
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
# CONFIG_ATA_ACPI is not set
# CONFIG_SATA_PMP is not set
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=y
CONFIG_ATA_SFF=y
CONFIG_SATA_SVW=y
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=y
CONFIG_SATA_NV=y
CONFIG_PDC_ADMA=y
CONFIG_SATA_QSTOR=y
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
CONFIG_SATA_SIL=y
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
CONFIG_SATA_VIA=y
# CONFIG_SATA_VITESSE is not set
CONFIG_SATA_INIC162X=y
CONFIG_PATA_ALI=y
CONFIG_PATA_AMD=y
CONFIG_PATA_ARTOP=y
CONFIG_PATA_ATIIXP=y
CONFIG_PATA_CMD640_PCI=y
# CONFIG_PATA_CMD64X is not set
CONFIG_PATA_CS5520=y
# CONFIG_PATA_CS5530 is not set
CONFIG_PATA_CYPRESS=y
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=y
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
CONFIG_PATA_HPT3X2N=y
CONFIG_PATA_HPT3X3=y
CONFIG_PATA_HPT3X3_DMA=y
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
CONFIG_PATA_JMICRON=y
CONFIG_PATA_TRIFLEX=y
# CONFIG_PATA_MARVELL is not set
CONFIG_PATA_MPIIX=y
CONFIG_PATA_OLDPIIX=y
CONFIG_PATA_NETCELL=y
CONFIG_PATA_NINJA32=y
CONFIG_PATA_NS87410=y
CONFIG_PATA_NS87415=y
CONFIG_PATA_OPTI=y
CONFIG_PATA_OPTIDMA=y
# CONFIG_PATA_PDC_OLD is not set
CONFIG_PATA_RADISYS=y
# CONFIG_PATA_RZ1000 is not set
CONFIG_PATA_SC1200=y
CONFIG_PATA_SERVERWORKS=y
CONFIG_PATA_PDC2027X=y
CONFIG_PATA_SIL680=y
CONFIG_PATA_SIS=y
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
CONFIG_PATA_PLATFORM=y
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=y
CONFIG_DM_DEBUG=y
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_DELAY=y
CONFIG_DM_UEVENT=y
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
CONFIG_FIREWIRE=y
CONFIG_FIREWIRE_OHCI=y
CONFIG_FIREWIRE_OHCI_DEBUG=y
CONFIG_FIREWIRE_SBP2=y
CONFIG_IEEE1394=y
CONFIG_IEEE1394_OHCI1394=y
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_SBP2=y
CONFIG_IEEE1394_SBP2_PHYS_DMA=y
CONFIG_IEEE1394_ETH1394_ROM_ENTRY=y
CONFIG_IEEE1394_ETH1394=y
CONFIG_IEEE1394_RAWIO=y
CONFIG_IEEE1394_VIDEO1394=y
# CONFIG_IEEE1394_DV1394 is not set
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
CONFIG_I2O=y
CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_EXT_ADAPTEC_DMA64=y
# CONFIG_I2O_CONFIG is not set
# CONFIG_I2O_BUS is not set
# CONFIG_I2O_BLOCK is not set
CONFIG_I2O_SCSI=y
CONFIG_I2O_PROC=y
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_IFB=y
CONFIG_DUMMY=y
CONFIG_BONDING=y
CONFIG_MACVLAN=y
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
CONFIG_VETH=y
# CONFIG_NET_SB1000 is not set
CONFIG_ARCNET=y
CONFIG_ARCNET_1201=y
CONFIG_ARCNET_1051=y
CONFIG_ARCNET_RAW=y
CONFIG_ARCNET_CAP=y
CONFIG_ARCNET_COM90xx=y
CONFIG_ARCNET_COM90xxIO=y
# CONFIG_ARCNET_RIM_I is not set
# CONFIG_ARCNET_COM20020 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
CONFIG_DAVICOM_PHY=y
CONFIG_QSEMI_PHY=y
CONFIG_LXT_PHY=y
CONFIG_CICADA_PHY=y
CONFIG_VITESSE_PHY=y
CONFIG_SMSC_PHY=y
CONFIG_BROADCOM_PHY=y
# CONFIG_ICPLUS_PHY is not set
CONFIG_REALTEK_PHY=y
# CONFIG_FIXED_PHY is not set
CONFIG_MDIO_BITBANG=y
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
CONFIG_HAPPYMEAL=y
CONFIG_SUNGEM=y
CONFIG_CASSINI=y
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_VORTEX=y
# CONFIG_NET_TULIP is not set
CONFIG_HP100=y
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=y
# CONFIG_AMD8111_ETH is not set
CONFIG_ADAPTEC_STARFIRE=y
CONFIG_B44=y
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=y
# CONFIG_FORCEDETH_NAPI is not set
CONFIG_EEPRO100=y
CONFIG_E100=y
CONFIG_FEALNX=y
CONFIG_NATSEMI=y
# CONFIG_NE2K_PCI is not set
CONFIG_8139CP=y
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
CONFIG_8139TOO_TUNE_TWISTER=y
# CONFIG_8139TOO_8129 is not set
CONFIG_8139_OLD_RX_RESET=y
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
CONFIG_EPIC100=y
CONFIG_SUNDANCE=y
# CONFIG_SUNDANCE_MMIO is not set
CONFIG_TLAN=y
CONFIG_VIA_RHINE=y
CONFIG_VIA_RHINE_MMIO=y
CONFIG_SC92031=y
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=y
CONFIG_ACENIC_OMIT_TIGON_I=y
# CONFIG_DL2K is not set
CONFIG_E1000=y
CONFIG_E1000_DISABLE_PACKET_SPLIT=y
CONFIG_E1000E=y
CONFIG_IP1000=y
CONFIG_IGB=y
# CONFIG_IGB_LRO is not set
CONFIG_NS83820=y
CONFIG_HAMACHI=y
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=y
CONFIG_R8169_VLAN=y
CONFIG_SIS190=y
# CONFIG_SKGE is not set
CONFIG_SKY2=y
CONFIG_SKY2_DEBUG=y
CONFIG_VIA_VELOCITY=y
CONFIG_TIGON3=y
CONFIG_BNX2=y
# CONFIG_QLA3XXX is not set
CONFIG_ATL1=y
CONFIG_ATL1E=y
# CONFIG_NETDEV_10000 is not set
CONFIG_MLX4_CORE=y
# CONFIG_TR is not set

#
# Wireless LAN
#
CONFIG_WLAN_PRE80211=y
CONFIG_STRIP=y
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
CONFIG_USB_KAWETH=y
CONFIG_USB_PEGASUS=y
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_CDCETHER=y
CONFIG_USB_NET_DM9601=y
CONFIG_USB_NET_GL620A=y
CONFIG_USB_NET_NET1080=y
CONFIG_USB_NET_PLUSB=y
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
# CONFIG_USB_NET_CDC_SUBSET is not set
CONFIG_USB_NET_ZAURUS=y
# CONFIG_USB_HSO is not set
CONFIG_WAN=y
# CONFIG_LANMEDIA is not set
CONFIG_HDLC=y
CONFIG_HDLC_RAW=y
# CONFIG_HDLC_RAW_ETH is not set
CONFIG_HDLC_CISCO=y
CONFIG_HDLC_FR=y
CONFIG_HDLC_PPP=y
CONFIG_HDLC_X25=y
CONFIG_PCI200SYN=y
CONFIG_WANXL=y
# CONFIG_PC300 is not set
CONFIG_PC300TOO=y
# CONFIG_FARSYNC is not set
# CONFIG_DLCI is not set
CONFIG_LAPBETHER=y
CONFIG_X25_ASY=y
CONFIG_SBNI=y
CONFIG_SBNI_MULTILINE=y
CONFIG_ATM_DRIVERS=y
CONFIG_ATM_DUMMY=y
# CONFIG_ATM_TCP is not set
CONFIG_ATM_LANAI=y
CONFIG_ATM_ENI=y
# CONFIG_ATM_ENI_DEBUG is not set
CONFIG_ATM_ENI_TUNE_BURST=y
CONFIG_ATM_ENI_BURST_TX_16W=y
CONFIG_ATM_ENI_BURST_TX_8W=y
# CONFIG_ATM_ENI_BURST_TX_4W is not set
CONFIG_ATM_ENI_BURST_TX_2W=y
CONFIG_ATM_ENI_BURST_RX_16W=y
CONFIG_ATM_ENI_BURST_RX_8W=y
CONFIG_ATM_ENI_BURST_RX_4W=y
CONFIG_ATM_ENI_BURST_RX_2W=y
CONFIG_ATM_FIRESTREAM=y
CONFIG_ATM_ZATM=y
CONFIG_ATM_ZATM_DEBUG=y
CONFIG_ATM_IDT77252=y
CONFIG_ATM_IDT77252_DEBUG=y
# CONFIG_ATM_IDT77252_RCV_ALL is not set
CONFIG_ATM_IDT77252_USE_SUNI=y
# CONFIG_ATM_AMBASSADOR is not set
CONFIG_ATM_HORIZON=y
CONFIG_ATM_HORIZON_DEBUG=y
# CONFIG_ATM_IA is not set
CONFIG_ATM_FORE200E=y
CONFIG_ATM_FORE200E_USE_TASKLET=y
CONFIG_ATM_FORE200E_TX_RETRY=16
CONFIG_ATM_FORE200E_DEBUG=0
CONFIG_ATM_HE=y
CONFIG_ATM_HE_USE_SUNI=y
CONFIG_FDDI=y
CONFIG_DEFXX=y
# CONFIG_DEFXX_MMIO is not set
CONFIG_SKFP=y
CONFIG_HIPPI=y
CONFIG_ROADRUNNER=y
# CONFIG_ROADRUNNER_LARGE_RINGS is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
# CONFIG_PPP_FILTER is not set
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
# CONFIG_PPP_DEFLATE is not set
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=y
# CONFIG_PPPOATM is not set
CONFIG_PPPOL2TP=y
CONFIG_SLIP=y
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=y
# CONFIG_SLIP_SMART is not set
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_VIRTIO_NET=y
# CONFIG_ISDN is not set
CONFIG_PHONE=y

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_EVDEV is not set
CONFIG_INPUT_EVBUG=y

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_SUNKBD=y
CONFIG_KEYBOARD_LKKBD=y
CONFIG_KEYBOARD_XTKBD=y
CONFIG_KEYBOARD_NEWTON=y
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_INPUT_MOUSE is not set
CONFIG_INPUT_JOYSTICK=y
CONFIG_JOYSTICK_ANALOG=y
CONFIG_JOYSTICK_A3D=y
# CONFIG_JOYSTICK_ADI is not set
CONFIG_JOYSTICK_COBRA=y
CONFIG_JOYSTICK_GF2K=y
CONFIG_JOYSTICK_GRIP=y
CONFIG_JOYSTICK_GRIP_MP=y
CONFIG_JOYSTICK_GUILLEMOT=y
CONFIG_JOYSTICK_INTERACT=y
CONFIG_JOYSTICK_SIDEWINDER=y
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
CONFIG_JOYSTICK_WARRIOR=y
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
CONFIG_JOYSTICK_SPACEBALL=y
CONFIG_JOYSTICK_STINGER=y
CONFIG_JOYSTICK_TWIDJOY=y
CONFIG_JOYSTICK_ZHENHUA=y
CONFIG_JOYSTICK_JOYDUMP=y
CONFIG_JOYSTICK_XPAD=y
CONFIG_JOYSTICK_XPAD_FF=y
# CONFIG_JOYSTICK_XPAD_LEDS is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_APANEL=y
# CONFIG_INPUT_ATLAS_BTNS is not set
CONFIG_INPUT_ATI_REMOTE=y
CONFIG_INPUT_ATI_REMOTE2=y
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
CONFIG_INPUT_YEALINK=y
CONFIG_INPUT_UINPUT=y

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=y
CONFIG_GAMEPORT=y
# CONFIG_GAMEPORT_NS558 is not set
# CONFIG_GAMEPORT_L4 is not set
CONFIG_GAMEPORT_EMU10K1=y
CONFIG_GAMEPORT_FM801=y

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
# CONFIG_SERIAL_NONSTANDARD is not set
CONFIG_NOZOMI=y

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
# CONFIG_HW_RANDOM is not set
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
CONFIG_MWAVE=y
CONFIG_PC8736x_GPIO=y
CONFIG_NSC_GPIO=y
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=256
CONFIG_HPET=y
CONFIG_HPET_RTC_IRQ=y
# CONFIG_HPET_MMAP is not set
CONFIG_HANGCHECK_TIMER=y
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
CONFIG_TCG_INFINEON=y
CONFIG_TELCLOCK=y
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=y
# CONFIG_I2C_HELPER_AUTO is not set

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=y
# CONFIG_I2C_ALGOPCF is not set
CONFIG_I2C_ALGOPCA=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
CONFIG_I2C_ALI1563=y
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=y
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=y
CONFIG_I2C_NFORCE2=y
CONFIG_I2C_SIS5595=y
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=y
CONFIG_I2C_VIA=y
CONFIG_I2C_VIAPRO=y

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_GPIO=y
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_SIMTEC=y

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT_LIGHT=y
CONFIG_I2C_TAOS_EVM=y
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=y

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_PLATFORM=y

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
CONFIG_AT24=y
CONFIG_SENSORS_EEPROM=y
CONFIG_SENSORS_PCF8574=y
CONFIG_PCF8575=y
CONFIG_SENSORS_PCF8591=y
CONFIG_TPS65010=y
CONFIG_SENSORS_MAX6875=y
CONFIG_SENSORS_TSL2550=y
CONFIG_I2C_DEBUG_CORE=y
CONFIG_I2C_DEBUG_ALGO=y
# CONFIG_I2C_DEBUG_BUS is not set
CONFIG_I2C_DEBUG_CHIP=y
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_GPIO_SYSFS is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX732X is not set
CONFIG_GPIO_PCA953X=y
# CONFIG_GPIO_PCF857X is not set

#
# PCI GPIO expanders:
#
CONFIG_GPIO_BT8XX=y

#
# SPI GPIO expanders:
#
CONFIG_W1=y
# CONFIG_W1_CON is not set

#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
CONFIG_W1_MASTER_DS2482=y
CONFIG_W1_MASTER_GPIO=y

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=y
# CONFIG_W1_SLAVE_SMEM is not set
# CONFIG_W1_SLAVE_DS2433 is not set
CONFIG_W1_SLAVE_DS2760=y
CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_DEBUG=y
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_HWMON is not set
CONFIG_THERMAL=y
# CONFIG_WATCHDOG is not set

#
# Sonics Silicon Backplane
#
CONFIG_SSB_POSSIBLE=y
CONFIG_SSB=y
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_SILENT=y
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
CONFIG_MFD_SM501=y
# CONFIG_MFD_SM501_GPIO is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
# CONFIG_VIDEO_DEV is not set
CONFIG_DVB_CORE=y
CONFIG_VIDEO_MEDIA=y

#
# Multimedia drivers
#
CONFIG_MEDIA_TUNER=y
CONFIG_MEDIA_TUNER_CUSTOMIZE=y
# CONFIG_MEDIA_TUNER_SIMPLE is not set
CONFIG_MEDIA_TUNER_TDA8290=y
CONFIG_MEDIA_TUNER_TDA827X=y
CONFIG_MEDIA_TUNER_TDA18271=y
CONFIG_MEDIA_TUNER_TDA9887=y
CONFIG_MEDIA_TUNER_TEA5761=y
CONFIG_MEDIA_TUNER_TEA5767=y
# CONFIG_MEDIA_TUNER_MT20XX is not set
CONFIG_MEDIA_TUNER_MT2060=y
CONFIG_MEDIA_TUNER_MT2266=y
CONFIG_MEDIA_TUNER_MT2131=y
CONFIG_MEDIA_TUNER_QT1010=y
CONFIG_MEDIA_TUNER_XC2028=y
# CONFIG_MEDIA_TUNER_XC5000 is not set
CONFIG_MEDIA_TUNER_MXL5005S=y
CONFIG_MEDIA_TUNER_MXL5007T=y
# CONFIG_DVB_CAPTURE_DRIVERS is not set
CONFIG_DAB=y
CONFIG_USB_DABUSB=y

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
# CONFIG_AGP_VIA is not set
# CONFIG_DRM is not set
CONFIG_VGASTATE=y
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
CONFIG_FB_FOREIGN_ENDIAN=y
CONFIG_FB_BOTH_ENDIAN=y
# CONFIG_FB_BIG_ENDIAN is not set
# CONFIG_FB_LITTLE_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_HECUBA=y
CONFIG_FB_SVGALIB=y
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
CONFIG_FB_PM2=y
CONFIG_FB_PM2_FIFO_DISCONNECT=y
CONFIG_FB_CYBER2000=y
CONFIG_FB_ARC=y
CONFIG_FB_IMSTT=y
CONFIG_FB_UVESA=y
# CONFIG_FB_EFI is not set
# CONFIG_FB_IMAC is not set
CONFIG_FB_N411=y
# CONFIG_FB_HGA is not set
CONFIG_FB_S1D13XXX=y
CONFIG_FB_NVIDIA=y
CONFIG_FB_NVIDIA_I2C=y
CONFIG_FB_NVIDIA_DEBUG=y
CONFIG_FB_NVIDIA_BACKLIGHT=y
# CONFIG_FB_RIVA is not set
# CONFIG_FB_LE80578 is not set
CONFIG_FB_INTEL=y
CONFIG_FB_INTEL_DEBUG=y
# CONFIG_FB_INTEL_I2C is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_ATY128 is not set
CONFIG_FB_ATY=y
# CONFIG_FB_ATY_CT is not set
CONFIG_FB_ATY_GX=y
# CONFIG_FB_ATY_BACKLIGHT is not set
CONFIG_FB_S3=y
CONFIG_FB_SAVAGE=y
CONFIG_FB_SAVAGE_I2C=y
# CONFIG_FB_SAVAGE_ACCEL is not set
# CONFIG_FB_SIS is not set
CONFIG_FB_NEOMAGIC=y
# CONFIG_FB_KYRO is not set
CONFIG_FB_3DFX=y
CONFIG_FB_3DFX_ACCEL=y
CONFIG_FB_VOODOO1=y
CONFIG_FB_VT8623=y
# CONFIG_FB_TRIDENT is not set
CONFIG_FB_ARK=y
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
CONFIG_FB_GEODE=y
CONFIG_FB_GEODE_LX=y
CONFIG_FB_GEODE_GX=y
CONFIG_FB_GEODE_GX1=y
CONFIG_FB_SM501=y
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_CORGI=y
# CONFIG_BACKLIGHT_PROGEAR is not set
CONFIG_BACKLIGHT_MBP_NVIDIA=y

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=y

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_DEBUG=y
# CONFIG_HIDRAW is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_USB_HIDINPUT_POWERBOOK=y
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set
CONFIG_USB_MOUSE=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
CONFIG_USB_DEBUG=y
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
CONFIG_USB_DYNAMIC_MINORS=y
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_OTG_WHITELIST=y
CONFIG_USB_OTG_BLACKLIST_HUB=y
CONFIG_USB_MON=y

#
# USB Host Controller Drivers
#
CONFIG_USB_C67X00_HCD=y
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_ISP116X_HCD=y
CONFIG_USB_ISP1760_HCD=y
# CONFIG_USB_ISP1760_PCI is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_SSB=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=y
# CONFIG_USB_SL811_HCD is not set
CONFIG_USB_R8A66597_HCD=y

#
# USB Device Class drivers
#
CONFIG_USB_ACM=y
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
# CONFIG_USB_STORAGE is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
CONFIG_USB_MDC800=y
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_EZUSB=y
# CONFIG_USB_SERIAL_GENERIC is not set
CONFIG_USB_SERIAL_AIRCABLE=y
CONFIG_USB_SERIAL_ARK3116=y
CONFIG_USB_SERIAL_BELKIN=y
CONFIG_USB_SERIAL_CH341=y
# CONFIG_USB_SERIAL_WHITEHEAT is not set
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=y
CONFIG_USB_SERIAL_CP2101=y
CONFIG_USB_SERIAL_CYPRESS_M8=y
CONFIG_USB_SERIAL_EMPEG=y
CONFIG_USB_SERIAL_FTDI_SIO=y
CONFIG_USB_SERIAL_FUNSOFT=y
CONFIG_USB_SERIAL_VISOR=y
# CONFIG_USB_SERIAL_IPAQ is not set
CONFIG_USB_SERIAL_IR=y
# CONFIG_USB_SERIAL_EDGEPORT is not set
CONFIG_USB_SERIAL_EDGEPORT_TI=y
CONFIG_USB_SERIAL_GARMIN=y
CONFIG_USB_SERIAL_IPW=y
# CONFIG_USB_SERIAL_IUU is not set
CONFIG_USB_SERIAL_KEYSPAN_PDA=y
CONFIG_USB_SERIAL_KEYSPAN=y
# CONFIG_USB_SERIAL_KEYSPAN_MPR is not set
CONFIG_USB_SERIAL_KEYSPAN_USA28=y
CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
CONFIG_USB_SERIAL_KEYSPAN_USA19=y
# CONFIG_USB_SERIAL_KEYSPAN_USA18X is not set
CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
# CONFIG_USB_SERIAL_KEYSPAN_USA49W is not set
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
CONFIG_USB_SERIAL_KLSI=y
CONFIG_USB_SERIAL_KOBIL_SCT=y
CONFIG_USB_SERIAL_MCT_U232=y
CONFIG_USB_SERIAL_MOS7720=y
CONFIG_USB_SERIAL_MOS7840=y
# CONFIG_USB_SERIAL_MOTOROLA is not set
CONFIG_USB_SERIAL_NAVMAN=y
CONFIG_USB_SERIAL_PL2303=y
CONFIG_USB_SERIAL_OTI6858=y
# CONFIG_USB_SERIAL_SPCP8X5 is not set
CONFIG_USB_SERIAL_HP4X=y
# CONFIG_USB_SERIAL_SAFE is not set
CONFIG_USB_SERIAL_SIERRAWIRELESS=y
# CONFIG_USB_SERIAL_TI is not set
CONFIG_USB_SERIAL_CYBERJACK=y
# CONFIG_USB_SERIAL_XIRCOM is not set
CONFIG_USB_SERIAL_OPTION=y
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=y
# CONFIG_USB_EMI26 is not set
CONFIG_USB_ADUTUX=y
# CONFIG_USB_RIO500 is not set
CONFIG_USB_LEGOTOWER=y
# CONFIG_USB_LCD is not set
CONFIG_USB_BERRY_CHARGE=y
# CONFIG_USB_LED is not set
CONFIG_USB_CYPRESS_CY7C63=y
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_PHIDGET=y
CONFIG_USB_PHIDGETKIT=y
# CONFIG_USB_PHIDGETMOTORCONTROL is not set
# CONFIG_USB_PHIDGETSERVO is not set
CONFIG_USB_IDMOUSE=y
CONFIG_USB_FTDI_ELAN=y
# CONFIG_USB_APPLEDISPLAY is not set
CONFIG_USB_SISUSBVGA=y
# CONFIG_USB_SISUSBVGA_CON is not set
CONFIG_USB_LD=y
# CONFIG_USB_TRANCEVIBRATOR is not set
CONFIG_USB_IOWARRIOR=y
CONFIG_USB_TEST=y
# CONFIG_USB_ISIGHTFW is not set
CONFIG_USB_ATM=y
# CONFIG_USB_SPEEDTOUCH is not set
CONFIG_USB_CXACRU=y
CONFIG_USB_UEAGLEATM=y
# CONFIG_USB_XUSBATM is not set
# CONFIG_MMC is not set
CONFIG_MEMSTICK=y
CONFIG_MEMSTICK_DEBUG=y

#
# MemoryStick drivers
#
CONFIG_MEMSTICK_UNSAFE_RESUME=y
# CONFIG_MSPRO_BLOCK is not set

#
# MemoryStick Host Controller Drivers
#
CONFIG_MEMSTICK_TIFM_MS=y
CONFIG_MEMSTICK_JMICRON_38X=y
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
CONFIG_LEDS_PCA9532=y
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
CONFIG_LEDS_PCA955X=y

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=y
CONFIG_LEDS_TRIGGER_DEFAULT_ON=y
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
CONFIG_INFINIBAND=y
# CONFIG_INFINIBAND_USER_MAD is not set
# CONFIG_INFINIBAND_USER_ACCESS is not set
CONFIG_INFINIBAND_ADDR_TRANS=y
# CONFIG_INFINIBAND_MTHCA is not set
CONFIG_INFINIBAND_IPATH=y
CONFIG_INFINIBAND_AMSO1100=y
# CONFIG_INFINIBAND_AMSO1100_DEBUG is not set
CONFIG_MLX4_INFINIBAND=y
CONFIG_INFINIBAND_NES=y
CONFIG_INFINIBAND_NES_DEBUG=y
CONFIG_INFINIBAND_IPOIB=y
# CONFIG_INFINIBAND_IPOIB_CM is not set
# CONFIG_INFINIBAND_IPOIB_DEBUG is not set
# CONFIG_INFINIBAND_SRP is not set
CONFIG_INFINIBAND_ISER=y
CONFIG_EDAC=y

#
# Reporting subsystems
#
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_MM_EDAC=y
CONFIG_EDAC_E752X=y
CONFIG_EDAC_I82975X=y
CONFIG_EDAC_I3000=y
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_RTC_CLASS is not set
CONFIG_DMADEVICES=y

#
# DMA Devices
#
CONFIG_INTEL_IOATDMA=y
CONFIG_DMA_ENGINE=y

#
# DMA Clients
#
# CONFIG_NET_DMA is not set
CONFIG_DMATEST=y
CONFIG_DCA=y
CONFIG_UIO=y
# CONFIG_UIO_CIF is not set
CONFIG_UIO_PDRV=y
CONFIG_UIO_PDRV_GENIRQ=y
CONFIG_UIO_SMX=y

#
# Firmware Drivers
#
CONFIG_EDD=y
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
CONFIG_DELL_RBU=y
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
CONFIG_ISCSI_IBFT_FIND=y
CONFIG_ISCSI_IBFT=y

#
# File systems
#
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
CONFIG_JBD_DEBUG=y
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
CONFIG_REISERFS_CHECK=y
# CONFIG_REISERFS_PROC_INFO is not set
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_XFS_FS=y
# CONFIG_XFS_QUOTA is not set
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
CONFIG_GFS2_FS=y
CONFIG_GFS2_FS_LOCKING_DLM=y
# CONFIG_OCFS2_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=y
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
# CONFIG_PROC_KCORE is not set
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=y

#
# Miscellaneous filesystems
#
CONFIG_ADFS_FS=y
# CONFIG_ADFS_FS_RW is not set
CONFIG_AFFS_FS=y
CONFIG_HFS_FS=y
CONFIG_HFSPLUS_FS=y
CONFIG_BEFS_FS=y
# CONFIG_BEFS_DEBUG is not set
# CONFIG_BFS_FS is not set
CONFIG_EFS_FS=y
CONFIG_CRAMFS=y
CONFIG_VXFS_FS=y
CONFIG_MINIX_FS=y
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_SYSV_FS=y
CONFIG_UFS_FS=y
CONFIG_UFS_FS_WRITE=y
CONFIG_UFS_DEBUG=y
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V3 is not set
# CONFIG_NFS_V4 is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_XPRT_RDMA=y
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_RPCSEC_GSS_SPKM3 is not set
CONFIG_SMB_FS=y
# CONFIG_SMB_NLS_DEFAULT is not set
# CONFIG_CIFS is not set
CONFIG_NCP_FS=y
CONFIG_NCPFS_PACKET_SIGNING=y
# CONFIG_NCPFS_IOCTL_LOCKING is not set
# CONFIG_NCPFS_STRONG is not set
CONFIG_NCPFS_NFS_NS=y
CONFIG_NCPFS_OS2_NS=y
# CONFIG_NCPFS_SMALLDOS is not set
CONFIG_NCPFS_NLS=y
# CONFIG_NCPFS_EXTRAS is not set
CONFIG_CODA_FS=y
CONFIG_AFS_FS=y
CONFIG_AFS_DEBUG=y
CONFIG_9P_FS=y

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
# CONFIG_ACORN_PARTITION_CUMANA is not set
CONFIG_ACORN_PARTITION_EESOX=y
# CONFIG_ACORN_PARTITION_ICS is not set
CONFIG_ACORN_PARTITION_ADFS=y
CONFIG_ACORN_PARTITION_POWERTEC=y
CONFIG_ACORN_PARTITION_RISCIX=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
CONFIG_LDM_DEBUG=y
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=y
# CONFIG_NLS_CODEPAGE_852 is not set
CONFIG_NLS_CODEPAGE_855=y
CONFIG_NLS_CODEPAGE_857=y
CONFIG_NLS_CODEPAGE_860=y
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
CONFIG_NLS_CODEPAGE_863=y
CONFIG_NLS_CODEPAGE_864=y
CONFIG_NLS_CODEPAGE_865=y
CONFIG_NLS_CODEPAGE_866=y
CONFIG_NLS_CODEPAGE_869=y
CONFIG_NLS_CODEPAGE_936=y
CONFIG_NLS_CODEPAGE_950=y
CONFIG_NLS_CODEPAGE_932=y
CONFIG_NLS_CODEPAGE_949=y
CONFIG_NLS_CODEPAGE_874=y
CONFIG_NLS_ISO8859_8=y
# CONFIG_NLS_CODEPAGE_1250 is not set
CONFIG_NLS_CODEPAGE_1251=y
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
CONFIG_NLS_ISO8859_2=y
CONFIG_NLS_ISO8859_3=y
CONFIG_NLS_ISO8859_4=y
CONFIG_NLS_ISO8859_5=y
# CONFIG_NLS_ISO8859_6 is not set
CONFIG_NLS_ISO8859_7=y
CONFIG_NLS_ISO8859_9=y
CONFIG_NLS_ISO8859_13=y
CONFIG_NLS_ISO8859_14=y
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
CONFIG_NLS_KOI8_U=y
CONFIG_NLS_UTF8=y
CONFIG_DLM=y
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
# CONFIG_DEBUG_KERNEL is not set
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_MEMORY_INIT is not set
# CONFIG_RCU_CPU_STALL is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y
CONFIG_SYSPROF_TRACER=y
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_FIREWIRE_OHCI_REMOTE_DMA=y
CONFIG_BUILD_DOCSRC=y
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_STRICT_DEVMEM=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
CONFIG_SECURITY_FILE_CAPABILITIES=y
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
# CONFIG_SECURITY_SELINUX_BOOTPARAM is not set
# CONFIG_SECURITY_SELINUX_DISABLE is not set
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX=y
CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX_VALUE=19
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_GF128MUL=y
# CONFIG_CRYPTO_NULL is not set
CONFIG_CRYPTO_CRYPTD=y
CONFIG_CRYPTO_AUTHENC=y

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
CONFIG_CRYPTO_GCM=y
CONFIG_CRYPTO_SEQIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
CONFIG_CRYPTO_PCBC=y
CONFIG_CRYPTO_XTS=y

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=y

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_RMD128 is not set
CONFIG_CRYPTO_RMD160=y
CONFIG_CRYPTO_RMD256=y
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA256 is not set
CONFIG_CRYPTO_SHA512=y
CONFIG_CRYPTO_TGR192=y
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_X86_64=y
CONFIG_CRYPTO_ANUBIS=y
CONFIG_CRYPTO_ARC4=y
CONFIG_CRYPTO_BLOWFISH=y
CONFIG_CRYPTO_CAMELLIA=y
CONFIG_CRYPTO_CAST5=y
CONFIG_CRYPTO_CAST6=y
CONFIG_CRYPTO_DES=y
CONFIG_CRYPTO_FCRYPT=y
CONFIG_CRYPTO_KHAZAD=y
CONFIG_CRYPTO_SALSA20=y
CONFIG_CRYPTO_SALSA20_X86_64=y
CONFIG_CRYPTO_SEED=y
CONFIG_CRYPTO_SERPENT=y
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_LZO=y
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
# CONFIG_VIRTIO_PCI is not set
CONFIG_VIRTIO_BALLOON=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_FORCE_SUCCESSFUL_BUILD=y
CONFIG_FORCE_MINIMAL_CONFIG=y
CONFIG_FORCE_MINIMAL_CONFIG_64=y
CONFIG_FORCE_MINIMAL_CONFIG_PHYS=y

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02 15:50                                                     ` Ingo Molnar
@ 2008-10-02 18:27                                                       ` Steven Rostedt
  2008-10-02 18:55                                                         ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 18:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Thu, 2 Oct 2008, Ingo Molnar wrote:

> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > 
> > On Thu, 2 Oct 2008, Ingo Molnar wrote:
> > > * Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > full serial log and config attached. I'm excluding these latest commits 
> > 
> >  -ENOATTACHMENT
> 
> attached.
> 
> You can get the broken tree by doing this in tip/master:
> 
>   git-merge tip/tracing/ring-buffer

I've just checked out tip/tracing/ring-buffer. That tree is still broken 
too, right? Or do I need to merge it to get the broken version?

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [boot crash] Re: [PATCH] ring-buffer: fix build error
  2008-10-02 18:27                                                       ` Steven Rostedt
@ 2008-10-02 18:55                                                         ` Ingo Molnar
  0 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-10-02 18:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> 
> On Thu, 2 Oct 2008, Ingo Molnar wrote:
> 
> > 
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > 
> > > On Thu, 2 Oct 2008, Ingo Molnar wrote:
> > > > * Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > > > full serial log and config attached. I'm excluding these latest commits 
> > > 
> > >  -ENOATTACHMENT
> > 
> > attached.
> > 
> > You can get the broken tree by doing this in tip/master:
> > 
> >   git-merge tip/tracing/ring-buffer
> 
> I've just checked out tip/tracing/ring-buffer. That tree is still broken 
> too, right? Or do I need to merge it to get the broken version?

yes, that's very likely broken in a standalone way as well - but to 
get the exact tree i tested i'd suggest:

  git checkout tip/master
  git merge tip/tracing/ring-buffer

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] ring_buffer: map to cpu not page
  2008-10-02  9:38                                                 ` [boot crash] " Ingo Molnar
  2008-10-02 13:16                                                   ` Steven Rostedt
  2008-10-02 13:17                                                   ` Steven Rostedt
@ 2008-10-02 23:18                                                   ` Steven Rostedt
  2008-10-02 23:36                                                     ` Steven Rostedt
                                                                       ` (2 more replies)
  2 siblings, 3 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 23:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


My original patch had a compile bug when NUMA was configured. I
referenced cpu when it should have been cpu_buffer->cpu.

Ingo quickly fixed this bug by replacing cpu with 'i' because that
was the loop counter. Unfortunately, the 'i' was the counter of
pages, not CPUs. This caused a crash when the number of pages allocated
for the buffers exceeded the number of pages, which would usually
be the case.
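
To make the failure mode concrete, here is a sketch with made-up numbers
(the 4 CPUs and 128 pages are assumptions for illustration, not taken from
any report):

	/*
	 * Illustration only: assume nr_cpu_ids == 4 and nr_pages == 128.
	 * The broken loop effectively did:
	 */
	for (i = 0; i < nr_pages; i++)
		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
				    GFP_KERNEL, cpu_to_node(i));
	/*
	 * cpu_to_node(i) is only meaningful for i < 4; for the remaining
	 * pages it indexes past the cpu-to-node map and hands kzalloc_node()
	 * a bogus node id, which is what crashed the boot.
	 */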

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-02 09:09:01.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-02 18:58:44.000000000 -0400
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring
 
 	for (i = 0; i < nr_pages; i++) {
 		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(i));
+				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
 		if (!page)
 			goto free_pages;
 		list_add(&page->list, &pages);



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: map to cpu not page
  2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
@ 2008-10-02 23:36                                                     ` Steven Rostedt
  2008-10-03  4:56                                                     ` [PATCH] x86 Topology cpu_to_node parameter check Mathieu Desnoyers
  2008-10-03  7:27                                                     ` [PATCH] ring_buffer: map to cpu not page Ingo Molnar
  2 siblings, 0 replies; 102+ messages in thread
From: Steven Rostedt @ 2008-10-02 23:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Thu, 2 Oct 2008, Steven Rostedt wrote:

> 
> My original patch had a compile bug when NUMA was configured. I
> referenced cpu when it should have been cpu_buffer->cpu.
> 
> Ingo quickly fixed this bug by replacing cpu with 'i' because that
> was the loop counter. Unfortunately, the 'i' was the counter of
> pages, not CPUs. This caused a crash when the number of pages allocated
> for the buffers exceeded the number of pages, which would usually

That should have been:

 "when the number of pages allocated for the buffers exceeded the number
  of cpus".

> be the case.
> 

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
  2008-10-02 23:36                                                     ` Steven Rostedt
@ 2008-10-03  4:56                                                     ` Mathieu Desnoyers
  2008-10-03  5:20                                                       ` Steven Rostedt
  2008-10-03  7:27                                                     ` [PATCH] ring_buffer: map to cpu not page Ingo Molnar
  2 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03  4:56 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

Declare NUMA-less cpu_to_node with a check that the cpu parameter exists, so
that people without NUMA test configs (namely Steven Rostedt and myself, who
both ran into this error on the same day with different implementations)
stop making this trivial mistake.
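
To spell out why the comma expression catches the mistake, a small sketch
(the pick_node() caller and the misspelled 'cpuu' are made up for
illustration): the argument is now evaluated, so a misspelled or
out-of-scope variable no longer compiles, while the whole expression still
folds to the constant 0.

	#define cpu_to_node(cpu)	((void)(cpu),0)

	static int pick_node(int cpu)
	{
		return cpu_to_node(cpu);	/* still just returns 0 */
	}
	/*
	 * return cpu_to_node(cpuu); would now fail with "'cpuu' undeclared",
	 * whereas the old "#define cpu_to_node(cpu) 0" accepted it silently
	 * because the argument was never looked at.
	 */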

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/topology.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 00:37:05.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 00:45:52.000000000 -0400
@@ -182,9 +182,9 @@ extern int __node_distance(int, int);
 
 #else /* !CONFIG_NUMA */
 
-#define numa_node_id()		0
-#define	cpu_to_node(cpu)	0
-#define	early_cpu_to_node(cpu)	0
+#define	numa_node_id()		0
+#define	cpu_to_node(cpu)	((void)(cpu),0)
+#define	early_cpu_to_node(cpu)	cpu_to_node(cpu)
 
 static inline const cpumask_t *_node_to_cpumask_ptr(int node)
 {
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-03  4:56                                                     ` [PATCH] x86 Topology cpu_to_node parameter check Mathieu Desnoyers
@ 2008-10-03  5:20                                                       ` Steven Rostedt
  2008-10-03 15:56                                                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-10-03  5:20 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	LKML, Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:

> ---
>  include/asm-x86/topology.h |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6-lttng/include/asm-x86/topology.h
> ===================================================================
> --- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 00:37:05.000000000 -0400
> +++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 00:45:52.000000000 -0400
> @@ -182,9 +182,9 @@ extern int __node_distance(int, int);
>  
>  #else /* !CONFIG_NUMA */
>  
> -#define numa_node_id()		0
> -#define	cpu_to_node(cpu)	0
> -#define	early_cpu_to_node(cpu)	0
> +#define	numa_node_id()		0
> +#define	cpu_to_node(cpu)	((void)(cpu),0)
> +#define	early_cpu_to_node(cpu)	cpu_to_node(cpu)

Actually the proper way would be to have:

static inline int cpu_to_node(int cpu)
{
	return 0;
}

static inline int early_cpu_to_node(int cpu)
{
	return 0;
}

This way you also get typechecks.
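
For instance (a minimal sketch; example() and its void * argument are
hypothetical): with the inline, passing something that is not an int gets
diagnosed even on !NUMA builds, whereas a bare "#define cpu_to_node(cpu) 0"
never looks at its argument at all.

	static inline int cpu_to_node(int cpu)
	{
		return 0;
	}

	int example(void *p)
	{
		return cpu_to_node(p);	/* warning: pointer passed where an
					 * int is expected; the old define
					 * would let this sail through */
	}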

-- Steve

>  
>  static inline const cpumask_t *_node_to_cpumask_ptr(int node)
>  {
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] ring_buffer: map to cpu not page
  2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
  2008-10-02 23:36                                                     ` Steven Rostedt
  2008-10-03  4:56                                                     ` [PATCH] x86 Topology cpu_to_node parameter check Mathieu Desnoyers
@ 2008-10-03  7:27                                                     ` Ingo Molnar
  2 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2008-10-03  7:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	Mathieu Desnoyers, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


* Steven Rostedt <rostedt@goodmis.org> wrote:

> My original patch had a compile bug when NUMA was configured. I 
> referenced cpu when it should have been cpu_buffer->cpu.
> 
> Ingo quickly fixed this bug by replacing cpu with 'i' because that was 
> the loop counter. Unfortunately, the 'i' was the counter of pages, not 
> CPUs. This caused a crash when the number of pages allocated for the 
> buffers exceeded the number of pages, which would usually be the case.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>

>  	for (i = 0; i < nr_pages; i++) {
>  		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
> -				    GFP_KERNEL, cpu_to_node(i));
> +				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));

oh, stupid typo of the year :-)

applied to tip/tracing/ring-buffer, thanks for tracking it down! I've 
reactivated the topic branch for tip/master and i'm running a few tests 
before pushing it out for wider testing.

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-03  5:20                                                       ` Steven Rostedt
@ 2008-10-03 15:56                                                         ` Mathieu Desnoyers
  2008-10-03 16:26                                                           ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 15:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	LKML, Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:
> 
> > ---
> >  include/asm-x86/topology.h |    6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > Index: linux-2.6-lttng/include/asm-x86/topology.h
> > ===================================================================
> > --- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 00:37:05.000000000 -0400
> > +++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 00:45:52.000000000 -0400
> > @@ -182,9 +182,9 @@ extern int __node_distance(int, int);
> >  
> >  #else /* !CONFIG_NUMA */
> >  
> > -#define numa_node_id()		0
> > -#define	cpu_to_node(cpu)	0
> > -#define	early_cpu_to_node(cpu)	0
> > +#define	numa_node_id()		0
> > +#define	cpu_to_node(cpu)	((void)(cpu),0)
> > +#define	early_cpu_to_node(cpu)	cpu_to_node(cpu)
> 
> Actually the proper way would be to have:
> 
> static inline int cpu_to_node(int cpu)
> {
> 	return 0;
> }
> 
> static inline int early_cpu_to_node(int cpu)
> {
> 	return 0;
> }
> 
> This way you also get typechecks.
> 

That's how I did it first, but then I looked at asm-generic/topology.h
and saw that it uses #defines. Should we change them too ?

Mathieu

> -- Steve
> 
> >  
> >  static inline const cpumask_t *_node_to_cpumask_ptr(int node)
> >  {
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-03 15:56                                                         ` Mathieu Desnoyers
@ 2008-10-03 16:26                                                           ` Steven Rostedt
  2008-10-03 17:21                                                             ` Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-10-03 16:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	LKML, Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo


On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:
> 
> That's how I did it first, but then I looked at asm-generic/topology.h
> and saw that it uses #defines. Should we change them too ?
> 

The old way of doing this is with defines. But all new code should be 
static inline functions when feasible. This way we can get typechecking 
on the parameters even when the configuration is disabled.

Even if the rest of the file uses defines, the new code should be
static inlines. Eventually, even the old defines will be converted.

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-03 16:26                                                           ` Steven Rostedt
@ 2008-10-03 17:21                                                             ` Mathieu Desnoyers
  2008-10-03 17:54                                                               ` Steven Rostedt
  0 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 17:21 UTC (permalink / raw)
  To: Steven Rostedt, colpatch
  Cc: Ingo Molnar, Linus Torvalds, Peter Zijlstra, Jonathan Corbet,
	LKML, Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:
> > 
> > That's how I did it first, but then I looked at asm-generic/topology.h
> > and have seen it uses #defines. Should we change them too ?
> > 
> 
> The old way of doing this is with defines. But all new code should be 
> static inline functions when feasible. This way we can get typechecking 
> on the parameters even when the configuration is disabled.
> 
> Even if the rest of the file uses defines, the new code should be
> static inlines. Eventually, even the old defines will be converted.
> 
> -- Steve
> 

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtle interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decides for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?
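
Spelled out (a minimal reproduction, not the actual headers line for line):
#ifndef only tests for a preprocessor symbol, and a static inline does not
define one, so the generic fallback macro gets installed anyway and shadows
the inline at every call site.

	/* asm/topology.h */
	static inline int node_to_first_cpu(int node)
	{
		return first_cpu(cpu_online_map);
	}

	/* asm-generic/topology.h, included afterwards */
	#ifndef node_to_first_cpu		/* true: no macro by that name */
	#define node_to_first_cpu(node)	((void)(node),0)
	#endif

	/* every later node_to_first_cpu(n) now expands to ((void)(n),0);
	 * the arch inline above is never called */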

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH] x86 Topology cpu_to_node parameter check
  2008-10-03 17:21                                                             ` Mathieu Desnoyers
@ 2008-10-03 17:54                                                               ` Steven Rostedt
  2008-10-03 18:53                                                                 ` [PATCH] topology.h define mess fix Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Steven Rostedt @ 2008-10-03 17:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: colpatch, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Jonathan Corbet, LKML, Thomas Gleixner, Andrew Morton, prasad,
	Frank Ch. Eigler, David Wilder, hch, Martin Bligh,
	Christoph Hellwig, Masami Hiramatsu, Steven Rostedt,
	Arnaldo Carvalho de Melo


On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:

> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > 
> > On Fri, 3 Oct 2008, Mathieu Desnoyers wrote:
> > > 
> > > That's how I did it first, but then I looked at asm-generic/topology.h
> > > and saw that it uses #defines. Should we change them too ?
> > > 
> > 
> > The old way of doing this is with defines. But all new code should be 
> > static inline functions when feasible. This way we can get typechecking 
> > on the parameters even when the configuration is disabled.
> > 
> > Even if the rest of the file uses defines, the new code should be
> > static inlines. Eventually, even the old defines will be converted.
> > 
> > -- Steve
> > 
> 
> Argh, I think topology.h is utterly broken :-(
> 
> Have you noticed the subtle interaction between the
> 
> include/asm-x86/topology.h :
> 
> #define numa_node_id()          0
> #define cpu_to_node(cpu)        0
> #define early_cpu_to_node(cpu)  0
> ...
> #include <asm-generic/topology.h>
> 
> 
> and
> include/asm-generic/topology.h :
> #ifndef cpu_to_node
> #define cpu_to_node(cpu)        ((void)(cpu),0)
> #endif
> 
> If any architecture decides for some reason to use a static inline rather
> than a define, as currently done with node_to_first_cpu :
> 
> include/asm-x86/topology.h :
> static inline int node_to_first_cpu(int node)
> {
>         return first_cpu(cpu_online_map);
> }
> ...
> #include <asm-generic/topology.h>
> 
> include/asm-generic/topology.h :
> #ifndef node_to_first_cpu
> #define node_to_first_cpu(node) ((void)(node),0)
> #endif
> 
> (which will override the static inline !)
> 
> It results in an override of the arch-specific version. Nice eh ?

Seems that they expect cpu_to_node to be a macro if NUMA is not 
configured.

Actually, since the asm-generic/topology.h version does reference the cpu
parameter (although as a define, not an inline), the solution here is to
simply remove the

#define cpu_to_node(cpu) 0

And we can still make the early_cpu_to_node a static inline since it is 
not referenced in the generic code.
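
Roughly, a sketch of that suggestion (untested, and assuming nothing else in
the file depends on the define) for the !CONFIG_NUMA block of
include/asm-x86/topology.h:

	#define numa_node_id()		0
	/*
	 * No cpu_to_node() define here: fall through to
	 * asm-generic/topology.h, whose ((void)(cpu),0) at least
	 * evaluates the argument.
	 */

	static inline int early_cpu_to_node(int cpu)
	{
		return 0;
	}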

-- Steve


^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] topology.h define mess fix
  2008-10-03 17:54                                                               ` Steven Rostedt
@ 2008-10-03 18:53                                                                 ` Mathieu Desnoyers
  2008-10-03 20:14                                                                   ` Luck, Tony
  0 siblings, 1 reply; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 18:53 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, rth, tony.luck, paulus, benh, lethal
  Cc: colpatch, Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Steven Rostedt (rostedt@goodmis.org) wrote:

> Seems that they expect cpu_to_node to be a macro if NUMA is not 
> configured.
> 
> Actually, since the asm-generic/topology.h version does reference the cpu
> parameter (although as a define, not an inline), the solution here is to
> simply remove the
> 
> #define cpu_to_node(cpu) 0
> 
> And we can still make the early_cpu_to_node a static inline since it is 
> not referenced in the generic code.
> 
> -- Steve
> 

Or we take a deep breath and clean this up ?

Ingo, I build-tested this on x86_64 (with and without NUMA), x86_32,
powerpc, arm and mips. It applies to both -tip and 2.6.27-rc8. Could it
be pulled into -tip for further testing ?

Note that checkpatch.pl emits a warning telling me to modify include/asm-*/
files (which do not exist in my tree) rather than arch/*/include/asm/. Any
idea why ?

Thanks,

Mathieu


topology.h define mess fix

Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists, so that people without NUMA test configs (namely Steven
Rostedt and myself, who both ran into this error on the same day with
different implementations) stop making this trivial mistake.

End result :

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtle interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decides for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?

This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.

- alpha needs careful checking, as it did not implement parent_node nor
  node_to_first_cpu previously.
- Major cross-architecture build test is required.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: rth@twiddle.net
CC: tony.luck@intel.com
CC: paulus@samba.org
CC: benh@kernel.crashing.org
CC: lethal@linux-sh.org
---
 arch/alpha/include/asm/topology.h   |   38 +++++++++++++++++++
 arch/ia64/include/asm/topology.h    |   16 ++++----
 arch/powerpc/include/asm/topology.h |   12 +++++-
 arch/sh/include/asm/topology.h      |   11 -----
 include/asm-generic/topology.h      |   70 ++++++++++++++++++++----------------
 include/asm-x86/topology.h          |   66 +++++++++++++++++++++++++--------
 6 files changed, 144 insertions(+), 69 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 14:41:05.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 14:41:12.000000000 -0400
@@ -38,6 +38,8 @@
 /* Node not present */
 #define NUMA_NO_NODE	(-1)
 
+struct pci_bus;
+
 #ifdef CONFIG_NUMA
 #include <linux/cpumask.h>
 #include <asm/mpspec.h>
@@ -116,7 +118,6 @@ static inline cpumask_t node_to_cpumask(
 
 #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-/* Replace default node_to_cpumask_ptr with optimized version */
 #define node_to_cpumask_ptr(v, node)		\
 		const cpumask_t *v = _node_to_cpumask_ptr(node)
 
@@ -129,8 +130,14 @@ static inline cpumask_t node_to_cpumask(
  * Returns the number of the node containing Node 'node'. This
  * architecture is flat, so it is a pretty simple function!
  */
-#define parent_node(node) (node)
+static inline int parent_node(int node)
+{
+	return node;
+}
 
+/*
+ * Leave those as defines so we don't have to include linux/pci.h.
+ */
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 #define pcibus_to_cpumask(bus) __pcibus_to_cpumask(bus)
 
@@ -180,42 +187,67 @@ extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 #endif
 
+/* Returns the number of the first CPU on Node 'node'. */
+static inline int node_to_first_cpu(int node)
+{
+	node_to_cpumask_ptr(mask, node);
+	return first_cpu(*mask);
+}
+
 #else /* !CONFIG_NUMA */
 
-#define numa_node_id()		0
-#define	cpu_to_node(cpu)	0
-#define	early_cpu_to_node(cpu)	0
+static inline int numa_node_id(void)
+{
+	return 0;
+}
 
-static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+/*
+ * We override asm-generic/topology.h.
+ */
+static inline int cpu_to_node(int cpu)
 {
-	return &cpu_online_map;
+	return 0;
 }
+
+static inline int parent_node(int node)
+{
+	return 0;
+}
+
 static inline cpumask_t node_to_cpumask(int node)
 {
 	return cpu_online_map;
 }
+
 static inline int node_to_first_cpu(int node)
 {
 	return first_cpu(cpu_online_map);
 }
 
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
+
+static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+{
+	return &cpu_online_map;
+}
+
 /* Replace default node_to_cpumask_ptr with optimized version */
 #define node_to_cpumask_ptr(v, node)		\
 		const cpumask_t *v = _node_to_cpumask_ptr(node)
 
 #define node_to_cpumask_ptr_next(v, node)	\
 			   v = _node_to_cpumask_ptr(node)
-#endif
-
-#include <asm-generic/topology.h>
 
-#ifdef CONFIG_NUMA
-/* Returns the number of the first CPU on Node 'node'. */
-static inline int node_to_first_cpu(int node)
-{
-	node_to_cpumask_ptr(mask, node);
-	return first_cpu(*mask);
-}
 #endif
 
 extern cpumask_t cpu_coregroup_map(int cpu);
Index: linux-2.6-lttng/arch/alpha/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/alpha/include/asm/topology.h	2008-10-03 14:41:05.000000000 -0400
+++ linux-2.6-lttng/arch/alpha/include/asm/topology.h	2008-10-03 14:41:12.000000000 -0400
@@ -41,7 +41,43 @@ static inline cpumask_t node_to_cpumask(
 
 #define pcibus_to_cpumask(bus)	(cpu_online_map)
 
+struct pci_bus;
+
+static inline int parent_node(int node)
+{
+	return node;
+}
+
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
+
+/* returns pointer to cpumask for specified node */
+#define	node_to_cpumask_ptr(v, node) 					\
+		cpumask_t _##v = node_to_cpumask(node);			\
+		const cpumask_t *v = &_##v
+
+#define node_to_cpumask_ptr_next(v, node)				\
+			  _##v = node_to_cpumask(node)
+
+static inline int node_to_first_cpu(int node)
+{
+	node_to_cpumask_ptr(mask, node);
+	return first_cpu(*mask);
+}
+
+#else
+
+#include <asm-generic/topology.h>
+
 #endif /* !CONFIG_NUMA */
-# include <asm-generic/topology.h>
 
 #endif /* _ASM_ALPHA_TOPOLOGY_H */
Index: linux-2.6-lttng/arch/ia64/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/ia64/include/asm/topology.h	2008-10-03 14:41:05.000000000 -0400
+++ linux-2.6-lttng/arch/ia64/include/asm/topology.h	2008-10-03 14:41:12.000000000 -0400
@@ -104,6 +104,15 @@ void build_cpu_to_node_map(void);
 	.nr_balance_failed	= 0,			\
 }
 
+#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
+					CPU_MASK_ALL : \
+					node_to_cpumask(pcibus_to_node(bus)) \
+				)
+
+#else
+
+#include <asm-generic/topology.h>
+
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
@@ -116,11 +125,4 @@ void build_cpu_to_node_map(void);
 
 extern void arch_fix_phys_package_id(int num, u32 slot);
 
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
-
-#include <asm-generic/topology.h>
-
 #endif /* _ASM_IA64_TOPOLOGY_H */
Index: linux-2.6-lttng/arch/powerpc/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/include/asm/topology.h	2008-10-03 14:41:05.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/include/asm/topology.h	2008-10-03 14:41:12.000000000 -0400
@@ -77,6 +77,14 @@ extern void __init dump_numa_cpu_topolog
 extern int sysfs_add_device_to_node(struct sys_device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct sys_device *dev, int nid);
 
+/* returns pointer to cpumask for specified node */
+#define	node_to_cpumask_ptr(v, node) 					\
+		cpumask_t _##v = node_to_cpumask(node);			\
+		const cpumask_t *v = &_##v
+
+#define node_to_cpumask_ptr_next(v, node)				\
+			  _##v = node_to_cpumask(node)
+
 #else
 
 static inline int of_node_to_nid(struct device_node *device)
@@ -96,10 +104,10 @@ static inline void sysfs_remove_device_f
 {
 }
 
-#endif /* CONFIG_NUMA */
-
 #include <asm-generic/topology.h>
 
+#endif /* CONFIG_NUMA */
+
 #ifdef CONFIG_SMP
 #include <asm/cputable.h>
 #define smt_capable()		(cpu_has_feature(CPU_FTR_SMT))
Index: linux-2.6-lttng/arch/sh/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/sh/include/asm/topology.h	2008-10-03 14:41:05.000000000 -0400
+++ linux-2.6-lttng/arch/sh/include/asm/topology.h	2008-10-03 14:41:12.000000000 -0400
@@ -29,17 +29,6 @@
 	.nr_balance_failed	= 0,			\
 }
 
-#define cpu_to_node(cpu)	((void)(cpu),0)
-#define parent_node(node)	((void)(node),0)
-
-#define node_to_cpumask(node)	((void)node, cpu_online_map)
-#define node_to_first_cpu(node)	((void)(node),0)
-
-#define pcibus_to_node(bus)	((void)(bus), -1)
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
 #endif
 
 #include <asm-generic/topology.h>
Index: linux-2.6-lttng/include/asm-generic/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/topology.h	2008-10-03 14:41:13.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/topology.h	2008-10-03 14:41:16.000000000 -0400
@@ -27,44 +27,52 @@
 #ifndef _ASM_GENERIC_TOPOLOGY_H
 #define _ASM_GENERIC_TOPOLOGY_H
 
-#ifndef	CONFIG_NUMA
-
-/* Other architectures wishing to use this simple topology API should fill
-   in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef cpu_to_node
-#define cpu_to_node(cpu)	((void)(cpu),0)
-#endif
-#ifndef parent_node
-#define parent_node(node)	((void)(node),0)
-#endif
-#ifndef node_to_cpumask
-#define node_to_cpumask(node)	((void)node, cpu_online_map)
-#endif
-#ifndef node_to_first_cpu
-#define node_to_first_cpu(node)	((void)(node),0)
-#endif
-#ifndef pcibus_to_node
-#define pcibus_to_node(bus)	((void)(bus), -1)
-#endif
-
-#ifndef pcibus_to_cpumask
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
-#endif
-
-#endif	/* CONFIG_NUMA */
+/*
+ * Other architectures wishing to use this simple topology API should fill
+ * in the below functions as appropriate in their own <asm/topology.h> file,
+ * and _don't_ include asm-generic/topology.h.
+ */
+
+struct pci_bus;
+
+static inline int cpu_to_node(int cpu)
+{
+	return 0;
+}
+
+static inline int parent_node(int node)
+{
+	return 0;
+}
+
+static inline cpumask_t node_to_cpumask(int node)
+{
+	return cpu_online_map;
+}
+
+static inline int node_to_first_cpu(int node)
+{
+	return 0;
+}
+
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
 
 /* returns pointer to cpumask for specified node */
-#ifndef node_to_cpumask_ptr
-
 #define	node_to_cpumask_ptr(v, node) 					\
 		cpumask_t _##v = node_to_cpumask(node);			\
 		const cpumask_t *v = &_##v
 
 #define node_to_cpumask_ptr_next(v, node)				\
 			  _##v = node_to_cpumask(node)
-#endif
 
 #endif /* _ASM_GENERIC_TOPOLOGY_H */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: [PATCH] topology.h define mess fix
  2008-10-03 18:53                                                                 ` [PATCH] topology.h define mess fix Mathieu Desnoyers
@ 2008-10-03 20:14                                                                   ` Luck, Tony
  2008-10-03 22:47                                                                     ` [PATCH] topology.h define mess fix v2 Mathieu Desnoyers
  0 siblings, 1 reply; 102+ messages in thread
From: Luck, Tony @ 2008-10-03 20:14 UTC (permalink / raw)
  To: Mathieu Desnoyers, Steven Rostedt, Ingo Molnar, rth, paulus,
	benh, lethal
  Cc: colpatch, Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

> - Major cross-architecture build test is required.

Some problems on ia64.  With defconfig build (which has
CONFIG_NUMA=y) I see this:

kernel/sched.c: In function 'find_next_best_node':
kernel/sched.c:6920: error: implicit declaration of function 'node_to_cpumask_ptr'
kernel/sched.c:6920: error: '__tmp__' undeclared (first use in this function)
kernel/sched.c:6920: error: (Each undeclared identifier is reported only once
kernel/sched.c:6920: error: for each function it appears in.)
kernel/sched.c: In function 'sched_domain_node_span':
kernel/sched.c:6952: error: 'nodemask' undeclared (first use in this function)
kernel/sched.c:6953: warning: ISO C90 forbids mixed declarations and code
kernel/sched.c:6964: error: implicit declaration of function 'node_to_cpumask_ptr_next'
kernel/sched.c: In function '__build_sched_domains':
kernel/sched.c:7510: error: 'pnodemask' undeclared (first use in this function)

On an "allnoconfig" build (which curiously also has CONFIG_NUMA=y :-) I see

mm/page_alloc.c: In function 'find_next_best_node':
mm/page_alloc.c:2086: error: implicit declaration of function 'node_to_cpumask_ptr'
mm/page_alloc.c:2086: error: 'tmp' undeclared (first use in this function)
mm/page_alloc.c:2086: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:2086: error: for each function it appears in.)
mm/page_alloc.c:2107: error: implicit declaration of function 'node_to_cpumask_ptr_next'

There are most probably more errors ... but this is where the build stopped.

-Tony

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH] topology.h define mess fix v2
  2008-10-03 20:14                                                                   ` Luck, Tony
@ 2008-10-03 22:47                                                                     ` Mathieu Desnoyers
  0 siblings, 0 replies; 102+ messages in thread
From: Mathieu Desnoyers @ 2008-10-03 22:47 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Steven Rostedt, Ingo Molnar, rth, paulus, benh, lethal, colpatch,
	Linus Torvalds, Peter Zijlstra, Jonathan Corbet, LKML,
	Thomas Gleixner, Andrew Morton, prasad, Frank Ch. Eigler,
	David Wilder, hch, Martin Bligh, Christoph Hellwig,
	Masami Hiramatsu, Steven Rostedt, Arnaldo Carvalho de Melo

* Luck, Tony (tony.luck@intel.com) wrote:
> > - Major cross-architecture built test is required.
> 
> Some problems on ia64.  With defconfig build (which has
> CONFIG_NUMA=y) I see this:
> 
[...]

Ah, I did not select config "generic" for ia64, and thus did not get
CONFIG_NUMA. Here is a v2 which fixes this.

Thanks for testing this.

Mathieu


topology.h define mess fix v2

Update : build fix for ia64 CONFIG_NUMA.

Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists, so that people without NUMA test configs (namely Steven
Rostedt and myself, who both ran into this error on the same day with
different implementations) stop making this trivial mistake.

End result :

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtle interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decides for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?

This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.

- alpha needs careful checking, as it did not implement parent_node nor
  node_to_first_cpu previously.
- Major cross-architecture build test is required.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: rth@twiddle.net
CC: tony.luck@intel.com
CC: paulus@samba.org
CC: benh@kernel.crashing.org
CC: lethal@linux-sh.org
---
 arch/alpha/include/asm/topology.h   |   38 +++++++++++++++++++
 arch/ia64/include/asm/topology.h    |   24 ++++++++----
 arch/powerpc/include/asm/topology.h |   12 +++++-
 arch/sh/include/asm/topology.h      |   11 -----
 include/asm-generic/topology.h      |   70 ++++++++++++++++++++----------------
 include/asm-x86/topology.h          |   66 +++++++++++++++++++++++++--------
 6 files changed, 152 insertions(+), 69 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 17:59:12.000000000 -0400
@@ -38,6 +38,8 @@
 /* Node not present */
 #define NUMA_NO_NODE	(-1)
 
+struct pci_bus;
+
 #ifdef CONFIG_NUMA
 #include <linux/cpumask.h>
 #include <asm/mpspec.h>
@@ -116,7 +118,6 @@ static inline cpumask_t node_to_cpumask(
 
 #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-/* Replace default node_to_cpumask_ptr with optimized version */
 #define node_to_cpumask_ptr(v, node)		\
 		const cpumask_t *v = _node_to_cpumask_ptr(node)
 
@@ -129,8 +130,14 @@ static inline cpumask_t node_to_cpumask(
  * Returns the number of the node containing Node 'node'. This
  * architecture is flat, so it is a pretty simple function!
  */
-#define parent_node(node) (node)
+static inline int parent_node(int node)
+{
+	return node;
+}
 
+/*
+ * Leave those as defines so we don't have to include linux/pci.h.
+ */
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 #define pcibus_to_cpumask(bus) __pcibus_to_cpumask(bus)
 
@@ -180,42 +187,67 @@ extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 #endif
 
+/* Returns the number of the first CPU on Node 'node'. */
+static inline int node_to_first_cpu(int node)
+{
+	node_to_cpumask_ptr(mask, node);
+	return first_cpu(*mask);
+}
+
 #else /* !CONFIG_NUMA */
 
-#define numa_node_id()		0
-#define	cpu_to_node(cpu)	0
-#define	early_cpu_to_node(cpu)	0
+static inline int numa_node_id(void)
+{
+	return 0;
+}
 
-static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+/*
+ * We override asm-generic/topology.h.
+ */
+static inline int cpu_to_node(int cpu)
 {
-	return &cpu_online_map;
+	return 0;
 }
+
+static inline int parent_node(int node)
+{
+	return 0;
+}
+
 static inline cpumask_t node_to_cpumask(int node)
 {
 	return cpu_online_map;
 }
+
 static inline int node_to_first_cpu(int node)
 {
 	return first_cpu(cpu_online_map);
 }
 
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
+
+static inline const cpumask_t *_node_to_cpumask_ptr(int node)
+{
+	return &cpu_online_map;
+}
+
 /* Replace default node_to_cpumask_ptr with optimized version */
 #define node_to_cpumask_ptr(v, node)		\
 		const cpumask_t *v = _node_to_cpumask_ptr(node)
 
 #define node_to_cpumask_ptr_next(v, node)	\
 			   v = _node_to_cpumask_ptr(node)
-#endif
-
-#include <asm-generic/topology.h>
 
-#ifdef CONFIG_NUMA
-/* Returns the number of the first CPU on Node 'node'. */
-static inline int node_to_first_cpu(int node)
-{
-	node_to_cpumask_ptr(mask, node);
-	return first_cpu(*mask);
-}
 #endif
 
 extern cpumask_t cpu_coregroup_map(int cpu);
Index: linux-2.6-lttng/arch/alpha/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/alpha/include/asm/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/arch/alpha/include/asm/topology.h	2008-10-03 17:59:12.000000000 -0400
@@ -41,7 +41,43 @@ static inline cpumask_t node_to_cpumask(
 
 #define pcibus_to_cpumask(bus)	(cpu_online_map)
 
+struct pci_bus;
+
+static inline int parent_node(int node)
+{
+	return node;
+}
+
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
+
+/* returns pointer to cpumask for specified node */
+#define	node_to_cpumask_ptr(v, node) 					\
+		cpumask_t _##v = node_to_cpumask(node);			\
+		const cpumask_t *v = &_##v
+
+#define node_to_cpumask_ptr_next(v, node)				\
+			  _##v = node_to_cpumask(node)
+
+static inline int node_to_first_cpu(int node)
+{
+	node_to_cpumask_ptr(mask, node);
+	return first_cpu(*mask);
+}
+
+#else
+
+#include <asm-generic/topology.h>
+
 #endif /* !CONFIG_NUMA */
-# include <asm-generic/topology.h>
 
 #endif /* _ASM_ALPHA_TOPOLOGY_H */
Index: linux-2.6-lttng/arch/ia64/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/ia64/include/asm/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/arch/ia64/include/asm/topology.h	2008-10-03 18:36:47.000000000 -0400
@@ -104,6 +104,23 @@ void build_cpu_to_node_map(void);
 	.nr_balance_failed	= 0,			\
 }
 
+#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
+					CPU_MASK_ALL : \
+					node_to_cpumask(pcibus_to_node(bus)) \
+				)
+
+/* returns pointer to cpumask for specified node */
+#define	node_to_cpumask_ptr(v, node) 					\
+		cpumask_t _##v = node_to_cpumask(node);			\
+		const cpumask_t *v = &_##v
+
+#define node_to_cpumask_ptr_next(v, node)				\
+			  _##v = node_to_cpumask(node)
+
+#else
+
+#include <asm-generic/topology.h>
+
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
@@ -116,11 +133,4 @@ void build_cpu_to_node_map(void);
 
 extern void arch_fix_phys_package_id(int num, u32 slot);
 
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
-
-#include <asm-generic/topology.h>
-
 #endif /* _ASM_IA64_TOPOLOGY_H */
Index: linux-2.6-lttng/arch/powerpc/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/powerpc/include/asm/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/arch/powerpc/include/asm/topology.h	2008-10-03 17:59:12.000000000 -0400
@@ -77,6 +77,14 @@ extern void __init dump_numa_cpu_topolog
 extern int sysfs_add_device_to_node(struct sys_device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct sys_device *dev, int nid);
 
+/* returns pointer to cpumask for specified node */
+#define	node_to_cpumask_ptr(v, node) 					\
+		cpumask_t _##v = node_to_cpumask(node);			\
+		const cpumask_t *v = &_##v
+
+#define node_to_cpumask_ptr_next(v, node)				\
+			  _##v = node_to_cpumask(node)
+
 #else
 
 static inline int of_node_to_nid(struct device_node *device)
@@ -96,10 +104,10 @@ static inline void sysfs_remove_device_f
 {
 }
 
-#endif /* CONFIG_NUMA */
-
 #include <asm-generic/topology.h>
 
+#endif /* CONFIG_NUMA */
+
 #ifdef CONFIG_SMP
 #include <asm/cputable.h>
 #define smt_capable()		(cpu_has_feature(CPU_FTR_SMT))
Index: linux-2.6-lttng/arch/sh/include/asm/topology.h
===================================================================
--- linux-2.6-lttng.orig/arch/sh/include/asm/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/arch/sh/include/asm/topology.h	2008-10-03 17:59:12.000000000 -0400
@@ -29,17 +29,6 @@
 	.nr_balance_failed	= 0,			\
 }
 
-#define cpu_to_node(cpu)	((void)(cpu),0)
-#define parent_node(node)	((void)(node),0)
-
-#define node_to_cpumask(node)	((void)node, cpu_online_map)
-#define node_to_first_cpu(node)	((void)(node),0)
-
-#define pcibus_to_node(bus)	((void)(bus), -1)
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
 #endif
 
 #include <asm-generic/topology.h>
Index: linux-2.6-lttng/include/asm-generic/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-generic/topology.h	2008-10-03 17:58:00.000000000 -0400
+++ linux-2.6-lttng/include/asm-generic/topology.h	2008-10-03 17:59:12.000000000 -0400
@@ -27,44 +27,52 @@
 #ifndef _ASM_GENERIC_TOPOLOGY_H
 #define _ASM_GENERIC_TOPOLOGY_H
 
-#ifndef	CONFIG_NUMA
-
-/* Other architectures wishing to use this simple topology API should fill
-   in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef cpu_to_node
-#define cpu_to_node(cpu)	((void)(cpu),0)
-#endif
-#ifndef parent_node
-#define parent_node(node)	((void)(node),0)
-#endif
-#ifndef node_to_cpumask
-#define node_to_cpumask(node)	((void)node, cpu_online_map)
-#endif
-#ifndef node_to_first_cpu
-#define node_to_first_cpu(node)	((void)(node),0)
-#endif
-#ifndef pcibus_to_node
-#define pcibus_to_node(bus)	((void)(bus), -1)
-#endif
-
-#ifndef pcibus_to_cpumask
-#define pcibus_to_cpumask(bus)	(pcibus_to_node(bus) == -1 ? \
-					CPU_MASK_ALL : \
-					node_to_cpumask(pcibus_to_node(bus)) \
-				)
-#endif
-
-#endif	/* CONFIG_NUMA */
+/*
+ * Other architectures wishing to use this simple topology API should fill
+ * in the below functions as appropriate in their own <asm/topology.h> file,
+ * and _don't_ include asm-generic/topology.h.
+ */
+
+struct pci_bus;
+
+static inline int cpu_to_node(int cpu)
+{
+	return 0;
+}
+
+static inline int parent_node(int node)
+{
+	return 0;
+}
+
+static inline cpumask_t node_to_cpumask(int node)
+{
+	return cpu_online_map;
+}
+
+static inline int node_to_first_cpu(int node)
+{
+	return 0;
+}
+
+static inline int pcibus_to_node(struct pci_bus *bus)
+{
+	return -1;
+}
+
+static inline cpumask_t pcibus_to_cpumask(struct pci_bus *bus)
+{
+	return pcibus_to_node(bus) == -1 ?
+		CPU_MASK_ALL :
+		node_to_cpumask(pcibus_to_node(bus));
+}
 
 /* returns pointer to cpumask for specified node */
-#ifndef node_to_cpumask_ptr
-
 #define	node_to_cpumask_ptr(v, node) 					\
 		cpumask_t _##v = node_to_cpumask(node);			\
 		const cpumask_t *v = &_##v
 
 #define node_to_cpumask_ptr_next(v, node)				\
 			  _##v = node_to_cpumask(node)
-#endif
 
 #endif /* _ASM_GENERIC_TOPOLOGY_H */

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


end of thread, other threads:[~2008-10-03 22:52 UTC | newest]

Thread overview: 102+ messages
2008-09-25 18:51 [RFC PATCH 0/2 v3] Unified trace buffer Steven Rostedt
2008-09-25 18:51 ` [RFC PATCH 1/2 " Steven Rostedt
2008-09-26  1:02   ` [RFC PATCH v4] " Steven Rostedt
2008-09-26  1:52     ` Masami Hiramatsu
2008-09-26  2:11       ` Steven Rostedt
2008-09-26  2:47         ` Masami Hiramatsu
2008-09-26  3:20         ` Mathieu Desnoyers
2008-09-26  7:18           ` Peter Zijlstra
2008-09-26 10:45             ` Steven Rostedt
2008-09-26 11:00               ` Peter Zijlstra
2008-09-26 16:57                 ` Masami Hiramatsu
2008-09-26 17:14                   ` Steven Rostedt
2008-09-26 10:47             ` Steven Rostedt
2008-09-26 16:04             ` Mathieu Desnoyers
2008-09-26 17:11       ` [PATCH v5] " Steven Rostedt
2008-09-26 17:31         ` Arnaldo Carvalho de Melo
2008-09-26 17:37           ` Linus Torvalds
2008-09-26 17:46             ` Steven Rostedt
2008-09-27 17:02               ` Ingo Molnar
2008-09-27 17:18                 ` Steven Rostedt
2008-09-26 18:05         ` [PATCH v6] " Steven Rostedt
2008-09-26 18:30           ` Richard Holden
2008-09-26 18:39             ` Steven Rostedt
2008-09-26 18:59           ` Peter Zijlstra
2008-09-26 19:46             ` Martin Bligh
2008-09-26 19:52               ` Steven Rostedt
2008-09-26 21:37               ` Steven Rostedt
2008-09-26 19:14           ` Peter Zijlstra
2008-09-26 22:28             ` Mike Travis
2008-09-26 23:56               ` Steven Rostedt
2008-09-27  0:05                 ` Mike Travis
2008-09-27  0:18                   ` Steven Rostedt
2008-09-27  0:46                     ` Mike Travis
2008-09-27  0:52                       ` Steven Rostedt
2008-09-26 19:17           ` Peter Zijlstra
2008-09-26 23:16             ` Arjan van de Ven
2008-09-26 20:08           ` Peter Zijlstra
2008-09-26 21:14             ` Masami Hiramatsu
2008-09-26 21:26               ` Steven Rostedt
2008-09-26 21:13           ` [PATCH v7] " Steven Rostedt
2008-09-27  2:02             ` [PATCH v8] " Steven Rostedt
2008-09-27  6:06               ` [PATCH v9] " Steven Rostedt
2008-09-27 18:39                 ` Ingo Molnar
2008-09-27 19:24                   ` Steven Rostedt
2008-09-27 19:41                     ` Ingo Molnar
2008-09-27 19:54                       ` Steven Rostedt
2008-09-27 20:00                         ` Ingo Molnar
2008-09-29 15:05                           ` Steven Rostedt
2008-09-27 20:07                         ` Martin Bligh
2008-09-27 20:34                           ` Ingo Molnar
2008-09-29 16:10                 ` [PATCH v10 Golden] " Steven Rostedt
2008-09-29 16:11                   ` Steven Rostedt
2008-09-29 23:35                   ` Mathieu Desnoyers
2008-09-30  0:01                     ` Steven Rostedt
2008-09-30  0:03                       ` Mathieu Desnoyers
2008-09-30  0:12                         ` Steven Rostedt
2008-09-30  3:46                           ` Mathieu Desnoyers
2008-09-30  4:00                             ` Steven Rostedt
2008-09-30 15:20                               ` Jonathan Corbet
2008-09-30 15:54                                 ` Peter Zijlstra
2008-09-30 16:38                                   ` Linus Torvalds
2008-09-30 16:48                                     ` Steven Rostedt
2008-09-30 17:00                                       ` Peter Zijlstra
2008-09-30 17:41                                         ` Steven Rostedt
2008-09-30 17:49                                           ` Peter Zijlstra
2008-09-30 17:56                                             ` Steven Rostedt
2008-09-30 18:02                                               ` Steven Rostedt
2008-09-30 17:01                                       ` Linus Torvalds
2008-10-01 15:14                                         ` [PATCH] ring_buffer: allocate buffer page pointer Steven Rostedt
2008-10-01 17:36                                           ` Mathieu Desnoyers
2008-10-01 17:49                                             ` Steven Rostedt
2008-10-01 18:21                                           ` Mathieu Desnoyers
2008-10-02  8:50                                           ` Ingo Molnar
2008-10-02  8:51                                             ` Ingo Molnar
2008-10-02  9:05                                               ` [PATCH] ring-buffer: fix build error Ingo Molnar
2008-10-02  9:38                                                 ` [boot crash] " Ingo Molnar
2008-10-02 13:16                                                   ` Steven Rostedt
2008-10-02 13:17                                                   ` Steven Rostedt
2008-10-02 15:50                                                     ` Ingo Molnar
2008-10-02 18:27                                                       ` Steven Rostedt
2008-10-02 18:55                                                         ` Ingo Molnar
2008-10-02 23:18                                                   ` [PATCH] ring_buffer: map to cpu not page Steven Rostedt
2008-10-02 23:36                                                     ` Steven Rostedt
2008-10-03  4:56                                                     ` [PATCH] x86 Topology cpu_to_node parameter check Mathieu Desnoyers
2008-10-03  5:20                                                       ` Steven Rostedt
2008-10-03 15:56                                                         ` Mathieu Desnoyers
2008-10-03 16:26                                                           ` Steven Rostedt
2008-10-03 17:21                                                             ` Mathieu Desnoyers
2008-10-03 17:54                                                               ` Steven Rostedt
2008-10-03 18:53                                                                 ` [PATCH] topology.h define mess fix Mathieu Desnoyers
2008-10-03 20:14                                                                   ` Luck, Tony
2008-10-03 22:47                                                                     ` [PATCH] topology.h define mess fix v2 Mathieu Desnoyers
2008-10-03  7:27                                                     ` [PATCH] ring_buffer: map to cpu not page Ingo Molnar
2008-10-02  9:06                                             ` [PATCH] ring_buffer: allocate buffer page pointer Andrew Morton
2008-10-02  9:41                                               ` Ingo Molnar
2008-10-02 13:06                                               ` Steven Rostedt
2008-09-26 22:31           ` [PATCH v6] Unified trace buffer Arnaldo Carvalho de Melo
2008-09-26 23:58             ` Steven Rostedt
2008-09-27  0:13               ` Linus Torvalds
2008-09-27  0:23                 ` Steven Rostedt
2008-09-27  0:28                   ` Steven Rostedt
2008-09-25 18:51 ` [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer Steven Rostedt
