linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
@ 2009-03-04  2:49 Steven Rostedt
  2009-03-04  2:49 ` [PATCH 1/5] ring-buffer: reset write field for ring_buffer_read_page Steven Rostedt
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md


RFC only, not for pulling, unless everyone is fine with these :-)

After telling the folks from Google that I had splice code that I needed
to get working, I finally got around to doing it.

Not only did I get the splice code working, but I also made a perl
script (might want to cover you eyes if you look at that code) that
can automagically create the output from the binary files reading
the format arguments in the /debugfs/tracing/events/<subsys>/<event>/format
file.

>From previous patches, we have in include/trace/sched_event_types.h:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM sched

TRACE_EVENT_FORMAT(sched_switch,
	TPPROTO(struct rq *rq, struct task_struct *prev,
		struct task_struct *next),
	TPARGS(rq, prev, next),
	TPFMT("task %s:%d ==> %s:%d",
	      prev->comm, prev->pid, next->comm, next->pid),
	TRACE_STRUCT(
		TRACE_FIELD(pid_t, prev_pid, prev->pid)
		TRACE_FIELD(int, prev_prio, prev->prio)
		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
				    next_comm,
				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
						 next->comm,
						 TASK_COMM_LEN)))
		TRACE_FIELD(pid_t, next_pid, next->pid)
		TRACE_FIELD(int, next_prio, next->prio)
	),
	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
	);

By making that, we automagically get this file:

 # cat /debug/tracing/events/sched/sched_switch/format 
name: sched_switch
ID: 29
format:
	field:unsigned char type;	offset:0;	size:1;
	field:unsigned char flags;	offset:1;	size:1;
	field:unsigned char preempt_count;	offset:2;	size:1;
	field:int pid;	offset:4;	size:4;
	field:int tgid;	offset:8;	size:4;

	field:pid_t prev_pid;	offset:12;	size:4;
	field:int prev_prio;	offset:16;	size:4;
	field special:char next_comm[TASK_COMM_LEN];	offset:20;	size:16;
	field:pid_t next_pid;	offset:36;	size:4;
	field:int next_prio;	offset:40;	size:4;

print fmt: "prev %d:%d ==> next %s:%d:%d"


Now with this patch set, we create a way to read the ftrace ring buffers
directly, as a binary page. Splice has been used such that the user
could splice the ring buffers without need to copy the pages.
The pages are taken from the ring buffers and can be placed directly
into files, without extra copies.

 # ls /debug/tracing/binary_buffers/
0  1  2  3  4  5  6  7

One can either just use the read/write to grab live data from these
buffers, or they could use splice. I have a simple file that reads
this buffers using splice. Note, it only runs on one file, you
can make multiple copies to run more. The ring buffers in ftrace
are per cpu and they are not dependent on each other.

Also, if there is no data in the buffer, it returns -EAGAIN.

 # find /debug/tracing/events -name 'type' | while read f; do
>    echo raw > $f; done

 # find /debug/tracing/events -name 'enable' | while read f; do
>   echo 1 > $f; done

 # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0

Yes you can run multiple instances of this on different buffers.

 # ./rb-read.pl /tmp/buf-0

produces:

        0  [000]  7071.936459: (irq_handler_entry) irq 48
        0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
        0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
        0  [000]  7071.988813: (sched_wakeup) task 0 success=1
        0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
     3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
        0  [000]  7072.020370: (sched_wakeup) task 0 success=1
        0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
       16  [000]  7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0

Notice the "(sched_switch)" lines.

 The splice C code:
   http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c

 The perl script to parse:
   http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl

 And yes I know, they are both ugly :-p

The following patches are in:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git

    branch: rfc/splice/tip/tracing/ftrace


Steven Rostedt (5):
      ring-buffer: reset write field for ring_buffer_read_page
      ring-buffer: fix ring_buffer_read_page
      ring-buffer: replace sizeof of event header with offsetof
      ring-buffer: make ring_buffer_read_page read from start on partial page
      tracing: add binary buffer files for use with splice

----
 include/linux/ring_buffer.h |    7 +-
 kernel/trace/ring_buffer.c  |  118 +++++++++++++------
 kernel/trace/trace.c        |  274 +++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.h        |    1 +
 4 files changed, 357 insertions(+), 43 deletions(-)
-- 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/5] ring-buffer: reset write field for ring_buffer_read_page
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
@ 2009-03-04  2:49 ` Steven Rostedt
  2009-03-04  2:49 ` [PATCH 2/5] ring-buffer: fix ring_buffer_read_page Steven Rostedt
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

[-- Attachment #1: 0001-ring-buffer-reset-write-field-for-ring_buffer_read_.patch --]
[-- Type: text/plain, Size: 792 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: fix ring_buffer_read_page

After a page is swapped into the ring buffer, the write field must
also be reset.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a8c275c..9baad7e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2492,6 +2492,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 		rb_init_page(bpage);
 		bpage = cpu_buffer->reader_page->page;
 		cpu_buffer->reader_page->page = *data_page;
+		local_set(&cpu_buffer->reader_page->write, 0);
 		cpu_buffer->reader_page->read = 0;
 		*data_page = bpage;
 	}
-- 
1.5.6.5

-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/5] ring-buffer: fix ring_buffer_read_page
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
  2009-03-04  2:49 ` [PATCH 1/5] ring-buffer: reset write field for ring_buffer_read_page Steven Rostedt
@ 2009-03-04  2:49 ` Steven Rostedt
  2009-03-04  2:49 ` [PATCH 3/5] ring-buffer: replace sizeof of event header with offsetof Steven Rostedt
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

[-- Attachment #1: 0002-ring-buffer-fix-ring_buffer_read_page.patch --]
[-- Type: text/plain, Size: 6570 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

The ring_buffer_read_page was broken if it were to only copy part
of the page. This patch fixes that up as well as adds a parameter
to allow a length field, in order to only copy part of the buffer page.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |    7 ++-
 kernel/trace/ring_buffer.c  |   92 +++++++++++++++++++++++++++---------------
 2 files changed, 64 insertions(+), 35 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index f5e793d..79fcbc4 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -121,6 +121,9 @@ unsigned long ring_buffer_overrun_cpu(struct ring_buffer *buffer, int cpu);
 u64 ring_buffer_time_stamp(int cpu);
 void ring_buffer_normalize_time_stamp(int cpu, u64 *ts);
 
+size_t ring_buffer_page_len(void *page);
+
+
 /*
  * The below functions are fine to use outside the tracing facility.
  */
@@ -138,8 +141,8 @@ static inline int tracing_is_on(void) { return 0; }
 
 void *ring_buffer_alloc_read_page(struct ring_buffer *buffer);
 void ring_buffer_free_read_page(struct ring_buffer *buffer, void *data);
-int ring_buffer_read_page(struct ring_buffer *buffer,
-			  void **data_page, int cpu, int full);
+int ring_buffer_read_page(struct ring_buffer *buffer, void **data_page,
+			  size_t len, int cpu, int full);
 
 enum ring_buffer_flags {
 	RB_FL_OVERWRITE		= 1 << 0,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9baad7e..2ad6bae 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -234,6 +234,11 @@ static void rb_init_page(struct buffer_data_page *bpage)
 	local_set(&bpage->commit, 0);
 }
 
+size_t ring_buffer_page_len(void *page)
+{
+	return local_read(&((struct buffer_data_page *)page)->commit);
+}
+
 /*
  * Also stolen from mm/slob.c. Thanks to Mathieu Desnoyers for pointing
  * this issue out.
@@ -2378,8 +2383,8 @@ static void rb_remove_entries(struct ring_buffer_per_cpu *cpu_buffer,
  */
 void *ring_buffer_alloc_read_page(struct ring_buffer *buffer)
 {
-	unsigned long addr;
 	struct buffer_data_page *bpage;
+	unsigned long addr;
 
 	addr = __get_free_page(GFP_KERNEL);
 	if (!addr)
@@ -2387,6 +2392,8 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer)
 
 	bpage = (void *)addr;
 
+	rb_init_page(bpage);
+
 	return bpage;
 }
 
@@ -2406,6 +2413,7 @@ void ring_buffer_free_read_page(struct ring_buffer *buffer, void *data)
  * ring_buffer_read_page - extract a page from the ring buffer
  * @buffer: buffer to extract from
  * @data_page: the page to use allocated from ring_buffer_alloc_read_page
+ * @len: amount to extract
  * @cpu: the cpu of the buffer to extract
  * @full: should the extraction only happen when the page is full.
  *
@@ -2418,7 +2426,7 @@ void ring_buffer_free_read_page(struct ring_buffer *buffer, void *data)
  *	rpage = ring_buffer_alloc_read_page(buffer);
  *	if (!rpage)
  *		return error;
- *	ret = ring_buffer_read_page(buffer, &rpage, cpu, 0);
+ *	ret = ring_buffer_read_page(buffer, &rpage, len, cpu, 0);
  *	if (ret >= 0)
  *		process_page(rpage, ret);
  *
@@ -2435,71 +2443,89 @@ void ring_buffer_free_read_page(struct ring_buffer *buffer, void *data)
  *  <0 if no data has been transferred.
  */
 int ring_buffer_read_page(struct ring_buffer *buffer,
-			    void **data_page, int cpu, int full)
+			  void **data_page, size_t len, int cpu, int full)
 {
 	struct ring_buffer_per_cpu *cpu_buffer = buffer->buffers[cpu];
 	struct ring_buffer_event *event;
 	struct buffer_data_page *bpage;
+	struct buffer_page *reader;
 	unsigned long flags;
+	unsigned int commit;
 	unsigned int read;
 	int ret = -1;
 
 	if (!data_page)
-		return 0;
+		return -1;
 
 	bpage = *data_page;
 	if (!bpage)
-		return 0;
+		return -1;
 
 	spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-	/*
-	 * rb_buffer_peek will get the next ring buffer if
-	 * the current reader page is empty.
-	 */
-	event = rb_buffer_peek(buffer, cpu, NULL);
-	if (!event)
+	reader = rb_get_reader_page(cpu_buffer);
+	if (!reader)
 		goto out;
 
-	/* check for data */
-	if (!local_read(&cpu_buffer->reader_page->page->commit))
-		goto out;
+	event = rb_reader_event(cpu_buffer);
+
+	read = reader->read;
+	commit = rb_page_commit(reader);
 
-	read = cpu_buffer->reader_page->read;
 	/*
-	 * If the writer is already off of the read page, then simply
-	 * switch the read page with the given page. Otherwise
-	 * we need to copy the data from the reader to the writer.
+	 * If len > what's left on the page, and the writer is also off of
+	 * the read page, then simply switch the read page with the given
+	 * page. Otherwise we need to copy the data from the reader to the
+	 * writer.
 	 */
-	if (cpu_buffer->reader_page == cpu_buffer->commit_page) {
-		unsigned int commit = rb_page_commit(cpu_buffer->reader_page);
+	if ((len < (commit - read)) ||
+	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
+		unsigned int pos = read;
+		unsigned int size;
 
 		if (full)
 			goto out;
-		/* The writer is still on the reader page, we must copy */
-		memcpy(bpage->data + read, rpage->data + read, commit - read);
 
-		/* consume what was read */
-		cpu_buffer->reader_page->read = commit;
+		if (len > (commit - read))
+			len = (commit - read);
+
+		size = rb_event_length(event);
+
+		if (len < size)
+			goto out;
+
+		/* Need to copy one event at a time */
+		do {
+			memcpy(bpage->data + pos, rpage->data + pos, size);
+
+			len -= size;
+
+			rb_advance_reader(cpu_buffer);
+			pos = reader->read;
+
+			event = rb_reader_event(cpu_buffer);
+			size = rb_event_length(event);
+		} while (len > size);
 
 		/* update bpage */
-		local_set(&bpage->commit, commit);
-		if (!read)
-			bpage->time_stamp = rpage->time_stamp;
+		local_set(&bpage->commit, pos);
+		bpage->time_stamp = rpage->time_stamp;
+
 	} else {
 		/* swap the pages */
 		rb_init_page(bpage);
-		bpage = cpu_buffer->reader_page->page;
-		cpu_buffer->reader_page->page = *data_page;
-		local_set(&cpu_buffer->reader_page->write, 0);
-		cpu_buffer->reader_page->read = 0;
+		bpage = reader->page;
+		reader->page = *data_page;
+		local_set(&reader->write, 0);
+		reader->read = 0;
 		*data_page = bpage;
+
+		/* update the entry counter */
+		rb_remove_entries(cpu_buffer, bpage, read);
 	}
 	ret = read;
 
-	/* update the entry counter */
-	rb_remove_entries(cpu_buffer, bpage, read);
  out:
 	spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-- 
1.5.6.5

-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/5] ring-buffer: replace sizeof of event header with offsetof
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
  2009-03-04  2:49 ` [PATCH 1/5] ring-buffer: reset write field for ring_buffer_read_page Steven Rostedt
  2009-03-04  2:49 ` [PATCH 2/5] ring-buffer: fix ring_buffer_read_page Steven Rostedt
@ 2009-03-04  2:49 ` Steven Rostedt
  2009-03-04  2:49 ` [PATCH 4/5] ring-buffer: make ring_buffer_read_page read from start on partial page Steven Rostedt
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

[-- Attachment #1: 0003-ring-buffer-replace-sizeof-of-event-header-with-off.patch --]
[-- Type: text/plain, Size: 961 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: fix to possible alignment problems on some archs.

Some arch compilers include an NULL char array in the sizeof field.
Since the ring_buffer_event type includes one of these, it is better
to use the "offsetof" instead, to avoid strange bugs on these archs.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2ad6bae..27cf834 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -132,7 +132,7 @@ void ring_buffer_normalize_time_stamp(int cpu, u64 *ts)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_normalize_time_stamp);
 
-#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define RB_EVNT_HDR_SIZE (offsetof(struct ring_buffer_event, array))
 #define RB_ALIGNMENT		4U
 #define RB_MAX_SMALL_DATA	28
 
-- 
1.5.6.5

-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/5] ring-buffer: make ring_buffer_read_page read from start on partial page
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (2 preceding siblings ...)
  2009-03-04  2:49 ` [PATCH 3/5] ring-buffer: replace sizeof of event header with offsetof Steven Rostedt
@ 2009-03-04  2:49 ` Steven Rostedt
  2009-03-04  2:49 ` [PATCH 5/5] tracing: add binary buffer files for use with splice Steven Rostedt
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

[-- Attachment #1: 0004-ring-buffer-make-ring_buffer_read_page-read-from-st.patch --]
[-- Type: text/plain, Size: 4128 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: dont leave holes in read buffer page

The ring_buffer_read_page swaps a given page with the reader page
of the ring buffer, if certain conditions are set:

 1) requested length is big enough to hold entire page data

 2) a writer is not currently on the page

 3) the page is not partially consumed.

Instead of swapping with the supplied page. It copies the data to
the supplied page instead. But currently the data is copied in the
same offset as the source page. This causes a hole at the start
of the reader page. This complicates the use of this function.
Instead, it should copy the data at the beginning of the function
and update the index fields accordingly.

Other small clean ups are also done in this patch.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |   43 +++++++++++++++++++++++++++++++++----------
 1 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 27cf834..f2a163d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -61,6 +61,8 @@ enum {
 
 static unsigned long ring_buffer_flags __read_mostly = RB_BUFFERS_ON;
 
+#define BUF_PAGE_HDR_SIZE offsetof(struct buffer_data_page, data)
+
 /**
  * tracing_on - enable all tracing buffers
  *
@@ -234,9 +236,16 @@ static void rb_init_page(struct buffer_data_page *bpage)
 	local_set(&bpage->commit, 0);
 }
 
+/**
+ * ring_buffer_page_len - the size of data on the page.
+ * @page: The page to read
+ *
+ * Returns the amount of data on the page, including buffer page header.
+ */
 size_t ring_buffer_page_len(void *page)
 {
-	return local_read(&((struct buffer_data_page *)page)->commit);
+	return local_read(&((struct buffer_data_page *)page)->commit)
+		+ BUF_PAGE_HDR_SIZE;
 }
 
 /*
@@ -259,7 +268,7 @@ static inline int test_time_stamp(u64 delta)
 	return 0;
 }
 
-#define BUF_PAGE_SIZE (PAGE_SIZE - offsetof(struct buffer_data_page, data))
+#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
 
 /*
  * head_page == tail_page && head == tail then buffer is empty.
@@ -2454,6 +2463,15 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 	unsigned int read;
 	int ret = -1;
 
+	/*
+	 * If len is not big enough to hold the page header, then
+	 * we can not copy anything.
+	 */
+	if (len <= BUF_PAGE_HDR_SIZE)
+		return -1;
+
+	len -= BUF_PAGE_HDR_SIZE;
+
 	if (!data_page)
 		return -1;
 
@@ -2473,15 +2491,17 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 	commit = rb_page_commit(reader);
 
 	/*
-	 * If len > what's left on the page, and the writer is also off of
-	 * the read page, then simply switch the read page with the given
-	 * page. Otherwise we need to copy the data from the reader to the
-	 * writer.
+	 * If this page has been partially read or
+	 * if len is not big enough to read the rest of the page or
+	 * a writer is still on the page, then
+	 * we must copy the data from the page to the buffer.
+	 * Otherwise, we can simply swap the page with the one passed in.
 	 */
-	if ((len < (commit - read)) ||
+	if (read || (len < (commit - read)) ||
 	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
-		unsigned int pos = read;
+		unsigned int rpos = read;
+		unsigned int pos = 0;
 		unsigned int size;
 
 		if (full)
@@ -2497,12 +2517,13 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 
 		/* Need to copy one event at a time */
 		do {
-			memcpy(bpage->data + pos, rpage->data + pos, size);
+			memcpy(bpage->data + pos, rpage->data + rpos, size);
 
 			len -= size;
 
 			rb_advance_reader(cpu_buffer);
-			pos = reader->read;
+			rpos = reader->read;
+			pos += size;
 
 			event = rb_reader_event(cpu_buffer);
 			size = rb_event_length(event);
@@ -2512,6 +2533,8 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 		local_set(&bpage->commit, pos);
 		bpage->time_stamp = rpage->time_stamp;
 
+		/* we copied everything to the beginning */
+		read = 0;
 	} else {
 		/* swap the pages */
 		rb_init_page(bpage);
-- 
1.5.6.5

-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (3 preceding siblings ...)
  2009-03-04  2:49 ` [PATCH 4/5] ring-buffer: make ring_buffer_read_page read from start on partial page Steven Rostedt
@ 2009-03-04  2:49 ` Steven Rostedt
  2009-03-04  3:35   ` Andrew Morton
  2009-03-04  3:01 ` [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  2:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

[-- Attachment #1: 0005-tracing-add-binary-buffer-files-for-use-with-splice.patch --]
[-- Type: text/plain, Size: 8579 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Impact: new feature

This patch creates a directory of files that correspond to the
per CPU ring buffers. These are binary files and are made to
be used with splice. This is the fastest way to extract data from
the ftrace ring buffers.

Thanks to Jiaying Zhang for pushing me to get this code fixed,
 and to Eduard - Gabriel Munteanu for his splice code that helped
 me debug my code.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/trace.c |  274 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/trace/trace.h |    1 +
 2 files changed, 268 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ea055aa..12539f7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -11,31 +11,30 @@
  *  Copyright (C) 2004-2006 Ingo Molnar
  *  Copyright (C) 2004 William Lee Irwin III
  */
+#include <linux/ring_buffer.h>
 #include <linux/utsrelease.h>
+#include <linux/stacktrace.h>
+#include <linux/writeback.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
+#include <linux/irqflags.h>
 #include <linux/debugfs.h>
 #include <linux/pagemap.h>
 #include <linux/hardirq.h>
 #include <linux/linkage.h>
 #include <linux/uaccess.h>
+#include <linux/kprobes.h>
 #include <linux/ftrace.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
+#include <linux/splice.h>
 #include <linux/kdebug.h>
 #include <linux/ctype.h>
 #include <linux/init.h>
 #include <linux/poll.h>
 #include <linux/gfp.h>
 #include <linux/fs.h>
-#include <linux/kprobes.h>
-#include <linux/writeback.h>
-#include <linux/splice.h>
-
-#include <linux/stacktrace.h>
-#include <linux/ring_buffer.h>
-#include <linux/irqflags.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -3005,6 +3004,246 @@ static struct file_operations tracing_mark_fops = {
 	.write		= tracing_mark_write,
 };
 
+struct ftrace_buffer_info {
+	struct trace_array	*tr;
+	void			*spare;
+	int			cpu;
+	unsigned int		read;
+};
+
+static int tracing_buffers_open(struct inode *inode, struct file *filp)
+{
+	int cpu = (int)(long)inode->i_private;
+	struct ftrace_buffer_info *info;
+
+	if (tracing_disabled)
+		return -ENODEV;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	info->tr	= &global_trace;
+	info->cpu	= cpu;
+	info->spare	= ring_buffer_alloc_read_page(info->tr->buffer);
+	/* Force reading ring buffer for first read */
+	info->read	= (unsigned int)-1;
+	if (!info->spare)
+		goto out;
+
+	filp->private_data = info;
+
+	return 0;
+
+ out:
+	kfree(info);
+	return -ENOMEM;
+}
+
+static ssize_t
+tracing_buffers_read(struct file *filp, char __user *ubuf,
+		     size_t count, loff_t *ppos)
+{
+	struct ftrace_buffer_info *info = filp->private_data;
+	unsigned int pos;
+	ssize_t ret;
+	size_t size;
+
+	/* Do we have previous read data to read? */
+	if (info->read < PAGE_SIZE)
+		goto read;
+
+	info->read = 0;
+
+	ret = ring_buffer_read_page(info->tr->buffer,
+				    &info->spare,
+				    count,
+				    info->cpu, 0);
+	if (ret < 0)
+		return 0;
+
+	pos = ring_buffer_page_len(info->spare);
+
+	if (pos < PAGE_SIZE)
+		memset(info->spare + pos, 0, PAGE_SIZE - pos);
+
+read:
+	size = PAGE_SIZE - info->read;
+	if (size > count)
+		size = count;
+
+	ret = copy_to_user(ubuf, info->spare + info->read, size);
+	if (ret)
+		return -EFAULT;
+	*ppos += size;
+	info->read += size;
+
+	return size;
+}
+
+static int tracing_buffers_release(struct inode *inode, struct file *file)
+{
+	struct ftrace_buffer_info *info = file->private_data;
+
+	ring_buffer_free_read_page(info->tr->buffer, info->spare);
+	kfree(info);
+
+	return 0;
+}
+
+struct buffer_ref {
+	struct ring_buffer	*buffer;
+	void			*page;
+	int			ref;
+};
+
+static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
+				    struct pipe_buffer *buf)
+{
+	struct buffer_ref *ref = (struct buffer_ref *)buf->private;
+
+	if (--ref->ref)
+		return;
+
+	ring_buffer_free_read_page(ref->buffer, ref->page);
+	kfree(ref);
+	buf->private = 0;
+}
+
+static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
+				 struct pipe_buffer *buf)
+{
+	return 1;
+}
+
+static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
+				struct pipe_buffer *buf)
+{
+	struct buffer_ref *ref = (struct buffer_ref *)buf->private;
+
+	ref->ref++;
+}
+
+/* Pipe buffer operations for a buffer. */
+static struct pipe_buf_operations buffer_pipe_buf_ops = {
+	.can_merge		= 0,
+	.map			= generic_pipe_buf_map,
+	.unmap			= generic_pipe_buf_unmap,
+	.confirm		= generic_pipe_buf_confirm,
+	.release		= buffer_pipe_buf_release,
+	.steal			= buffer_pipe_buf_steal,
+	.get			= buffer_pipe_buf_get,
+};
+
+/*
+ * Callback from splice_to_pipe(), if we need to release some pages
+ * at the end of the spd in case we error'ed out in filling the pipe.
+ */
+static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i)
+{
+	struct buffer_ref *ref =
+		(struct buffer_ref *)spd->partial[i].private;
+
+	if (--ref->ref)
+		return;
+
+	ring_buffer_free_read_page(ref->buffer, ref->page);
+	kfree(ref);
+	spd->partial[i].private = 0;
+}
+
+static ssize_t
+tracing_buffers_splice_read(struct file *file, loff_t *ppos,
+			    struct pipe_inode_info *pipe, size_t len,
+			    unsigned int flags)
+{
+	struct ftrace_buffer_info *info = file->private_data;
+	struct partial_page partial[PIPE_BUFFERS];
+	struct page *pages[PIPE_BUFFERS];
+	struct splice_pipe_desc spd = {
+		.pages		= pages,
+		.partial	= partial,
+		.flags		= flags,
+		.ops		= &buffer_pipe_buf_ops,
+		.spd_release	= buffer_spd_release,
+	};
+	struct buffer_ref *ref;
+	int size, i;
+	size_t ret;
+
+	/*
+	 * We can't seek on a buffer input
+	 */
+	if (unlikely(*ppos))
+		return -ESPIPE;
+
+
+	for (i = 0; i < PIPE_BUFFERS && len; i++, len -= size) {
+		struct page *page;
+		int r;
+
+		ref = kzalloc(sizeof(*ref), GFP_KERNEL);
+		if (!ref)
+			break;
+
+		ref->buffer = info->tr->buffer;
+		ref->page = ring_buffer_alloc_read_page(ref->buffer);
+		if (!ref->page) {
+			kfree(ref);
+			break;
+		}
+
+		r = ring_buffer_read_page(ref->buffer, &ref->page,
+					  len, info->cpu, 0);
+		if (r < 0) {
+			ring_buffer_free_read_page(ref->buffer,
+						   ref->page);
+			kfree(ref);
+			break;
+		}
+
+		/*
+		 * zero out any left over data, this is going to
+		 * user land.
+		 */
+		size = ring_buffer_page_len(ref->page);
+		if (size < PAGE_SIZE)
+			memset(ref->page + size, 0, PAGE_SIZE - size);
+
+		page = virt_to_page(ref->page);
+
+		spd.pages[i] = page;
+		spd.partial[i].len = PAGE_SIZE;
+		spd.partial[i].offset = 0;
+		spd.partial[i].private = (unsigned long)ref;
+		spd.nr_pages++;
+	}
+
+	spd.nr_pages = i;
+
+	/* did we read anything? */
+	if (!spd.nr_pages) {
+		if (flags & SPLICE_F_NONBLOCK)
+			ret = -EAGAIN;
+		else
+			ret = 0;
+		/* TODO: block */
+		return ret;
+	}
+
+	ret = splice_to_pipe(pipe, &spd);
+
+	return ret;
+}
+
+static const struct file_operations tracing_buffers_fops = {
+	.open		= tracing_buffers_open,
+	.read		= tracing_buffers_read,
+	.release	= tracing_buffers_release,
+	.splice_read	= tracing_buffers_splice_read,
+	.llseek		= no_llseek,
+};
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 
 int __weak ftrace_arch_read_dyn_info(char *buf, int size)
@@ -3399,6 +3638,7 @@ static __init void create_trace_options_dir(void)
 static __init int tracer_init_debugfs(void)
 {
 	struct dentry *d_tracer;
+	struct dentry *buffers;
 	struct dentry *entry;
 	int cpu;
 
@@ -3471,6 +3711,26 @@ static __init int tracer_init_debugfs(void)
 		pr_warning("Could not create debugfs "
 			   "'trace_marker' entry\n");
 
+	buffers = debugfs_create_dir("binary_buffers", d_tracer);
+
+	if (!buffers)
+		pr_warning("Could not create buffers directory\n");
+	else {
+		int cpu;
+		char buf[64];
+
+		for_each_tracing_cpu(cpu) {
+			sprintf(buf, "%d", cpu);
+
+			entry = debugfs_create_file(buf, 0444, buffers,
+						    (void *)(long)cpu,
+						    &tracing_buffers_fops);
+			if (!entry)
+				pr_warning("Could not create debugfs buffers "
+					   "'%s' entry\n", buf);
+		}
+	}
+
 #ifdef CONFIG_DYNAMIC_FTRACE
 	entry = debugfs_create_file("dyn_ftrace_total_info", 0444, d_tracer,
 				    &ftrace_update_tot_cnt,
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index e606633..561bb5c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -217,6 +217,7 @@ enum trace_flag_type {
  */
 struct trace_array_cpu {
 	atomic_t		disabled;
+	void			*buffer_page;	/* ring buffer spare */
 
 	/* these fields get copied into max-trace: */
 	unsigned long		trace_idx;
-- 
1.5.6.5

-- 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (4 preceding siblings ...)
  2009-03-04  2:49 ` [PATCH 5/5] tracing: add binary buffer files for use with splice Steven Rostedt
@ 2009-03-04  3:01 ` Steven Rostedt
  2009-03-04  3:23   ` Steven Rostedt
  2009-03-04 10:26 ` Ingo Molnar
  2009-03-04 17:00 ` Mathieu Desnoyers
  7 siblings, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  3:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md


On Tue, 3 Mar 2009, Steven Rostedt wrote:

> Yes you can run multiple instances of this on different buffers.
> 
>  # ./rb-read.pl /tmp/buf-0
> 
> produces:
> 
>         0  [000]  7071.936459: (irq_handler_entry) irq 48
>         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
>         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
>         0  [000]  7071.988813: (sched_wakeup) task 0 success=1
>         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
>      3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
>         0  [000]  7072.020370: (sched_wakeup) task 0 success=1
>         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0

My perl script is still a bit buggy, but you get the idea ;-)

-- Steve

>        16  [000]  7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04  3:01 ` [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
@ 2009-03-04  3:23   ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  3:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md


On Tue, 3 Mar 2009, Steven Rostedt wrote:

> 
> On Tue, 3 Mar 2009, Steven Rostedt wrote:
> 
> > Yes you can run multiple instances of this on different buffers.
> > 
> >  # ./rb-read.pl /tmp/buf-0
> > 
> > produces:
> > 
> >         0  [000]  7071.936459: (irq_handler_entry) irq 48
> >         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
> >         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
> >         0  [000]  7071.988813: (sched_wakeup) task 0 success=1
> >         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> >      3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> >         0  [000]  7072.020370: (sched_wakeup) task 0 success=1
> >         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> 
> My perl script is still a bit buggy, but you get the idea ;-)

Uploaded a fix:

         0  [000]  7071.936459: (irq_handler_entry) irq 48
         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 3303
         0  [000]  7071.988813: (sched_wakeup) task 3303 success=1
         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpdpd:3303:120
      3303  [000]  7071.988916: (sched_switch) prev 3303:120 ==> next swapper:0:140
         0  [000]  7072.020370: (sched_wakeup) task 16 success=1
         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next events/0:16:115
        16  [000]  7071.936157: (sched_switch) prev 16:115 ==> next swapper:0:140

Much better.

I just used the same binary image. The bug was in the perl script.

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  2:49 ` [PATCH 5/5] tracing: add binary buffer files for use with splice Steven Rostedt
@ 2009-03-04  3:35   ` Andrew Morton
  2009-03-04  3:43     ` Steven Rostedt
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2009-03-04  3:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

On Tue, 03 Mar 2009 21:49:26 -0500 Steven Rostedt <rostedt@goodmis.org> wrote:

> +static ssize_t
> +tracing_buffers_read(struct file *filp, char __user *ubuf,
> +		     size_t count, loff_t *ppos)
> +{
> +	struct ftrace_buffer_info *info = filp->private_data;
> +	unsigned int pos;
> +	ssize_t ret;
> +	size_t size;
> +
> +	/* Do we have previous read data to read? */
> +	if (info->read < PAGE_SIZE)
> +		goto read;
> +
> +	info->read = 0;
> +
> +	ret = ring_buffer_read_page(info->tr->buffer,
> +				    &info->spare,
> +				    count,
> +				    info->cpu, 0);
> +	if (ret < 0)
> +		return 0;
> +
> +	pos = ring_buffer_page_len(info->spare);
> +
> +	if (pos < PAGE_SIZE)
> +		memset(info->spare + pos, 0, PAGE_SIZE - pos);
> +
> +read:
> +	size = PAGE_SIZE - info->read;
> +	if (size > count)
> +		size = count;
> +
> +	ret = copy_to_user(ubuf, info->spare + info->read, size);
> +	if (ret)
> +		return -EFAULT;

Conventionally a read() system call will return the number of bytes
copied, and will only return -EFOO if the number of bytes copied was
zero.

Lots of parts of the kernel break this, but it's usually device drivers
and scruffy pseudo files, in which case a partial file read doesn't
make much sense.  This doesn't make the broken behaviour right, but at
least we have a bit of a weaselly excuse in that case.

> +	*ppos += size;
> +	info->read += size;
> +
> +	return size;
> +}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  3:35   ` Andrew Morton
@ 2009-03-04  3:43     ` Steven Rostedt
  2009-03-04  4:38       ` H. Peter Anvin
  2009-03-04  4:46       ` Theodore Tso
  0 siblings, 2 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  3:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt


On Tue, 3 Mar 2009, Andrew Morton wrote:

> On Tue, 03 Mar 2009 21:49:26 -0500 Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > +static ssize_t
> > +tracing_buffers_read(struct file *filp, char __user *ubuf,
> > +		     size_t count, loff_t *ppos)
> > +{
> > +	struct ftrace_buffer_info *info = filp->private_data;
> > +	unsigned int pos;
> > +	ssize_t ret;
> > +	size_t size;
> > +
> > +	/* Do we have previous read data to read? */
> > +	if (info->read < PAGE_SIZE)
> > +		goto read;
> > +
> > +	info->read = 0;
> > +
> > +	ret = ring_buffer_read_page(info->tr->buffer,
> > +				    &info->spare,
> > +				    count,
> > +				    info->cpu, 0);
> > +	if (ret < 0)
> > +		return 0;
> > +
> > +	pos = ring_buffer_page_len(info->spare);
> > +
> > +	if (pos < PAGE_SIZE)
> > +		memset(info->spare + pos, 0, PAGE_SIZE - pos);
> > +
> > +read:
> > +	size = PAGE_SIZE - info->read;
> > +	if (size > count)
> > +		size = count;
> > +
> > +	ret = copy_to_user(ubuf, info->spare + info->read, size);
> > +	if (ret)
> > +		return -EFAULT;
> 
> Conventionally a read() system call will return the number of bytes
> copied, and will only return -EFOO if the number of bytes copied was
> zero.
> 
> Lots of parts of the kernel break this, but it's usually device drivers
> and scruffy pseudo files, in which case a partial file read doesn't
> make much sense.  This doesn't make the broken behaviour right, but at
> least we have a bit of a weaselly excuse in that case.
> 

I just went by the read man page:

       EFAULT buf is outside your accessible address space.

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  3:43     ` Steven Rostedt
@ 2009-03-04  4:38       ` H. Peter Anvin
  2009-03-04  4:45         ` Steven Rostedt
  2009-03-04  4:46       ` Theodore Tso
  1 sibling, 1 reply; 26+ messages in thread
From: H. Peter Anvin @ 2009-03-04  4:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

Steven Rostedt wrote:
>> Conventionally a read() system call will return the number of bytes
>> copied, and will only return -EFOO if the number of bytes copied was
>> zero.
>>
>> Lots of parts of the kernel break this, but it's usually device drivers
>> and scruffy pseudo files, in which case a partial file read doesn't
>> make much sense.  This doesn't make the broken behaviour right, but at
>> least we have a bit of a weaselly excuse in that case.
> 
> I just went by the read man page:
> 
>        EFAULT buf is outside your accessible address space.

Yes, and that is the correct value if no bytes were copied.

	-hpa


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  4:38       ` H. Peter Anvin
@ 2009-03-04  4:45         ` Steven Rostedt
  0 siblings, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  4:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt



On Tue, 3 Mar 2009, H. Peter Anvin wrote:

> Steven Rostedt wrote:
> > > Conventionally a read() system call will return the number of bytes
> > > copied, and will only return -EFOO if the number of bytes copied was
> > > zero.
> > > 
> > > Lots of parts of the kernel break this, but it's usually device drivers
> > > and scruffy pseudo files, in which case a partial file read doesn't
> > > make much sense.  This doesn't make the broken behaviour right, but at
> > > least we have a bit of a weaselly excuse in that case.
> > 
> > I just went by the read man page:
> > 
> >        EFAULT buf is outside your accessible address space.
> 
> Yes, and that is the correct value if no bytes were copied.

Yuck. IMHO I rather have a read return a failure to my user app if I'm 
trying to write to more than I allocated. I guess I can change it, but it 
just seems to be hiding a bug in a user app.

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  3:43     ` Steven Rostedt
  2009-03-04  4:38       ` H. Peter Anvin
@ 2009-03-04  4:46       ` Theodore Tso
  2009-03-04  4:49         ` Steven Rostedt
  2009-03-04  5:07         ` [PATCH] fs: make simple_read_from_buffer conventional Steven Rostedt
  1 sibling, 2 replies; 26+ messages in thread
From: Theodore Tso @ 2009-03-04  4:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt

On Tue, Mar 03, 2009 at 10:43:16PM -0500, Steven Rostedt wrote:
> > Conventionally a read() system call will return the number of bytes
> > copied, and will only return -EFOO if the number of bytes copied was
> > zero.
> > 
> > Lots of parts of the kernel break this, but it's usually device drivers
> > and scruffy pseudo files, in which case a partial file read doesn't
> > make much sense.  This doesn't make the broken behaviour right, but at
> > least we have a bit of a weaselly excuse in that case.
> > 
> 
> I just went by the read man page:
> 
>        EFAULT buf is outside your accessible address space.
> 

What is normally done in this case if buf is within the addressable
space, but buf+x is not, where x < len, then read(fd, buf, len) will
fill copy as many characters it can into buf until it runs out of
valid addres space, and then return that count as a partial read.

      	     	    	     	    	 -  Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/5] tracing: add binary buffer files for use with splice
  2009-03-04  4:46       ` Theodore Tso
@ 2009-03-04  4:49         ` Steven Rostedt
  2009-03-04  5:07         ` [PATCH] fs: make simple_read_from_buffer conventional Steven Rostedt
  1 sibling, 0 replies; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  4:49 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, linux-kernel, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt


On Tue, 3 Mar 2009, Theodore Tso wrote:

> On Tue, Mar 03, 2009 at 10:43:16PM -0500, Steven Rostedt wrote:
> > > Conventionally a read() system call will return the number of bytes
> > > copied, and will only return -EFOO if the number of bytes copied was
> > > zero.
> > > 
> > > Lots of parts of the kernel break this, but it's usually device drivers
> > > and scruffy pseudo files, in which case a partial file read doesn't
> > > make much sense.  This doesn't make the broken behaviour right, but at
> > > least we have a bit of a weaselly excuse in that case.
> > > 
> > 
> > I just went by the read man page:
> > 
> >        EFAULT buf is outside your accessible address space.
> > 
> 
> What is normally done in this case if buf is within the addressable
> space, but buf+x is not, where x < len, then read(fd, buf, len) will
> fill copy as many characters it can into buf until it runs out of
> valid addres space, and then return that count as a partial read.

I guess I need to change that code to be:

	if (ret) {
		if (ret == size)
			return -EFAULT;
		size -= ret;
	}

??

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] fs: make simple_read_from_buffer conventional
  2009-03-04  4:46       ` Theodore Tso
  2009-03-04  4:49         ` Steven Rostedt
@ 2009-03-04  5:07         ` Steven Rostedt
  2009-03-04 10:12           ` Ingo Molnar
  1 sibling, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04  5:07 UTC (permalink / raw)
  To: LKML
  Cc: Theodore Tso, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt


Impact: have simple_read_from_buffer conform to standards

It was brought to my attention by Andrew Morton, Theodore Tso,
and H. Peter Anvin that a read from userspace should only return
-EFAULT if nothing was actually read.

Looking at the simple_read_from_buffer I noticed that this function
does not conform to that rule. This patch fixes that function.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>

diff --git a/fs/libfs.c b/fs/libfs.c
index 49b4409..6a72298 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -525,14 +525,20 @@ ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
 				const void *from, size_t available)
 {
 	loff_t pos = *ppos;
+	size_t ret;
+
 	if (pos < 0)
 		return -EINVAL;
 	if (pos >= available)
 		return 0;
 	if (count > available - pos)
 		count = available - pos;
-	if (copy_to_user(to, from + pos, count))
-		return -EFAULT;
+	ret = copy_to_user(to, from + pos, count);
+	if (ret) {
+		if (ret == count)
+			return -EFAULT;
+		count -= ret;
+	}
 	*ppos = pos + count;
 	return count;
 }


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] fs: make simple_read_from_buffer conventional
  2009-03-04  5:07         ` [PATCH] fs: make simple_read_from_buffer conventional Steven Rostedt
@ 2009-03-04 10:12           ` Ingo Molnar
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2009-03-04 10:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Theodore Tso, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md, Steven Rostedt


* Steven Rostedt <rostedt@goodmis.org> wrote:

> Impact: have simple_read_from_buffer conform to standards
> 
> It was brought to my attention by Andrew Morton, Theodore Tso,
> and H. Peter Anvin that a read from userspace should only return
> -EFAULT if nothing was actually read.
> 
> Looking at the simple_read_from_buffer I noticed that this function
> does not conform to that rule. This patch fixes that function.
> 
> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 49b4409..6a72298 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -525,14 +525,20 @@ ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
>  				const void *from, size_t available)
>  {
>  	loff_t pos = *ppos;
> +	size_t ret;
> +
>  	if (pos < 0)
>  		return -EINVAL;
>  	if (pos >= available)
>  		return 0;
>  	if (count > available - pos)
>  		count = available - pos;
> -	if (copy_to_user(to, from + pos, count))
> -		return -EFAULT;
> +	ret = copy_to_user(to, from + pos, count);
> +	if (ret) {
> +		if (ret == count)
> +			return -EFAULT;
> +		count -= ret;
> +	}

Btw., the git grep result below shows 160 usage sites all across 
the kernel, so this bug affects a lot of existing debugfs users.

	Ingo

arch/cris/kernel/profile.c:	ret = simple_read_from_buffer(buf, count, ppos, sample_buffer,
arch/ia64/kernel/salinfo.c:	return simple_read_from_buffer(buffer, count, ppos, buf, bufsize);
arch/powerpc/kernel/proc_ppc64.c:	return simple_read_from_buffer(buf, nbytes, ppos, dp->data, dp->size);
arch/powerpc/platforms/cell/spufs/file.c:	ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size);
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buffer, size, pos, local_store,
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buffer, size, pos,
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buffer, size, pos,
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buf, len, pos, &data, sizeof data);
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buf, len, pos, &data, sizeof data);
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buf, len, pos, &data,
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buf, len, pos, &info,
arch/powerpc/platforms/cell/spufs/file.c:	return simple_read_from_buffer(buf, len, pos, &info,
arch/s390/hypfs/inode.c:	ret = simple_read_from_buffer(buf, count, &offset, data, strlen(data));
arch/um/drivers/mmapper_kern.c:	return simple_read_from_buffer(buf, count, ppos, v_buf, mmapper_size);
arch/x86/xen/debugfs.c:	return simple_read_from_buffer(buf, len, ppos, file->private_data, size);
drivers/acpi/system.c:	res = simple_read_from_buffer(buffer, count, ppos, dsdt, dsdt->length);
drivers/acpi/system.c:	res = simple_read_from_buffer(buffer, count, ppos, fadt, fadt->length);
drivers/char/nwflash.c:	ret = simple_read_from_buffer(buf, size, ppos, (void *)FLASH_BASE, gbFlashSize);
drivers/idle/i7300_idle.c:	return simple_read_from_buffer(ubuf, count, off, buf, len);
drivers/infiniband/hw/ipath/ipath_fs.c:	return simple_read_from_buffer(buf, count, ppos, &ipath_stats,
drivers/infiniband/hw/ipath/ipath_fs.c:	return simple_read_from_buffer(buf, count, ppos, &counters,
drivers/isdn/hysdn/hysdn_procconf.c:	return simple_read_from_buffer(buf, count, off, cp, strlen(cp));
drivers/media/dvb/ttusb-budget/dvb-ttusb-budget.c:	return simple_read_from_buffer(buf, count, offset, stc_firmware, 8192);
drivers/media/video/cafe_ccic.c:	return simple_read_from_buffer(buf, count, ppos, cafe_debug_buf,
drivers/media/video/cafe_ccic.c:	return simple_read_from_buffer(buf, count, ppos, cafe_debug_buf,
drivers/net/wimax/i2400m/debugfs.c:	return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
drivers/net/wimax/i2400m/debugfs.c:	return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
drivers/net/wireless/airo.c:	return simple_read_from_buffer(buffer, len, offset, priv->rbuffer,
drivers/net/wireless/ath5k/debug.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, 19);
drivers/net/wireless/ath5k/debug.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
drivers/net/wireless/ath5k/debug.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
drivers/net/wireless/ath9k/debug.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
drivers/net/wireless/ath9k/debug.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
drivers/net/wireless/b43/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos,
drivers/net/wireless/b43legacy/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos,
drivers/net/wireless/iwlwifi/iwl-3945-rs.c:	return simple_read_from_buffer(user_buf, count, ppos, buff, desc);
drivers/net/wireless/iwlwifi/iwl-agn-rs.c:	return simple_read_from_buffer(user_buf, count, ppos, buff, desc);
drivers/net/wireless/iwlwifi/iwl-agn-rs.c:	return simple_read_from_buffer(user_buf, count, ppos, buff, desc);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/iwlwifi/iwl-debugfs.c:	ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	res = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	res = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, pos);
drivers/net/wireless/libertas/debugfs.c:	res = simple_read_from_buffer(userbuf, count, ppos, p, pos);
drivers/oprofile/oprofilefs.c:	return simple_read_from_buffer(buf, count, offset, str, strlen(str));
drivers/oprofile/oprofilefs.c:	return simple_read_from_buffer(buf, count, offset, tmpbuf, maxlen);
drivers/pci/hotplug/cpqphp_sysfs.c:	return simple_read_from_buffer(buf, nbytes, ppos, dbg->data, dbg->size);
drivers/s390/char/vmcp.c:	ret = simple_read_from_buffer(buff, count, ppos,
drivers/s390/char/zcore.c:	return simple_read_from_buffer(buf, count, ppos, filp->private_data,
drivers/scsi/lpfc/lpfc_debugfs.c:	return simple_read_from_buffer(buf, nbytes, ppos, debug->buffer,
drivers/usb/gadget/atmel_usba_udc.c:	ret = simple_read_from_buffer(buf, nbytes, ppos,
drivers/usb/host/ehci-dbg.c:	ret = simple_read_from_buffer(user_buf, len, offset,
drivers/usb/host/ohci-dbg.c:	ret = simple_read_from_buffer(user_buf, len, offset,
drivers/usb/host/uhci-debug.c:	return simple_read_from_buffer(buf, nbytes, ppos, up->data, up->size);
drivers/usb/misc/idmouse.c:	result = simple_read_from_buffer(buffer, count, ppos,
drivers/usb/mon/mon_stat.c:	return simple_read_from_buffer(buf, nbytes, ppos, sp->str, sp->slen);
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
drivers/video/mbx/mbxdebugfs.c:	return  simple_read_from_buffer(userbuf, count, ppos,
fs/binfmt_misc.c:	res = simple_read_from_buffer(buf, nbytes, ppos, page, strlen(page));
fs/binfmt_misc.c:	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
fs/configfs/file.c:	retval = simple_read_from_buffer(buf, count, ppos, buffer->page,
fs/debugfs/file.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
fs/debugfs/file.c:	return simple_read_from_buffer(user_buf, count, ppos, blob->data,
fs/dlm/debug_fs.c:	rv = simple_read_from_buffer(userbuf, count, ppos, debug_buf, pos);
fs/fuse/control.c:	return simple_read_from_buffer(buf, len, ppos, tmp, size);
fs/libfs.c: * simple_read_from_buffer - copy data from the buffer to user space
fs/libfs.c: * The simple_read_from_buffer() function reads up to @count bytes from the
fs/libfs.c:ssize_t simple_read_from_buffer(void __user *to, size_t count, loff_t *ppos,
fs/libfs.c:	return simple_read_from_buffer(buf, size, pos, ar->data, ar->size);
fs/libfs.c:	ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size);
fs/libfs.c:EXPORT_SYMBOL(simple_read_from_buffer);
fs/ocfs2/dlm/dlmdebug.c:	return simple_read_from_buffer(buf, nbytes, ppos, db->buf, db->len);
fs/ocfs2/localalloc.c:	ret = simple_read_from_buffer(userbuf, count, ppos, buf, written);
fs/ocfs2/stack_user.c:	ret = simple_read_from_buffer(buf, count, ppos,
fs/proc/base.c:		length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
fs/proc/base.c:	return simple_read_from_buffer(buf, count, ppos, buffer, len);
fs/proc/base.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
fs/proc/base.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
fs/proc/base.c:	return simple_read_from_buffer(buf, count, ppos, buffer, len);
fs/proc/base.c:		err = simple_read_from_buffer(buf, len, ppos, tmp, strlen(tmp));
fs/proc/base.c:		length = simple_read_from_buffer(buf, count, ppos, p, length);
fs/proc/base.c:		ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
fs/sysfs/file.c:	retval = simple_read_from_buffer(buf, count, ppos, buffer->page,
include/linux/fs.h:extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
ipc/mqueue.c:	ret = simple_read_from_buffer(u_data, count, off, buffer,
kernel/cgroup.c:	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
kernel/cgroup.c:	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
kernel/configs.c:	return simple_read_from_buffer(buf, len, offset,
kernel/cpuset.c:	retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page);
kernel/kprobes.c:	return simple_read_from_buffer(user_buf, count, ppos, buf, 2);
kernel/rcupreempt_trace.c:	bcount = simple_read_from_buffer(buffer, count, ppos,
kernel/rcupreempt_trace.c:	bcount = simple_read_from_buffer(buffer, count, ppos,
kernel/rcupreempt_trace.c:	bcount = simple_read_from_buffer(buffer, count, ppos,
kernel/res_counter.c:	return simple_read_from_buffer((void __user *)userbuf, nbytes,
kernel/trace/blktrace.c:	return simple_read_from_buffer(buffer, count, ppos, buf, strlen(buf));
kernel/trace/ftrace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/ring_buffer.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	count = simple_read_from_buffer(ubuf, count, ppos, mask_str, NR_CPUS+1);
kernel/trace/trace.c:	r = simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos,
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	r = simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, 2);
kernel/trace/trace.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, 2);
kernel/trace/trace_events.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, 2);
kernel/trace/trace_events.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace_events.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
kernel/trace/trace_events.c:		r = simple_read_from_buffer(ubuf, cnt, ppos,
kernel/trace/trace_events.c:	r = simple_read_from_buffer(ubuf, cnt, ppos,
kernel/trace/trace_stack.c:	return simple_read_from_buffer(ubuf, count, ppos, buf, r);
kernel/trace/trace_sysprof.c:	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
net/mac80211/debugfs.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);	\
net/mac80211/debugfs.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);
net/mac80211/debugfs_key.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);	\
net/mac80211/debugfs_key.c:	return simple_read_from_buffer(userbuf, count, ppos, alg, strlen(alg));
net/mac80211/debugfs_key.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, len);
net/mac80211/debugfs_key.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, len);
net/mac80211/debugfs_key.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, len);
net/mac80211/debugfs_key.c:	res = simple_read_from_buffer(userbuf, count, ppos, buf, p - buf);
net/mac80211/debugfs_netdev.c:		ret = simple_read_from_buffer(userbuf, count, ppos, buf, ret);
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);	\
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, res);
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, p - buf);
net/mac80211/debugfs_sta.c:	return simple_read_from_buffer(userbuf, count, ppos, buf, p - buf);
net/mac80211/rate.c:	return simple_read_from_buffer(userbuf, count, ppos,
net/sunrpc/sysctl.c:	return simple_read_from_buffer(buffer, *lenp, ppos, tmpbuf, len);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:	ret = simple_read_from_buffer(buf, count, ppos, page, length);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/selinux/selinuxfs.c:		ret = simple_read_from_buffer(buf, count, ppos, page, ret);
security/selinux/selinuxfs.c:	ret = simple_read_from_buffer(buf, count, ppos, con, len);
security/selinux/selinuxfs.c:	rc = simple_read_from_buffer(buf, count, ppos, page, len);
security/selinux/selinuxfs.c:	rc = simple_read_from_buffer(buf, count, ppos, page, len);
security/selinux/selinuxfs.c:	return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
security/smack/smackfs.c:	rc = simple_read_from_buffer(buf, count, ppos, temp, strlen(temp));
security/smack/smackfs.c:	rc = simple_read_from_buffer(buf, count, ppos, temp, strlen(temp));
security/smack/smackfs.c:		rc = simple_read_from_buffer(buf, cn, ppos,
security/smack/smackfs.c:		rc = simple_read_from_buffer(buf, cn, ppos, smack, asize);
sound/soc/soc-core.c:		ret = simple_read_from_buffer(user_buf, count, ppos, buf, ret);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (5 preceding siblings ...)
  2009-03-04  3:01 ` [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
@ 2009-03-04 10:26 ` Ingo Molnar
  2009-03-04 14:51   ` Steven Rostedt
  2009-03-04 15:39   ` Mathieu Desnoyers
  2009-03-04 17:00 ` Mathieu Desnoyers
  7 siblings, 2 replies; 26+ messages in thread
From: Ingo Molnar @ 2009-03-04 10:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md


* Steven Rostedt <rostedt@goodmis.org> wrote:

> RFC only, not for pulling, unless everyone is fine with these 
> :-)
> 
> After telling the folks from Google that I had splice code 
> that I needed to get working, I finally got around to doing 
> it.
> 
> Not only did I get the splice code working, but I also made a 
> perl script (might want to cover you eyes if you look at that 
> code) that can automagically create the output from the binary 
> files reading the format arguments in the 
> /debugfs/tracing/events/<subsys>/<event>/format file.

Very cool stuff!

The ftrace splice path allows the following tracing path: after 
we create a trace ringbuffer page in the kernel, we dont touch 
it _ever again_. We can shuffle it to disk via DMA or over the 
network via DMA without bringing it again into the CPU's caches.

This feature allows low-overhead high-throughput tracing on 
unprecedented levels. There's no format string overhead, no 
instruction patching/trapping overhead - straight raw binary 
tracing with C syntax tracepoints and a zero-copy path to 
storage.

Have you had any chance to measure tracing overhead and the 
maximum throghput we can reach with the ftrace splice code? I'd 
expect to see some really nice numbers.

> >From previous patches, we have in include/trace/sched_event_types.h:
> 
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM sched
> 
> TRACE_EVENT_FORMAT(sched_switch,
> 	TPPROTO(struct rq *rq, struct task_struct *prev,
> 		struct task_struct *next),
> 	TPARGS(rq, prev, next),
> 	TPFMT("task %s:%d ==> %s:%d",
> 	      prev->comm, prev->pid, next->comm, next->pid),
> 	TRACE_STRUCT(
> 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> 		TRACE_FIELD(int, prev_prio, prev->prio)
> 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> 				    next_comm,
> 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> 						 next->comm,
> 						 TASK_COMM_LEN)))
> 		TRACE_FIELD(pid_t, next_pid, next->pid)
> 		TRACE_FIELD(int, next_prio, next->prio)
> 	),
> 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> 	);
> 
> By making that, we automagically get this file:
> 
>  # cat /debug/tracing/events/sched/sched_switch/format 
> name: sched_switch
> ID: 29
> format:
> 	field:unsigned char type;	offset:0;	size:1;
> 	field:unsigned char flags;	offset:1;	size:1;
> 	field:unsigned char preempt_count;	offset:2;	size:1;
> 	field:int pid;	offset:4;	size:4;
> 	field:int tgid;	offset:8;	size:4;
> 
> 	field:pid_t prev_pid;	offset:12;	size:4;
> 	field:int prev_prio;	offset:16;	size:4;
> 	field special:char next_comm[TASK_COMM_LEN];	offset:20;	size:16;
> 	field:pid_t next_pid;	offset:36;	size:4;
> 	field:int next_prio;	offset:40;	size:4;
> 
> print fmt: "prev %d:%d ==> next %s:%d:%d"
> 
> 
> Now with this patch set, we create a way to read the ftrace 
> ring buffers directly, as a binary page. Splice has been used 
> such that the user could splice the ring buffers without need 
> to copy the pages. The pages are taken from the ring buffers 
> and can be placed directly into files, without extra copies.
> 
>  # ls /debug/tracing/binary_buffers/
> 0  1  2  3  4  5  6  7
> 
> One can either just use the read/write to grab live data from 
> these buffers, or they could use splice. I have a simple file 
> that reads this buffers using splice. Note, it only runs on 
> one file, you can make multiple copies to run more. The ring 
> buffers in ftrace are per cpu and they are not dependent on 
> each other.
> 
> Also, if there is no data in the buffer, it returns -EAGAIN.
> 
>  # find /debug/tracing/events -name 'type' | while read f; do
> >    echo raw > $f; done
> 
>  # find /debug/tracing/events -name 'enable' | while read f; do
> >   echo 1 > $f; done
> 
>  # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> 
> Yes you can run multiple instances of this on different buffers.
> 
>  # ./rb-read.pl /tmp/buf-0
> 
> produces:
> 
>         0  [000]  7071.936459: (irq_handler_entry) irq 48
>         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
>         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
>         0  [000]  7071.988813: (sched_wakeup) task 0 success=1
>         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
>      3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
>         0  [000]  7072.020370: (sched_wakeup) task 0 success=1
>         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
>        16  [000]  7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> 
> Notice the "(sched_switch)" lines.
> 
>  The splice C code:
>    http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> 
>  The perl script to parse:
>    http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> 
>  And yes I know, they are both ugly :-p

Any chance of merging them into a single perl script? I.e. the 
perl script would embedd the .c code and build it on every 
invocation (into /tmp), so that one can do single-tool tracing 
with no immediate binary components. Such self-sufficiency is 
very handy when probing systems in a minimally invasive way.

And we could also have twice the ugliness for the same price.

> The following patches are in:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> 
>     branch: rfc/splice/tip/tracing/ftrace

Looks like the only problem with the code is the -EFAULT 
handling bug it exposed in simple_read_from_buffer() - but that 
fix should go on a separate track IMO.

So i've pulled it into tip:tracing - thanks Steve! We definitely 
want this in v2.6.30.

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04 10:26 ` Ingo Molnar
@ 2009-03-04 14:51   ` Steven Rostedt
  2009-03-04 22:47     ` Ingo Oeser
  2009-03-04 15:39   ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2009-03-04 14:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Theodore Tso, Arjan van de Ven, Pekka Paalanen,
	Arnaldo Carvalho de Melo, H. Peter Anvin, Mathieu Desnoyers,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md

On Wed, 4 Mar 2009, Ingo Molnar wrote:
> > 
> >  The splice C code:
> >    http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> > 
> >  The perl script to parse:
> >    http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> > 
> >  And yes I know, they are both ugly :-p
> 
> Any chance of merging them into a single perl script? I.e. the 
> perl script would embedd the .c code and build it on every 
> invocation (into /tmp), so that one can do single-tool tracing 
> with no immediate binary components. Such self-sufficiency is 
> very handy when probing systems in a minimally invasive way.
> 
> And we could also have twice the ugliness for the same price.
> 

Gag, that would really be ugly code ;-)

If anything I might actually want to break up the perl script. This was 
just made to be an example. Any serious user would most likely write their 
own user apps. This was more of a proof of concept.

I could clean up that code, and then have one big file that does all the 
ugliness for those that just want to tinker with it. But that is low on 
the totem pole ;-)

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04 10:26 ` Ingo Molnar
  2009-03-04 14:51   ` Steven Rostedt
@ 2009-03-04 15:39   ` Mathieu Desnoyers
  1 sibling, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2009-03-04 15:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, linux-kernel, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > RFC only, not for pulling, unless everyone is fine with these 
> > :-)
> > 
> > After telling the folks from Google that I had splice code 
> > that I needed to get working, I finally got around to doing 
> > it.
> > 
> > Not only did I get the splice code working, but I also made a 
> > perl script (might want to cover you eyes if you look at that 
> > code) that can automagically create the output from the binary 
> > files reading the format arguments in the 
> > /debugfs/tracing/events/<subsys>/<event>/format file.
> 
> Very cool stuff!
> 
> The ftrace splice path allows the following tracing path: after 
> we create a trace ringbuffer page in the kernel, we dont touch 
> it _ever again_. We can shuffle it to disk via DMA or over the 
> network via DMA without bringing it again into the CPU's caches.
> 
> This feature allows low-overhead high-throughput tracing on 
> unprecedented levels. There's no format string overhead, no 
> instruction patching/trapping overhead - straight raw binary 
> tracing with C syntax tracepoints and a zero-copy path to 
> storage.
> 
> Have you had any chance to measure tracing overhead and the 
> maximum throghput we can reach with the ftrace splice code? I'd 
> expect to see some really nice numbers.
> 

"unprecented levels" -> LTTng has been using splice for about 5 months.
The only reason why I did not post the patchset which performs this is
because I am currently adapting a LTTng module to optionally format the
binary buffers to a text output. And yes, it provides very, very good
performance results.

Mathieu


> > >From previous patches, we have in include/trace/sched_event_types.h:
> > 
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> > 
> > TRACE_EVENT_FORMAT(sched_switch,
> > 	TPPROTO(struct rq *rq, struct task_struct *prev,
> > 		struct task_struct *next),
> > 	TPARGS(rq, prev, next),
> > 	TPFMT("task %s:%d ==> %s:%d",
> > 	      prev->comm, prev->pid, next->comm, next->pid),
> > 	TRACE_STRUCT(
> > 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > 		TRACE_FIELD(int, prev_prio, prev->prio)
> > 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > 				    next_comm,
> > 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > 						 next->comm,
> > 						 TASK_COMM_LEN)))
> > 		TRACE_FIELD(pid_t, next_pid, next->pid)
> > 		TRACE_FIELD(int, next_prio, next->prio)
> > 	),
> > 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > 	);
> > 
> > By making that, we automagically get this file:
> > 
> >  # cat /debug/tracing/events/sched/sched_switch/format 
> > name: sched_switch
> > ID: 29
> > format:
> > 	field:unsigned char type;	offset:0;	size:1;
> > 	field:unsigned char flags;	offset:1;	size:1;
> > 	field:unsigned char preempt_count;	offset:2;	size:1;
> > 	field:int pid;	offset:4;	size:4;
> > 	field:int tgid;	offset:8;	size:4;
> > 
> > 	field:pid_t prev_pid;	offset:12;	size:4;
> > 	field:int prev_prio;	offset:16;	size:4;
> > 	field special:char next_comm[TASK_COMM_LEN];	offset:20;	size:16;
> > 	field:pid_t next_pid;	offset:36;	size:4;
> > 	field:int next_prio;	offset:40;	size:4;
> > 
> > print fmt: "prev %d:%d ==> next %s:%d:%d"
> > 
> > 
> > Now with this patch set, we create a way to read the ftrace 
> > ring buffers directly, as a binary page. Splice has been used 
> > such that the user could splice the ring buffers without need 
> > to copy the pages. The pages are taken from the ring buffers 
> > and can be placed directly into files, without extra copies.
> > 
> >  # ls /debug/tracing/binary_buffers/
> > 0  1  2  3  4  5  6  7
> > 
> > One can either just use the read/write to grab live data from 
> > these buffers, or they could use splice. I have a simple file 
> > that reads this buffers using splice. Note, it only runs on 
> > one file, you can make multiple copies to run more. The ring 
> > buffers in ftrace are per cpu and they are not dependent on 
> > each other.
> > 
> > Also, if there is no data in the buffer, it returns -EAGAIN.
> > 
> >  # find /debug/tracing/events -name 'type' | while read f; do
> > >    echo raw > $f; done
> > 
> >  # find /debug/tracing/events -name 'enable' | while read f; do
> > >   echo 1 > $f; done
> > 
> >  # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> > 
> > Yes you can run multiple instances of this on different buffers.
> > 
> >  # ./rb-read.pl /tmp/buf-0
> > 
> > produces:
> > 
> >         0  [000]  7071.936459: (irq_handler_entry) irq 48
> >         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
> >         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
> >         0  [000]  7071.988813: (sched_wakeup) task 0 success=1
> >         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
> >      3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
> >         0  [000]  7072.020370: (sched_wakeup) task 0 success=1
> >         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
> >        16  [000]  7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> > 
> > Notice the "(sched_switch)" lines.
> > 
> >  The splice C code:
> >    http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> > 
> >  The perl script to parse:
> >    http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> > 
> >  And yes I know, they are both ugly :-p
> 
> Any chance of merging them into a single perl script? I.e. the 
> perl script would embedd the .c code and build it on every 
> invocation (into /tmp), so that one can do single-tool tracing 
> with no immediate binary components. Such self-sufficiency is 
> very handy when probing systems in a minimally invasive way.
> 
> And we could also have twice the ugliness for the same price.
> 
> > The following patches are in:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> > 
> >     branch: rfc/splice/tip/tracing/ftrace
> 
> Looks like the only problem with the code is the -EFAULT 
> handling bug it exposed in simple_read_from_buffer() - but that 
> fix should go on a separate track IMO.
> 
> So i've pulled it into tip:tracing - thanks Steve! We definitely 
> want this in v2.6.30.
> 
> 	Ingo
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
                   ` (6 preceding siblings ...)
  2009-03-04 10:26 ` Ingo Molnar
@ 2009-03-04 17:00 ` Mathieu Desnoyers
  2009-03-04 17:19   ` Peter Zijlstra
  2009-03-06 16:59   ` Steven Rostedt
  7 siblings, 2 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2009-03-04 17:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> RFC only, not for pulling, unless everyone is fine with these :-)
> 
> After telling the folks from Google that I had splice code that I needed
> to get working, I finally got around to doing it.
> 
> Not only did I get the splice code working, but I also made a perl
> script (might want to cover you eyes if you look at that code) that
> can automagically create the output from the binary files reading
> the format arguments in the /debugfs/tracing/events/<subsys>/<event>/format
> file.
> 
> From previous patches, we have in include/trace/sched_event_types.h:
> 
> #undef TRACE_SYSTEM
> #define TRACE_SYSTEM sched
> 
> TRACE_EVENT_FORMAT(sched_switch,
> 	TPPROTO(struct rq *rq, struct task_struct *prev,
> 		struct task_struct *next),
> 	TPARGS(rq, prev, next),
> 	TPFMT("task %s:%d ==> %s:%d",
> 	      prev->comm, prev->pid, next->comm, next->pid),
> 	TRACE_STRUCT(
> 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> 		TRACE_FIELD(int, prev_prio, prev->prio)
> 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> 				    next_comm,
> 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> 						 next->comm,
> 						 TASK_COMM_LEN)))
> 		TRACE_FIELD(pid_t, next_pid, next->pid)
> 		TRACE_FIELD(int, next_prio, next->prio)
> 	),
> 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> 	);
> 

I fear that putting these user-visible format strings in tracepoint
header files will create a big maintainability issue.

I'll post the LTTng patchset in a jiffy, where the format string
awareness is done in a tracer-specific module. I don't understand why
Peter Z. is not yelling against your tracepoints modifications : they
are actually presenting to userspace an interface that is meant to
eventually change.

I used a separate layer for format string presentation for this very
purpose : I don't want to tie the kernel code instrumentation
(tracepoints) to any kind of user-visible API.

Mathieu

> By making that, we automagically get this file:
> 
>  # cat /debug/tracing/events/sched/sched_switch/format 
> name: sched_switch
> ID: 29
> format:
> 	field:unsigned char type;	offset:0;	size:1;
> 	field:unsigned char flags;	offset:1;	size:1;
> 	field:unsigned char preempt_count;	offset:2;	size:1;
> 	field:int pid;	offset:4;	size:4;
> 	field:int tgid;	offset:8;	size:4;
> 
> 	field:pid_t prev_pid;	offset:12;	size:4;
> 	field:int prev_prio;	offset:16;	size:4;
> 	field special:char next_comm[TASK_COMM_LEN];	offset:20;	size:16;
> 	field:pid_t next_pid;	offset:36;	size:4;
> 	field:int next_prio;	offset:40;	size:4;
> 
> print fmt: "prev %d:%d ==> next %s:%d:%d"
> 
> 
> Now with this patch set, we create a way to read the ftrace ring buffers
> directly, as a binary page. Splice has been used such that the user
> could splice the ring buffers without need to copy the pages.
> The pages are taken from the ring buffers and can be placed directly
> into files, without extra copies.
> 
>  # ls /debug/tracing/binary_buffers/
> 0  1  2  3  4  5  6  7
> 
> One can either just use the read/write to grab live data from these
> buffers, or they could use splice. I have a simple file that reads
> this buffers using splice. Note, it only runs on one file, you
> can make multiple copies to run more. The ring buffers in ftrace
> are per cpu and they are not dependent on each other.
> 
> Also, if there is no data in the buffer, it returns -EAGAIN.
> 
>  # find /debug/tracing/events -name 'type' | while read f; do
> >    echo raw > $f; done
> 
>  # find /debug/tracing/events -name 'enable' | while read f; do
> >   echo 1 > $f; done
> 
>  # ./splice /debug/tracing/binary_buffers/0 /tmp/buf-0
> 
> Yes you can run multiple instances of this on different buffers.
> 
>  # ./rb-read.pl /tmp/buf-0
> 
> produces:
> 
>         0  [000]  7071.936459: (irq_handler_entry) irq 48
>         0  [000]  7071.936462: (irq_handler_exit) irq 48 ret 1
>         0  [000]  7071.988801: (sched_signal_send) sig: 14  task 0
>         0  [000]  7071.988813: (sched_wakeup) task 0 success=1
>         0  [000]  7071.988823: (sched_switch) prev 0:140 ==> next ntpd:6582272:0
>      3303  [000]  7071.988916: (sched_switch) prev 0:120 ==> next swap:7497072:0
>         0  [000]  7072.020370: (sched_wakeup) task 0 success=1
>         0  [000]  7072.020407: (sched_switch) prev 0:140 ==> next even:808416116:0
>        16  [000]  7071.936157: (sched_switch) prev 0:115 ==> next swap:7497072:0
> 
> Notice the "(sched_switch)" lines.
> 
>  The splice C code:
>    http://www.kernel.org/pub/linux/kernel/people/rostedt/splice.c
> 
>  The perl script to parse:
>    http://www.kernel.org/pub/linux/kernel/people/rostedt/rb-read.pl
> 
>  And yes I know, they are both ugly :-p
> 
> The following patches are in:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
> 
>     branch: rfc/splice/tip/tracing/ftrace
> 
> 
> Steven Rostedt (5):
>       ring-buffer: reset write field for ring_buffer_read_page
>       ring-buffer: fix ring_buffer_read_page
>       ring-buffer: replace sizeof of event header with offsetof
>       ring-buffer: make ring_buffer_read_page read from start on partial page
>       tracing: add binary buffer files for use with splice
> 
> ----
>  include/linux/ring_buffer.h |    7 +-
>  kernel/trace/ring_buffer.c  |  118 +++++++++++++------
>  kernel/trace/trace.c        |  274 +++++++++++++++++++++++++++++++++++++++++-
>  kernel/trace/trace.h        |    1 +
>  4 files changed, 357 insertions(+), 43 deletions(-)
> -- 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04 17:00 ` Mathieu Desnoyers
@ 2009-03-04 17:19   ` Peter Zijlstra
  2009-03-06 16:59   ` Steven Rostedt
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2009-03-04 17:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md

On Wed, 2009-03-04 at 12:00 -0500, Mathieu Desnoyers wrote:

> > TRACE_EVENT_FORMAT(sched_switch,
> > 	TPPROTO(struct rq *rq, struct task_struct *prev,
> > 		struct task_struct *next),
> > 	TPARGS(rq, prev, next),
> > 	TPFMT("task %s:%d ==> %s:%d",
> > 	      prev->comm, prev->pid, next->comm, next->pid),
> > 	TRACE_STRUCT(
> > 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > 		TRACE_FIELD(int, prev_prio, prev->prio)
> > 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > 				    next_comm,
> > 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > 						 next->comm,
> > 						 TASK_COMM_LEN)))
> > 		TRACE_FIELD(pid_t, next_pid, next->pid)
> > 		TRACE_FIELD(int, next_prio, next->prio)
> > 	),
> > 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > 	);
> > 
> 
> I fear that putting these user-visible format strings in tracepoint
> header files will create a big maintainability issue.
> 
> I'll post the LTTng patchset in a jiffy, where the format string
> awareness is done in a tracer-specific module. I don't understand why
> Peter Z. is not yelling against your tracepoints modifications : they
> are actually presenting to userspace an interface that is meant to
> eventually change.
> 
> I used a separate layer for format string presentation for this very
> purpose : I don't want to tie the kernel code instrumentation
> (tracepoints) to any kind of user-visible API.

Not only that, its butt-ugly to boot.

I'm just not sure what to say anymore.. people seem insistent on doing
this, look at the recent google trace and this zedtrace thing.

What I can live with is if we tie it 1:1 to our code structure by means
of code generation, in that case we can simply argue its a direct
representation of our code, and we don't have any API rules on that.

Arnaldo proposed using sparse and tree rewriting to accomplish this.

The above however creates these artificial things that can arguably be
kept static, therefore one could defend that expectation.

What's worse, it doesn't even come with version information :-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04 14:51   ` Steven Rostedt
@ 2009-03-04 22:47     ` Ingo Oeser
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Oeser @ 2009-03-04 22:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Mathieu Desnoyers, Martin Bligh, Frank Ch. Eigler, Tom Zanussi,
	Masami Hiramatsu, KOSAKI Motohiro, Jason Baron,
	Christoph Hellwig, Jiaying Zhang, Eduard - Gabriel Munteanu,
	mrubin, md

Hi Steven,

On Wednesday 04 March 2009, Steven Rostedt wrote:
> On Wed, 4 Mar 2009, Ingo Molnar wrote:
> > Any chance of merging them into a single perl script? I.e. the 
> > perl script would embedd the .c code and build it on every 
> > invocation (into /tmp), so that one can do single-tool tracing 
> > with no immediate binary components. Such self-sufficiency is 
> > very handy when probing systems in a minimally invasive way.
> > 
> > And we could also have twice the ugliness for the same price.
> > 
> 
> Gag, that would really be ugly code ;-)

If you want really ugly code, you could do the SYS_splice in perl
via syscall. Just look it up via "perldoc -f syscall" and play around :-)

Best Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-04 17:00 ` Mathieu Desnoyers
  2009-03-04 17:19   ` Peter Zijlstra
@ 2009-03-06 16:59   ` Steven Rostedt
  2009-03-06 19:10     ` Mathieu Desnoyers
  1 sibling, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2009-03-06 16:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md


On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
> > 
> > From previous patches, we have in include/trace/sched_event_types.h:
> > 
> > #undef TRACE_SYSTEM
> > #define TRACE_SYSTEM sched
> > 
> > TRACE_EVENT_FORMAT(sched_switch,
> > 	TPPROTO(struct rq *rq, struct task_struct *prev,
> > 		struct task_struct *next),
> > 	TPARGS(rq, prev, next),
> > 	TPFMT("task %s:%d ==> %s:%d",
> > 	      prev->comm, prev->pid, next->comm, next->pid),
> > 	TRACE_STRUCT(
> > 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > 		TRACE_FIELD(int, prev_prio, prev->prio)
> > 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > 				    next_comm,
> > 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > 						 next->comm,
> > 						 TASK_COMM_LEN)))
> > 		TRACE_FIELD(pid_t, next_pid, next->pid)
> > 		TRACE_FIELD(int, next_prio, next->prio)
> > 	),
> > 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > 	);
> > 
> 
> I fear that putting these user-visible format strings in tracepoint
> header files will create a big maintainability issue.
> 
> I'll post the LTTng patchset in a jiffy, where the format string
> awareness is done in a tracer-specific module. I don't understand why
> Peter Z. is not yelling against your tracepoints modifications : they
> are actually presenting to userspace an interface that is meant to
> eventually change.
> 
> I used a separate layer for format string presentation for this very
> purpose : I don't want to tie the kernel code instrumentation
> (tracepoints) to any kind of user-visible API.
> 

Hi Mathieu,

Sorry for the late reply, I started to reply but was distracted, and sadly 
forgot about it :-(  (and I rebooted the box that had the reply window 
open)


Just a couple of points to make.

1)  The raw format string is a "hint" for userspace to read the buffers.
The offset and sizeof is the real data that userspace should use. The
print format can be just a way to help them out. But it is by no means 
something that is expected to be constant.

It is also used internal by ftrace because it needs a way to read the 
"fast C style recording" records. It is better to print some text than to 
have a bunch of hex print out in the screen. Say you have the C style 
recording enabled and you take an oops. With ftrace_dumps_on_oops set, the 
buffers will be printed to the console.  This lets ftrace have a means to 
print out the events.

2) The difference between this and markers is that the format string is 
bound to the declaration in the trace header. Not in the code itself. The 
maintainer should only see the trace point function call. It should only 
look like a function call and not some funky printf string that the 
maintainer might not care about.

If the code changes and it breaks the format string, it is up to the 
tracing maintainers to fix it. If the format string was embedded within 
the code, then the maintainer would be burdoned with that duty. But here 
the format string is in the include/trace directory and any breakage there 
should definitely be fixed by a some tracing maintainer. Not the 
maintainer of the subsystem. Unless of course that maintainer wants to fix 
it ;-)


3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may 
choose to use the more simple, but a bit more expensive TRACE_FORMAT if 
they choose to. Perhaps it may be more prudent to anyway. Recently
Peter Zijlstra was tracing locks and just wanted to trace the name. 
There's really no difference in copying the name via a C style (strcpy) or 
having a printf formatting do the work. Here is a case where just using 
TRACE_FORMAT is a reasonable solution.

4) I actually avoided moving the format string into the tracing utility. 
The tracing utility can ignore it if it wants. I orginially had the 
headers included in the kernel/trace/events.c file and found that was too 
cumbersome. I did not like the fact that I needed to go into some tracing 
infrastructure to add an event.  It was a burdon to me to add a new event 
and I wrote the code!  That's why I moved it all into the include/trace 
directory. One stop shopping.

A developer to add new events only needs to add the trace points in their 
code and then update the files (perhaps add new ones) in include/trace. No 
need to dig through the kernel source to find out where else they need to 
go.


I see that you published your patch queue. I've looked at most of the 
patches already. I'll make my comments shortly.

Thanks,

-- Steve


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-06 16:59   ` Steven Rostedt
@ 2009-03-06 19:10     ` Mathieu Desnoyers
  2009-03-06 23:28       ` Jiaying Zhang
  0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2009-03-06 19:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Frederic Weisbecker, Theodore Tso, Arjan van de Ven,
	Pekka Paalanen, Arnaldo Carvalho de Melo, H. Peter Anvin,
	Martin Bligh, Frank Ch. Eigler, Tom Zanussi, Masami Hiramatsu,
	KOSAKI Motohiro, Jason Baron, Christoph Hellwig, Jiaying Zhang,
	Eduard - Gabriel Munteanu, mrubin, md

* Steven Rostedt (rostedt@goodmis.org) wrote:
> 
> On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
> > > 
> > > From previous patches, we have in include/trace/sched_event_types.h:
> > > 
> > > #undef TRACE_SYSTEM
> > > #define TRACE_SYSTEM sched
> > > 
> > > TRACE_EVENT_FORMAT(sched_switch,
> > > 	TPPROTO(struct rq *rq, struct task_struct *prev,
> > > 		struct task_struct *next),
> > > 	TPARGS(rq, prev, next),
> > > 	TPFMT("task %s:%d ==> %s:%d",
> > > 	      prev->comm, prev->pid, next->comm, next->pid),
> > > 	TRACE_STRUCT(
> > > 		TRACE_FIELD(pid_t, prev_pid, prev->pid)
> > > 		TRACE_FIELD(int, prev_prio, prev->prio)
> > > 		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
> > > 				    next_comm,
> > > 				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
> > > 						 next->comm,
> > > 						 TASK_COMM_LEN)))
> > > 		TRACE_FIELD(pid_t, next_pid, next->pid)
> > > 		TRACE_FIELD(int, next_prio, next->prio)
> > > 	),
> > > 	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
> > > 	);
> > > 
> > 
> > I fear that putting these user-visible format strings in tracepoint
> > header files will create a big maintainability issue.
> > 
> > I'll post the LTTng patchset in a jiffy, where the format string
> > awareness is done in a tracer-specific module. I don't understand why
> > Peter Z. is not yelling against your tracepoints modifications : they
> > are actually presenting to userspace an interface that is meant to
> > eventually change.
> > 
> > I used a separate layer for format string presentation for this very
> > purpose : I don't want to tie the kernel code instrumentation
> > (tracepoints) to any kind of user-visible API.
> > 
> 
> Hi Mathieu,
> 
> Sorry for the late reply, I started to reply but was distracted, and sadly 
> forgot about it :-(  (and I rebooted the box that had the reply window 
> open)
> 
> 
> Just a couple of points to make.
> 
> 1)  The raw format string is a "hint" for userspace to read the buffers.
> The offset and sizeof is the real data that userspace should use. The
> print format can be just a way to help them out. But it is by no means 
> something that is expected to be constant.
> 
> It is also used internal by ftrace because it needs a way to read the 
> "fast C style recording" records. It is better to print some text than to 
> have a bunch of hex print out in the screen. Say you have the C style 
> recording enabled and you take an oops. With ftrace_dumps_on_oops set, the 
> buffers will be printed to the console.  This lets ftrace have a means to 
> print out the events.
> 
> 2) The difference between this and markers is that the format string is 
> bound to the declaration in the trace header. Not in the code itself. The 
> maintainer should only see the trace point function call. It should only 
> look like a function call and not some funky printf string that the 
> maintainer might not care about.
> 
> If the code changes and it breaks the format string, it is up to the 
> tracing maintainers to fix it. If the format string was embedded within 
> the code, then the maintainer would be burdoned with that duty. But here 
> the format string is in the include/trace directory and any breakage there 
> should definitely be fixed by a some tracing maintainer. Not the 
> maintainer of the subsystem. Unless of course that maintainer wants to fix 
> it ;-)
> 
> 
> 3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may 
> choose to use the more simple, but a bit more expensive TRACE_FORMAT if 
> they choose to. Perhaps it may be more prudent to anyway. Recently
> Peter Zijlstra was tracing locks and just wanted to trace the name. 
> There's really no difference in copying the name via a C style (strcpy) or 
> having a printf formatting do the work. Here is a case where just using 
> TRACE_FORMAT is a reasonable solution.
> 
> 4) I actually avoided moving the format string into the tracing utility. 
> The tracing utility can ignore it if it wants. I orginially had the 
> headers included in the kernel/trace/events.c file and found that was too 
> cumbersome. I did not like the fact that I needed to go into some tracing 
> infrastructure to add an event.  It was a burdon to me to add a new event 
> and I wrote the code!  That's why I moved it all into the include/trace 
> directory. One stop shopping.
> 
> A developer to add new events only needs to add the trace points in their 
> code and then update the files (perhaps add new ones) in include/trace. No 
> need to dig through the kernel source to find out where else they need to 
> go.
> 
> 
> I see that you published your patch queue. I've looked at most of the 
> patches already. I'll make my comments shortly.
> 

Hi Steve,

I totally agree that we want something like a format string to identify
the data we export. My disagreement is on the location where such
identification belongs. I have the funny feeling that you are currently
creating no less than full-blown callbacks expressed as a weird, custom,
macros-based, header definition, and I think it's seriously wrong.

To paraphrase Andrew Morton : we are writing in C. Let's keep it that
way.

This is why I strongly prefer to use the tracepoint callback -> marker
format string approach rather that putting any sort of format string in
the tracepoint header.

If what you really worry about is to have the ability to "quickly" add
debug-style instrumentation when you write code, then you do not want,
nor should ever, use a tracepoint, because tracepoint location is not
meant to be decided without careful consideration. Markers are there
fore this : you can add a quick-and-dirty one-liner :

  trace_mark(group, event, "field_identifier1 %u  f_ident2 %u", ...);

In your source code and you are all set. I don't see the problem with
such approach ?

Mathieu

> Thanks,
> 
> -- Steve
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-06 19:10     ` Mathieu Desnoyers
@ 2009-03-06 23:28       ` Jiaying Zhang
  2009-03-08 19:21         ` Ingo Molnar
  0 siblings, 1 reply; 26+ messages in thread
From: Jiaying Zhang @ 2009-03-06 23:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Frederic Weisbecker, Theodore Tso,
	Arjan van de Ven, Pekka Paalanen, Arnaldo Carvalho de Melo,
	H. Peter Anvin, Martin Bligh, Frank Ch. Eigler, Tom Zanussi,
	Masami Hiramatsu, KOSAKI Motohiro, Jason Baron,
	Christoph Hellwig, Eduard - Gabriel Munteanu, mrubin, md

Sorry for getting into this discussion late.

I would like to point out that we think it is really important to have
some very efficient probing mechanism in the kernel for tracing in
production environments. The printf and va_arg based probes are
flexible but less efficient when we want to trace high-throughput events.
Even function calls can add noticeable overhead according to our
measurements. So I think we need to provide a way (mostly via
macro definitions) with which a subsystem can enter an event into
a trace buffer through a short code path. I.e., we should limit the
number of callbacks and avoid format string parsing.

As I understand, Steven's latest TRACE_FIELD patch avoids such
overhead, although it does seem to add complexity for adding new
trace points. It would be nice if we can replace the above sched_switch
declaration with just a couple of macros.

Jiaying


On Fri, Mar 6, 2009 at 11:10 AM, Mathieu Desnoyers
<compudj@krystal.dyndns.org> wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
>>
>> On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
>> > >
>> > > From previous patches, we have in include/trace/sched_event_types.h:
>> > >
>> > > #undef TRACE_SYSTEM
>> > > #define TRACE_SYSTEM sched
>> > >
>> > > TRACE_EVENT_FORMAT(sched_switch,
>> > >   TPPROTO(struct rq *rq, struct task_struct *prev,
>> > >           struct task_struct *next),
>> > >   TPARGS(rq, prev, next),
>> > >   TPFMT("task %s:%d ==> %s:%d",
>> > >         prev->comm, prev->pid, next->comm, next->pid),
>> > >   TRACE_STRUCT(
>> > >           TRACE_FIELD(pid_t, prev_pid, prev->pid)
>> > >           TRACE_FIELD(int, prev_prio, prev->prio)
>> > >           TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
>> > >                               next_comm,
>> > >                               TPCMD(memcpy(TRACE_ENTRY->next_comm,
>> > >                                            next->comm,
>> > >                                            TASK_COMM_LEN)))
>> > >           TRACE_FIELD(pid_t, next_pid, next->pid)
>> > >           TRACE_FIELD(int, next_prio, next->prio)
>> > >   ),
>> > >   TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
>> > >   );
>> > >
>> >
>> > I fear that putting these user-visible format strings in tracepoint
>> > header files will create a big maintainability issue.
>> >
>> > I'll post the LTTng patchset in a jiffy, where the format string
>> > awareness is done in a tracer-specific module. I don't understand why
>> > Peter Z. is not yelling against your tracepoints modifications : they
>> > are actually presenting to userspace an interface that is meant to
>> > eventually change.
>> >
>> > I used a separate layer for format string presentation for this very
>> > purpose : I don't want to tie the kernel code instrumentation
>> > (tracepoints) to any kind of user-visible API.
>> >
>>
>> Hi Mathieu,
>>
>> Sorry for the late reply, I started to reply but was distracted, and sadly
>> forgot about it :-(  (and I rebooted the box that had the reply window
>> open)
>>
>>
>> Just a couple of points to make.
>>
>> 1)  The raw format string is a "hint" for userspace to read the buffers.
>> The offset and sizeof is the real data that userspace should use. The
>> print format can be just a way to help them out. But it is by no means
>> something that is expected to be constant.
>>
>> It is also used internal by ftrace because it needs a way to read the
>> "fast C style recording" records. It is better to print some text than to
>> have a bunch of hex print out in the screen. Say you have the C style
>> recording enabled and you take an oops. With ftrace_dumps_on_oops set, the
>> buffers will be printed to the console.  This lets ftrace have a means to
>> print out the events.
>>
>> 2) The difference between this and markers is that the format string is
>> bound to the declaration in the trace header. Not in the code itself. The
>> maintainer should only see the trace point function call. It should only
>> look like a function call and not some funky printf string that the
>> maintainer might not care about.
>>
>> If the code changes and it breaks the format string, it is up to the
>> tracing maintainers to fix it. If the format string was embedded within
>> the code, then the maintainer would be burdoned with that duty. But here
>> the format string is in the include/trace directory and any breakage there
>> should definitely be fixed by a some tracing maintainer. Not the
>> maintainer of the subsystem. Unless of course that maintainer wants to fix
>> it ;-)
>>
>>
>> 3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may
>> choose to use the more simple, but a bit more expensive TRACE_FORMAT if
>> they choose to. Perhaps it may be more prudent to anyway. Recently
>> Peter Zijlstra was tracing locks and just wanted to trace the name.
>> There's really no difference in copying the name via a C style (strcpy) or
>> having a printf formatting do the work. Here is a case where just using
>> TRACE_FORMAT is a reasonable solution.
>>
>> 4) I actually avoided moving the format string into the tracing utility.
>> The tracing utility can ignore it if it wants. I orginially had the
>> headers included in the kernel/trace/events.c file and found that was too
>> cumbersome. I did not like the fact that I needed to go into some tracing
>> infrastructure to add an event.  It was a burdon to me to add a new event
>> and I wrote the code!  That's why I moved it all into the include/trace
>> directory. One stop shopping.
>>
>> A developer to add new events only needs to add the trace points in their
>> code and then update the files (perhaps add new ones) in include/trace. No
>> need to dig through the kernel source to find out where else they need to
>> go.
>>
>>
>> I see that you published your patch queue. I've looked at most of the
>> patches already. I'll make my comments shortly.
>>
>
> Hi Steve,
>
> I totally agree that we want something like a format string to identify
> the data we export. My disagreement is on the location where such
> identification belongs. I have the funny feeling that you are currently
> creating no less than full-blown callbacks expressed as a weird, custom,
> macros-based, header definition, and I think it's seriously wrong.
>
> To paraphrase Andrew Morton : we are writing in C. Let's keep it that
> way.
>
> This is why I strongly prefer to use the tracepoint callback -> marker
> format string approach rather that putting any sort of format string in
> the tracepoint header.
>
> If what you really worry about is to have the ability to "quickly" add
> debug-style instrumentation when you write code, then you do not want,
> nor should ever, use a tracepoint, because tracepoint location is not
> meant to be decided without careful consideration. Markers are there
> fore this : you can add a quick-and-dirty one-liner :
>
>  trace_mark(group, event, "field_identifier1 %u  f_ident2 %u", ...);
>
> In your source code and you are all set. I don't see the problem with
> such approach ?
>
> Mathieu
>
>> Thanks,
>>
>> -- Steve
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers
  2009-03-06 23:28       ` Jiaying Zhang
@ 2009-03-08 19:21         ` Ingo Molnar
  0 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2009-03-08 19:21 UTC (permalink / raw)
  To: Jiaying Zhang
  Cc: Mathieu Desnoyers, Steven Rostedt, linux-kernel, Andrew Morton,
	Peter Zijlstra, Frederic Weisbecker, Theodore Tso,
	Arjan van de Ven, Pekka Paalanen, Arnaldo Carvalho de Melo,
	H. Peter Anvin, Martin Bligh, Frank Ch. Eigler, Tom Zanussi,
	Masami Hiramatsu, KOSAKI Motohiro, Jason Baron,
	Christoph Hellwig, Eduard - Gabriel Munteanu, mrubin, md


* Jiaying Zhang <jiayingz@google.com> wrote:

> I would like to point out that we think it is really important 
> to have some very efficient probing mechanism in the kernel 
> for tracing in production environments. The printf and va_arg 
> based probes are flexible but less efficient when we want to 
> trace high-throughput events. Even function calls can add 
> noticeable overhead according to our measurements. So I think 
> we need to provide a way (mostly via macro definitions) with 
> which a subsystem can enter an event into a trace buffer 
> through a short code path. I.e., we should limit the number of 
> callbacks and avoid format string parsing.
> 
> As I understand, Steven's latest TRACE_FIELD patch avoids such 
> overhead, although it does seem to add complexity for adding 
> new trace points. [...]

Yeah - it was motivated by the patches you sent to lkml which 
showed that it's possible to do it quite sanely and that it can 
be done faster.

> [...] It would be nice if we can replace the above 
> sched_switch declaration with just a couple of macros.

Good point - there's ongoing work to simplify the TRACE_FIELD 
approach. The current (not yet pushed out) optimized tracepoint 
format Steve is working on is:

/*
 * Tracepoint for task switches, performed by the scheduler:
 *
 * (NOTE: the 'rq' argument is not used by generic trace events,
 *        but used by the latency tracer plugin. )
 */
TRACE_EVENT(sched_switch,

	TP_PROTO(struct rq *rq, struct task_struct *prev,
		 struct task_struct *next),

	TP_ARGS(rq, prev, next),

	TP_STRUCT__entry(
		__array(	char,	prev_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	prev_pid			)
		__field(	int,	prev_prio			)
		__array(	char,	next_comm,	TASK_COMM_LEN	)
		__field(	pid_t,	next_pid			)
		__field(	int,	next_prio			)
	),

	TP_printk("task %s:%d [%d] ==> %s:%d [%d]",
		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
		__entry->next_comm, __entry->next_pid, __entry->next_prio),

	TP_fast_assign(
		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
		__entry->prev_pid	= prev->pid;
		__entry->prev_prio	= prev->prio;
		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
		__entry->next_pid	= next->pid;
		__entry->next_prio	= next->prio;
	)
);

As you can see it enumerates fields, provides format-based 
tracing and a tracepoint as well. It also looks quite similar to 
C syntax while still being an information-dense macro.

	Ingo

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2009-03-08 19:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-04  2:49 [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
2009-03-04  2:49 ` [PATCH 1/5] ring-buffer: reset write field for ring_buffer_read_page Steven Rostedt
2009-03-04  2:49 ` [PATCH 2/5] ring-buffer: fix ring_buffer_read_page Steven Rostedt
2009-03-04  2:49 ` [PATCH 3/5] ring-buffer: replace sizeof of event header with offsetof Steven Rostedt
2009-03-04  2:49 ` [PATCH 4/5] ring-buffer: make ring_buffer_read_page read from start on partial page Steven Rostedt
2009-03-04  2:49 ` [PATCH 5/5] tracing: add binary buffer files for use with splice Steven Rostedt
2009-03-04  3:35   ` Andrew Morton
2009-03-04  3:43     ` Steven Rostedt
2009-03-04  4:38       ` H. Peter Anvin
2009-03-04  4:45         ` Steven Rostedt
2009-03-04  4:46       ` Theodore Tso
2009-03-04  4:49         ` Steven Rostedt
2009-03-04  5:07         ` [PATCH] fs: make simple_read_from_buffer conventional Steven Rostedt
2009-03-04 10:12           ` Ingo Molnar
2009-03-04  3:01 ` [PATCH 0/5] [RFC] binary reading of ftrace ring buffers Steven Rostedt
2009-03-04  3:23   ` Steven Rostedt
2009-03-04 10:26 ` Ingo Molnar
2009-03-04 14:51   ` Steven Rostedt
2009-03-04 22:47     ` Ingo Oeser
2009-03-04 15:39   ` Mathieu Desnoyers
2009-03-04 17:00 ` Mathieu Desnoyers
2009-03-04 17:19   ` Peter Zijlstra
2009-03-06 16:59   ` Steven Rostedt
2009-03-06 19:10     ` Mathieu Desnoyers
2009-03-06 23:28       ` Jiaying Zhang
2009-03-08 19:21         ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).