* [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
@ 2020-03-19 23:22 Steven Rostedt
  2020-03-19 23:22 ` [PATCH 01/12 v2] selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer Steven Rostedt
                   ` (12 more replies)
  0 siblings, 13 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf


When the ring buffer was first written for ftrace, there were two
human-readable files to read it. One was a standard "producer/consumer"
file (trace_pipe), which would consume data from the ring buffer as
it read it, and the other was a "static iterator" that would not
consume the events, such that the file could be read multiple times
and return the same output each time.

The "static iterator" was never meant to be read while there was an
active writer to the ring buffer. If writing was enabled, then it
would disable the writer when the trace file was opened.

There have been some complaints about this from the BPF folks, who did
not realize this behavior, and it was requested that the "trace" file
not stop writing to the ring buffer.

This patch series attempts to satisfy that request by creating a
temporary buffer in each of the per-CPU iterators to place the read
event into, such that it can be passed to users without worrying
about a writer corrupting the event while it is being written out.
It also uses the fact that the ring buffer is broken up into pages,
where each page has its own timestamp that gets updated when a
writer crosses over to it. Copying the event to the temp buffer and
doing a "before and after" test of the time stamp, with memory barriers,
allows the events to be read safely.
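
For illustration, the read side ends up doing roughly the following (a
simplified sketch; event_addr and event_len are hypothetical names for
the event's address and length, and the real code is in patch 6):

	/* page_stamp was recorded when the iterator entered this page */
	u64 page_stamp = iter->page_stamp;

	/* copy the event out of the (possibly live) ring buffer page */
	memcpy(iter->event, event_addr, event_len);

	/*
	 * The writer's page-crossing cmpxchg is a full barrier that
	 * pairs with this rmb.
	 */
	smp_rmb();

	/* if the stamp moved, a writer entered the page during the copy */
	if (page_stamp != iter->head_page->page->time_stamp)
		return NULL;		/* discard the copy and retry */

	return iter->event;		/* stable copy, safe to use */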

Changes since v1:

 - Added the fix to the selftest first, so that these changes won't break it

 - Changed comment in trace_find_next_entry() to better explain what
   it was doing, as pointed out by Masami Hiramatsu.

 - Allocated the iterator temp buffer when the iterator is created;
   as Masami pointed out, that is better than allocating it each
   time it is used. It is initialized to 128 bytes, as most trace events
   are smaller than that, but it will be expanded if needed. Note that
   this code path is only used when latency measurements are needed
   (viewing two events at once).

Steven Rostedt (VMware) (12):
      selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer
      tracing: Save off entry when peeking at next entry
      ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
      ring-buffer: Rename ring_buffer_read() to ring_buffer_iter_advance()
      ring-buffer: Add page_stamp to iterator for synchronization
      ring-buffer: Have rb_iter_head_event() handle concurrent writer
      ring-buffer: Do not die if rb_iter_peek() fails more than thrice
      ring-buffer: Optimize rb_iter_head_event()
      ring-buffer: Do not disable recording when there is an iterator
      tracing: Do not disable tracing when reading the trace file
      ring-buffer/tracing: Have iterator acknowledge dropped events
      tracing: Have the document reflect that the trace file keeps tracing enabled

----
 Documentation/trace/ftrace.rst                     |  13 +-
 include/linux/ring_buffer.h                        |   4 +-
 include/linux/trace_events.h                       |   2 +
 kernel/trace/ring_buffer.c                         | 196 +++++++++++++++------
 kernel/trace/trace.c                               |  68 +++++--
 kernel/trace/trace_functions_graph.c               |   2 +-
 kernel/trace/trace_output.c                        |  15 +-
 .../test.d/ftrace/func_traceonoff_triggers.tc      |   2 +-
 8 files changed, 211 insertions(+), 91 deletions(-)


* [PATCH 01/12 v2] selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry Steven Rostedt
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf, Shuah Khan, linux-kselftest

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

The ftrace selftest "ftrace - test for function traceon/off triggers"
enables all events and reads the trace file. Now that the trace file does
not disable tracing, and will attempt to continually read any new data that
is added, the selftest gets stuck reading the trace file: the data is added
to the ring buffer faster than it can be read.

Instead of enabling all events, enable only the scheduler events, so that
the read can keep up with the writes.

Link: http://lkml.kernel.org/r/20200318111345.0516642e@gandalf.local.home

Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 .../selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
index 0c04282d33dd..1947387fe976 100644
--- a/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
+++ b/tools/testing/selftests/ftrace/test.d/ftrace/func_traceonoff_triggers.tc
@@ -41,7 +41,7 @@ fi
 
 echo '** ENABLE EVENTS'
 
-echo 1 > events/enable
+echo 1 > events/sched/enable
 
 echo '** ENABLE TRACING'
 enable_tracing
-- 
2.25.1




* [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
  2020-03-19 23:22 ` [PATCH 01/12 v2] selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-20  2:57   ` Masami Hiramatsu
       [not found]   ` <20200326091256.GR11705@shao2-debian>
  2020-03-19 23:22 ` [PATCH 03/12 v2] ring-buffer: Have ring_buffer_empty() not depend on tracing stopped Steven Rostedt
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

In order to have the iterator read the buffer even when it's still updating,
it requires that the ring buffer iterator saves each event in a separate
location outside the ring buffer, such that the copy it hands out is
immutable.

There's one use case that saves off the event returned from the ring buffer
iterator and calls it again to look at the next event, before going back to
use the first event. As the ring buffer iterator will only have a single
copy, this use case will no longer be supported.

Instead, have the one use case create its own buffer to store the first
event when looking at the next event. This way, when looking at the first
event again, it won't be corrupted by the second read.
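
To illustrate the problem being fixed, the pattern that breaks with a
single-copy iterator looks roughly like this (simplified sketch;
print_latency() is a hypothetical stand-in for the real caller):

	struct trace_entry *entry = iter->ent;	/* the current event */

	/* peeking ahead may reuse the iterator's single event copy ... */
	next_entry = trace_find_next_entry(iter, NULL, &next_ts);

	/* ... which would leave "entry" pointing at clobbered data */
	print_latency(entry, next_entry);

With this patch, trace_find_next_entry() copies iter->ent into the new
iter->temp buffer and points iter->ent at that copy, so the caller can
safely re-read iter->ent after peeking.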

Link: http://lkml.kernel.org/r/20200317213415.722539921@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 include/linux/trace_events.h |  2 ++
 kernel/trace/trace.c         | 40 +++++++++++++++++++++++++++++++++++-
 kernel/trace/trace_output.c  | 15 ++++++--------
 3 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6c7a10a6d71e..5c6943354049 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -85,6 +85,8 @@ struct trace_iterator {
 	struct mutex		mutex;
 	struct ring_buffer_iter	**buffer_iter;
 	unsigned long		iter_flags;
+	void			*temp;	/* temp holder */
+	unsigned int		temp_size;
 
 	/* trace_seq for __print_flags() and __print_symbolic() etc. */
 	struct trace_seq	tmp_seq;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 02be4ddd4ad5..819e31d0d66c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3466,7 +3466,31 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
 struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 					  int *ent_cpu, u64 *ent_ts)
 {
-	return __find_next_entry(iter, ent_cpu, NULL, ent_ts);
+	/* __find_next_entry will reset ent_size */
+	int ent_size = iter->ent_size;
+	struct trace_entry *entry;
+
+	/*
+	 * The __find_next_entry() may call peek_next_entry(), which may
+	 * call ring_buffer_peek() that may make the contents of iter->ent
+	 * undefined. Need to copy iter->ent now.
+	 */
+	if (iter->ent && iter->ent != iter->temp) {
+		if (!iter->temp || iter->temp_size < iter->ent_size) {
+			kfree(iter->temp);
+			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
+			if (!iter->temp)
+				return NULL;
+		}
+		memcpy(iter->temp, iter->ent, iter->ent_size);
+		iter->temp_size = iter->ent_size;
+		iter->ent = iter->temp;
+	}
+	entry = __find_next_entry(iter, ent_cpu, NULL, ent_ts);
+	/* Put back the original ent_size */
+	iter->ent_size = ent_size;
+
+	return entry;
 }
 
 /* Find the next real entry, and increment the iterator to the next entry */
@@ -4197,6 +4221,18 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 	if (!iter->buffer_iter)
 		goto release;
 
+	/*
+	 * trace_find_next_entry() may need to save off iter->ent.
+	 * It will place it into the iter->temp buffer. As most
+	 * events are less than 128, allocate a buffer of that size.
+	 * If one is greater, then trace_find_next_entry() will
+	 * allocate a new buffer to adjust for the bigger iter->ent.
+	 * It's not critical if it fails to get allocated here.
+	 */
+	iter->temp = kmalloc(128, GFP_KERNEL);
+	if (iter->temp)
+		iter->temp_size = 128;
+
 	/*
 	 * We make a copy of the current tracer to avoid concurrent
 	 * changes on it while we are reading.
@@ -4269,6 +4305,7 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
  fail:
 	mutex_unlock(&trace_types_lock);
 	kfree(iter->trace);
+	kfree(iter->temp);
 	kfree(iter->buffer_iter);
 release:
 	seq_release_private(inode, file);
@@ -4344,6 +4381,7 @@ static int tracing_release(struct inode *inode, struct file *file)
 
 	mutex_destroy(&iter->mutex);
 	free_cpumask_var(iter->started);
+	kfree(iter->temp);
 	kfree(iter->trace);
 	kfree(iter->buffer_iter);
 	seq_release_private(inode, file);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index e25a7da79c6b..9a121e147102 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -617,22 +617,19 @@ int trace_print_context(struct trace_iterator *iter)
 
 int trace_print_lat_context(struct trace_iterator *iter)
 {
+	struct trace_entry *entry, *next_entry;
 	struct trace_array *tr = iter->tr;
-	/* trace_find_next_entry will reset ent_size */
-	int ent_size = iter->ent_size;
 	struct trace_seq *s = &iter->seq;
-	u64 next_ts;
-	struct trace_entry *entry = iter->ent,
-			   *next_entry = trace_find_next_entry(iter, NULL,
-							       &next_ts);
 	unsigned long verbose = (tr->trace_flags & TRACE_ITER_VERBOSE);
+	u64 next_ts;
 
-	/* Restore the original ent_size */
-	iter->ent_size = ent_size;
-
+	next_entry = trace_find_next_entry(iter, NULL, &next_ts);
 	if (!next_entry)
 		next_ts = iter->ts;
 
+	/* trace_find_next_entry() may change iter->ent */
+	entry = iter->ent;
+
 	if (verbose) {
 		char comm[TASK_COMM_LEN];
 
-- 
2.25.1




* [PATCH 03/12 v2] ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
  2020-03-19 23:22 ` [PATCH 01/12 v2] selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer Steven Rostedt
  2020-03-19 23:22 ` [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 04/12 v2] ring-buffer: Rename ring_buffer_read() to ring_buffer_iter_advance() Steven Rostedt
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

It has been a complaint that when the trace file is read, tracing is
disabled, as the iterator expects that the buffer it reads is not being
updated. Several steps are needed to make the iterator handle a concurrent
writer, by testing whether things have changed as it reads.

This step makes ring_buffer_iter_empty() expect the buffer to be changing.
Note that if the current location of the iterator is overwritten, it will
return false, as new data is being added. Note also that this means data
can be skipped.

Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 61f0e92ace99..1718520a2809 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3590,16 +3590,37 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
 	struct buffer_page *reader;
 	struct buffer_page *head_page;
 	struct buffer_page *commit_page;
+	struct buffer_page *curr_commit_page;
 	unsigned commit;
+	u64 curr_commit_ts;
+	u64 commit_ts;
 
 	cpu_buffer = iter->cpu_buffer;
-
-	/* Remember, trace recording is off when iterator is in use */
 	reader = cpu_buffer->reader_page;
 	head_page = cpu_buffer->head_page;
 	commit_page = cpu_buffer->commit_page;
+	commit_ts = commit_page->page->time_stamp;
+
+	/*
+	 * When the writer goes across pages, it issues a cmpxchg which
+	 * is a mb(), which will synchronize with the rmb here.
+	 * (see rb_tail_page_update())
+	 */
+	smp_rmb();
 	commit = rb_page_commit(commit_page);
+	/* We want to make sure that the commit page doesn't change */
+	smp_rmb();
+
+	/* Make sure commit page didn't change */
+	curr_commit_page = READ_ONCE(cpu_buffer->commit_page);
+	curr_commit_ts = READ_ONCE(curr_commit_page->page->time_stamp);
+
+	/* If the commit page changed, then there's more data */
+	if (curr_commit_page != commit_page ||
+	    curr_commit_ts != commit_ts)
+		return 0;
 
+	/* Still racy, as it may return a false positive, but that's OK */
 	return ((iter->head_page == commit_page && iter->head == commit) ||
 		(iter->head_page == reader && commit_page == head_page &&
 		 head_page->read == commit &&
-- 
2.25.1




* [PATCH 04/12 v2] ring-buffer: Rename ring_buffer_read() to ring_buffer_iter_advance()
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (2 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 03/12 v2] ring-buffer: Have ring_buffer_empty() not depend on tracing stopped Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 05/12 v2] ring-buffer: Add page_stamp to iterator for synchronization Steven Rostedt
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

When the ring buffer was first created, the iterator followed the normal
producer/consumer operations, where it had both a peek() operation, which
just returned the event at the current location, and a read(), which would
return the event at the current location and also increment the iterator
such that the next peek() or read() would return the next event.

The only use of ring_buffer_read() is currently to move the iterator to
the next location; nothing actually reads the event it returns. Rename
this function to match its actual use case: ring_buffer_iter_advance().
This also adds the "iter" part to the name, which is more meaningful. As
the timestamp returned by ring_buffer_read() was never used, there is no
reason for the new version to return it, so it also becomes a void
function.
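
For callers, the conversion looks like this (a sketch based on the diff
below):

	/* before: read the event at the current location and advance */
	event = ring_buffer_read(buf_iter, NULL);

	/* after: peek only if the event is actually needed, then advance */
	event = ring_buffer_iter_peek(buf_iter, &ts);
	ring_buffer_iter_advance(buf_iter);

The in-tree callers only wanted the side effect of moving the iterator,
so they now simply call ring_buffer_iter_advance().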

Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 include/linux/ring_buffer.h          |  3 +--
 kernel/trace/ring_buffer.c           | 23 ++++++-----------------
 kernel/trace/trace.c                 |  4 ++--
 kernel/trace/trace_functions_graph.c |  2 +-
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index df0124eabece..0ae603b79b0e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -135,8 +135,7 @@ void ring_buffer_read_finish(struct ring_buffer_iter *iter);
 
 struct ring_buffer_event *
 ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
-struct ring_buffer_event *
-ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts);
+void ring_buffer_iter_advance(struct ring_buffer_iter *iter);
 void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
 int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1718520a2809..f57eeaa80e3e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -4318,35 +4318,24 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter)
 EXPORT_SYMBOL_GPL(ring_buffer_read_finish);
 
 /**
- * ring_buffer_read - read the next item in the ring buffer by the iterator
+ * ring_buffer_iter_advance - advance the iterator to the next location
  * @iter: The ring buffer iterator
- * @ts: The time stamp of the event read.
  *
- * This reads the next event in the ring buffer and increments the iterator.
+ * Move the location of the iterator such that the next read will
+ * be the next location of the iterator.
  */
-struct ring_buffer_event *
-ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts)
+void ring_buffer_iter_advance(struct ring_buffer_iter *iter)
 {
-	struct ring_buffer_event *event;
 	struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
- again:
-	event = rb_iter_peek(iter, ts);
-	if (!event)
-		goto out;
-
-	if (event->type_len == RINGBUF_TYPE_PADDING)
-		goto again;
 
 	rb_advance_iter(iter);
- out:
-	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-	return event;
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 }
-EXPORT_SYMBOL_GPL(ring_buffer_read);
+EXPORT_SYMBOL_GPL(ring_buffer_iter_advance);
 
 /**
  * ring_buffer_size - return the size of the ring buffer (in bytes)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 819e31d0d66c..47889123be7f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3378,7 +3378,7 @@ static void trace_iterator_increment(struct trace_iterator *iter)
 
 	iter->idx++;
 	if (buf_iter)
-		ring_buffer_read(buf_iter, NULL);
+		ring_buffer_iter_advance(buf_iter);
 }
 
 static struct trace_entry *
@@ -3562,7 +3562,7 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
 		if (ts >= iter->array_buffer->time_start)
 			break;
 		entries++;
-		ring_buffer_read(buf_iter, NULL);
+		ring_buffer_iter_advance(buf_iter);
 	}
 
 	per_cpu_ptr(iter->array_buffer->data, cpu)->skipped_entries = entries;
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 7d71546ba00a..4a9c49c08ec9 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -482,7 +482,7 @@ get_return_for_leaf(struct trace_iterator *iter,
 
 	/* this is a leaf, now advance the iterator */
 	if (ring_iter)
-		ring_buffer_read(ring_iter, NULL);
+		ring_buffer_iter_advance(ring_iter);
 
 	return next;
 }
-- 
2.25.1




* [PATCH 05/12 v2] ring-buffer: Add page_stamp to iterator for synchronization
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (3 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 04/12 v2] ring-buffer: Rename ring_buffer_read() to ring_buffer_iter_advance() Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 06/12 v2] ring-buffer: Have rb_iter_head_event() handle concurrent writer Steven Rostedt
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Have the ring_buffer_iter structure contain a page_stamp, such that it can
be used to see if the writer entered the page the iterator is on. When going
to a new page, the iterator will record the time stamp of that page. When
reading events, it can copy the event to an internal buffer on the iterator
(to be implemented later), then compare the page's time stamp with its own
to see if the writer entered the page. If so, it will need to try to read
the event again.
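
The intended use, implemented by a later patch in this series, is roughly
the following sketch:

	/* recorded whenever the iterator moves onto a page */
	iter->page_stamp = iter->head_page->page->time_stamp;

	/* ... and later, after copying an event out of that page ... */
	smp_rmb();
	if (iter->page_stamp != iter->head_page->page->time_stamp)
		goto reset;	/* a writer entered the page; retry */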

Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f57eeaa80e3e..e689bdcb53e8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -507,6 +507,7 @@ struct ring_buffer_iter {
 	struct buffer_page		*cache_reader_page;
 	unsigned long			cache_read;
 	u64				read_stamp;
+	u64				page_stamp;
 };
 
 /**
@@ -1959,7 +1960,7 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
 	else
 		rb_inc_page(cpu_buffer, &iter->head_page);
 
-	iter->read_stamp = iter->head_page->page->time_stamp;
+	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
 	iter->head = 0;
 }
 
@@ -3551,10 +3552,13 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
 	iter->cache_reader_page = iter->head_page;
 	iter->cache_read = cpu_buffer->read;
 
-	if (iter->head)
+	if (iter->head) {
 		iter->read_stamp = cpu_buffer->read_stamp;
-	else
+		iter->page_stamp = cpu_buffer->reader_page->page->time_stamp;
+	} else {
 		iter->read_stamp = iter->head_page->page->time_stamp;
+		iter->page_stamp = iter->read_stamp;
+	}
 }
 
 /**
-- 
2.25.1




* [PATCH 06/12 v2] ring-buffer: Have rb_iter_head_event() handle concurrent writer
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (4 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 05/12 v2] ring-buffer: Add page_stamp to iterator for synchronization Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 07/12 v2] ring-buffer: Do not die if rb_iter_peek() fails more than thrice Steven Rostedt
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Give the ring_buffer_iter structure a place to store an event, such that
it cannot be overwritten by a writer, and have rb_iter_head_event() load
it in such a way that it will return NULL and reset the iter to the start
of the current page if a writer updated the page.

Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 106 ++++++++++++++++++++++++++-----------
 1 file changed, 75 insertions(+), 31 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e689bdcb53e8..3d718add73c1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -503,11 +503,13 @@ struct trace_buffer {
 struct ring_buffer_iter {
 	struct ring_buffer_per_cpu	*cpu_buffer;
 	unsigned long			head;
+	unsigned long			next_event;
 	struct buffer_page		*head_page;
 	struct buffer_page		*cache_reader_page;
 	unsigned long			cache_read;
 	u64				read_stamp;
 	u64				page_stamp;
+	struct ring_buffer_event	*event;
 };
 
 /**
@@ -1914,15 +1916,59 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer)
 			       cpu_buffer->reader_page->read);
 }
 
-static __always_inline struct ring_buffer_event *
-rb_iter_head_event(struct ring_buffer_iter *iter)
+static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
 {
-	return __rb_page_index(iter->head_page, iter->head);
+	return local_read(&bpage->page->commit);
 }
 
-static __always_inline unsigned rb_page_commit(struct buffer_page *bpage)
+static struct ring_buffer_event *
+rb_iter_head_event(struct ring_buffer_iter *iter)
 {
-	return local_read(&bpage->page->commit);
+	struct ring_buffer_event *event;
+	struct buffer_page *iter_head_page = iter->head_page;
+	unsigned long commit;
+	unsigned length;
+
+	/*
+	 * When the writer goes across pages, it issues a cmpxchg which
+	 * is a mb(), which will synchronize with the rmb here.
+	 * (see rb_tail_page_update() and __rb_reserve_next())
+	 */
+	commit = rb_page_commit(iter_head_page);
+	smp_rmb();
+	event = __rb_page_index(iter_head_page, iter->head);
+	length = rb_event_length(event);
+
+	/*
+	 * READ_ONCE() doesn't work on functions and we don't want the
+	 * compiler doing any crazy optimizations with length.
+	 */
+	barrier();
+
+	if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE)
+		/* Writer corrupted the read? */
+		goto reset;
+
+	memcpy(iter->event, event, length);
+	/*
+	 * If the page stamp is still the same after this rmb() then the
+	 * event was safely copied without the writer entering the page.
+	 */
+	smp_rmb();
+
+	/* Make sure the page didn't change since we read this */
+	if (iter->page_stamp != iter_head_page->page->time_stamp ||
+	    commit > rb_page_commit(iter_head_page))
+		goto reset;
+
+	iter->next_event = iter->head + length;
+	return iter->event;
+ reset:
+	/* Reset to the beginning */
+	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
+	iter->head = 0;
+	iter->next_event = 0;
+	return NULL;
 }
 
 /* Size is determined by what has been committed */
@@ -1962,6 +2008,7 @@ static void rb_inc_iter(struct ring_buffer_iter *iter)
 
 	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
 	iter->head = 0;
+	iter->next_event = 0;
 }
 
 /*
@@ -3548,6 +3595,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
 	/* Iterator usage is expected to have record disabled */
 	iter->head_page = cpu_buffer->reader_page;
 	iter->head = cpu_buffer->reader_page->read;
+	iter->next_event = iter->head;
 
 	iter->cache_reader_page = iter->head_page;
 	iter->cache_read = cpu_buffer->read;
@@ -3625,7 +3673,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
 		return 0;
 
 	/* Still racy, as it may return a false positive, but that's OK */
-	return ((iter->head_page == commit_page && iter->head == commit) ||
+	return ((iter->head_page == commit_page && iter->head >= commit) ||
 		(iter->head_page == reader && commit_page == head_page &&
 		 head_page->read == commit &&
 		 iter->head == rb_page_commit(cpu_buffer->reader_page)));
@@ -3853,15 +3901,22 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
 static void rb_advance_iter(struct ring_buffer_iter *iter)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
-	struct ring_buffer_event *event;
-	unsigned length;
 
 	cpu_buffer = iter->cpu_buffer;
 
+	/* If head == next_event then we need to jump to the next event */
+	if (iter->head == iter->next_event) {
+		/* If the event gets overwritten again, there's nothing to do */
+		if (rb_iter_head_event(iter) == NULL)
+			return;
+	}
+
+	iter->head = iter->next_event;
+
 	/*
 	 * Check if we are at the end of the buffer.
 	 */
-	if (iter->head >= rb_page_size(iter->head_page)) {
+	if (iter->next_event >= rb_page_size(iter->head_page)) {
 		/* discarded commits can make the page empty */
 		if (iter->head_page == cpu_buffer->commit_page)
 			return;
@@ -3869,27 +3924,7 @@ static void rb_advance_iter(struct ring_buffer_iter *iter)
 		return;
 	}
 
-	event = rb_iter_head_event(iter);
-
-	length = rb_event_length(event);
-
-	/*
-	 * This should not be called to advance the header if we are
-	 * at the tail of the buffer.
-	 */
-	if (RB_WARN_ON(cpu_buffer,
-		       (iter->head_page == cpu_buffer->commit_page) &&
-		       (iter->head + length > rb_commit_index(cpu_buffer))))
-		return;
-
-	rb_update_iter_read_stamp(iter, event);
-
-	iter->head += length;
-
-	/* check for end of page padding */
-	if ((iter->head >= rb_page_size(iter->head_page)) &&
-	    (iter->head_page != cpu_buffer->commit_page))
-		rb_inc_iter(iter);
+	rb_update_iter_read_stamp(iter, iter->event);
 }
 
 static int rb_lost_events(struct ring_buffer_per_cpu *cpu_buffer)
@@ -4017,6 +4052,8 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	}
 
 	event = rb_iter_head_event(iter);
+	if (!event)
+		goto again;
 
 	switch (event->type_len) {
 	case RINGBUF_TYPE_PADDING:
@@ -4233,10 +4270,16 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
 	if (!cpumask_test_cpu(cpu, buffer->cpumask))
 		return NULL;
 
-	iter = kmalloc(sizeof(*iter), flags);
+	iter = kzalloc(sizeof(*iter), flags);
 	if (!iter)
 		return NULL;
 
+	iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags);
+	if (!iter->event) {
+		kfree(iter);
+		return NULL;
+	}
+
 	cpu_buffer = buffer->buffers[cpu];
 
 	iter->cpu_buffer = cpu_buffer;
@@ -4317,6 +4360,7 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter)
 
 	atomic_dec(&cpu_buffer->record_disabled);
 	atomic_dec(&cpu_buffer->buffer->resize_disabled);
+	kfree(iter->event);
 	kfree(iter);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_read_finish);
-- 
2.25.1




* [PATCH 07/12 v2] ring-buffer: Do not die if rb_iter_peek() fails more than thrice
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (5 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 06/12 v2] ring-buffer: Have rb_iter_head_event() handle concurrent writer Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 08/12 v2] ring-buffer: Optimize rb_iter_head_event() Steven Rostedt
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

As the iterator will be reading a live buffer, if the event being read
is on a page that a writer crosses into, the read will fail and be tried
again. This means the condition in rb_iter_peek() that only allows a retry
to happen three times is no longer valid. Allow rb_iter_peek() to retry
more than three times without killing the ring buffer, but only if
rb_iter_head_event() has failed at least once.

Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3d718add73c1..475338fda969 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -4012,6 +4012,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	struct ring_buffer_per_cpu *cpu_buffer;
 	struct ring_buffer_event *event;
 	int nr_loops = 0;
+	bool failed = false;
 
 	if (ts)
 		*ts = 0;
@@ -4038,10 +4039,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	 * to a data event, we should never loop more than three times.
 	 * Once for going to next page, once on time extend, and
 	 * finally once to get the event.
-	 * (We never hit the following condition more than thrice).
+	 * We should never hit the following condition more than thrice,
+	 * unless the buffer is very small, and there's a writer
+	 * that is causing the reader to fail getting an event.
 	 */
-	if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3))
+	if (++nr_loops > 3) {
+		RB_WARN_ON(cpu_buffer, !failed);
 		return NULL;
+	}
 
 	if (rb_per_cpu_empty(cpu_buffer))
 		return NULL;
@@ -4052,8 +4057,10 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	}
 
 	event = rb_iter_head_event(iter);
-	if (!event)
+	if (!event) {
+		failed = true;
 		goto again;
+	}
 
 	switch (event->type_len) {
 	case RINGBUF_TYPE_PADDING:
-- 
2.25.1




* [PATCH 08/12 v2] ring-buffer: Optimize rb_iter_head_event()
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (6 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 07/12 v2] ring-buffer: Do not die if rb_iter_peek() fails more than thrice Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 09/12 v2] ring-buffer: Do not disable recording when there is an iterator Steven Rostedt
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

As it is fine to perform several "peeks" of event data in the ring buffer
via the iterator before moving it forward, do not re-read the event; just
return what was read before. Otherwise, re-reading can cause inconsistent
results, especially when testing multiple CPU buffers to interleave them.

Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 475338fda969..5979327254f9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1929,6 +1929,9 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	unsigned long commit;
 	unsigned length;
 
+	if (iter->head != iter->next_event)
+		return iter->event;
+
 	/*
 	 * When the writer goes across pages, it issues a cmpxchg which
 	 * is a mb(), which will synchronize with the rmb here.
-- 
2.25.1




* [PATCH 09/12 v2] ring-buffer: Do not disable recording when there is an iterator
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (7 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 08/12 v2] ring-buffer: Optimize rb_iter_head_event() Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 10/12 v2] tracing: Do not disable tracing when reading the trace file Steven Rostedt
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Now that the iterator can handle a concurrent writer, do not disable writing
to the ring buffer when there is an iterator present.

Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 5979327254f9..8bafba674ec0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -4295,7 +4295,6 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags)
 	iter->cpu_buffer = cpu_buffer;
 
 	atomic_inc(&buffer->resize_disabled);
-	atomic_inc(&cpu_buffer->record_disabled);
 
 	return iter;
 }
@@ -4368,7 +4367,6 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter)
 	rb_check_pages(cpu_buffer);
 	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-	atomic_dec(&cpu_buffer->record_disabled);
 	atomic_dec(&cpu_buffer->buffer->resize_disabled);
 	kfree(iter->event);
 	kfree(iter);
-- 
2.25.1




* [PATCH 10/12 v2] tracing: Do not disable tracing when reading the trace file
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (8 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 09/12 v2] ring-buffer: Do not disable recording when there is an iterator Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 11/12 v2] ring-buffer/tracing: Have iterator acknowledge dropped events Steven Rostedt
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

When opening the "trace" file, it is no longer necessary to disable tracing.

Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 47889123be7f..b7052ffb0211 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4273,10 +4273,6 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 	if (trace_clocks[tr->clock_id].in_ns)
 		iter->iter_flags |= TRACE_FILE_TIME_IN_NS;
 
-	/* stop the trace while dumping if we are not opening "snapshot" */
-	if (!iter->snapshot)
-		tracing_stop_tr(tr);
-
 	if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
 		for_each_tracing_cpu(cpu) {
 			iter->buffer_iter[cpu] =
@@ -4371,10 +4367,6 @@ static int tracing_release(struct inode *inode, struct file *file)
 	if (iter->trace && iter->trace->close)
 		iter->trace->close(iter);
 
-	if (!iter->snapshot)
-		/* reenable tracing if it was previously enabled */
-		tracing_start_tr(tr);
-
 	__trace_array_put(tr);
 
 	mutex_unlock(&trace_types_lock);
-- 
2.25.1




* [PATCH 11/12 v2] ring-buffer/tracing: Have iterator acknowledge dropped events
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (9 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 10/12 v2] tracing: Do not disable tracing when reading the trace file Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-19 23:22 ` [PATCH 12/12 v2] tracing: Have the document reflect that the trace file keeps tracing enabled Steven Rostedt
  2020-03-21 19:13 ` [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file David Laight
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Have the ring buffer iterator set a flag if events were dropped while it
went to peek at the next event. Have the trace file display this fact, if
it happened, with a "LOST EVENTS" message.

Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 include/linux/ring_buffer.h |  1 +
 kernel/trace/ring_buffer.c  | 16 ++++++++++++++++
 kernel/trace/trace.c        | 16 ++++++++++++----
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 0ae603b79b0e..c76b2f3b3ac4 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -138,6 +138,7 @@ ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts);
 void ring_buffer_iter_advance(struct ring_buffer_iter *iter);
 void ring_buffer_iter_reset(struct ring_buffer_iter *iter);
 int ring_buffer_iter_empty(struct ring_buffer_iter *iter);
+bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter);
 
 unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu);
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8bafba674ec0..87bbb519505f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -510,6 +510,7 @@ struct ring_buffer_iter {
 	u64				read_stamp;
 	u64				page_stamp;
 	struct ring_buffer_event	*event;
+	int				missed_events;
 };
 
 /**
@@ -1971,6 +1972,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 	iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp;
 	iter->head = 0;
 	iter->next_event = 0;
+	iter->missed_events = 1;
 	return NULL;
 }
 
@@ -4174,6 +4176,20 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts,
 	return event;
 }
 
+/** ring_buffer_iter_dropped - report if there are dropped events
+ * @iter: The ring buffer iterator
+ *
+ * Returns true if there was dropped events since the last peek.
+ */
+bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter)
+{
+	bool ret = iter->missed_events != 0;
+
+	iter->missed_events = 0;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped);
+
 /**
  * ring_buffer_iter_peek - peek at the next event to be read
  * @iter: The ring buffer iterator
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b7052ffb0211..ee0ff8175c03 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3388,11 +3388,15 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts,
 	struct ring_buffer_event *event;
 	struct ring_buffer_iter *buf_iter = trace_buffer_iter(iter, cpu);
 
-	if (buf_iter)
+	if (buf_iter) {
 		event = ring_buffer_iter_peek(buf_iter, ts);
-	else
+		if (lost_events)
+			*lost_events = ring_buffer_iter_dropped(buf_iter) ?
+				(unsigned long)-1 : 0;
+	} else {
 		event = ring_buffer_peek(iter->array_buffer->buffer, cpu, ts,
 					 lost_events);
+	}
 
 	if (event) {
 		iter->ent_size = ring_buffer_event_length(event);
@@ -4005,8 +4009,12 @@ enum print_line_t print_trace_line(struct trace_iterator *iter)
 	enum print_line_t ret;
 
 	if (iter->lost_events) {
-		trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",
-				 iter->cpu, iter->lost_events);
+		if (iter->lost_events == (unsigned long)-1)
+			trace_seq_printf(&iter->seq, "CPU:%d [LOST EVENTS]\n",
+					 iter->cpu);
+		else
+			trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",
+					 iter->cpu, iter->lost_events);
 		if (trace_seq_has_overflowed(&iter->seq))
 			return TRACE_TYPE_PARTIAL_LINE;
 	}
-- 
2.25.1




* [PATCH 12/12 v2] tracing: Have the document reflect that the trace file keeps tracing enabled
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (10 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 11/12 v2] ring-buffer/tracing: Have iterator acknowledge dropped events Steven Rostedt
@ 2020-03-19 23:22 ` Steven Rostedt
  2020-03-21 19:13 ` [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file David Laight
  12 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-19 23:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>

Now that reading the trace file does not temporarily stop tracing while it
is open, update the document to reflect this fact.

Link: http://lkml.kernel.org/r/20200317213417.209675068@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.rst | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 99a0890e20ec..86e76f2a40dc 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -125,10 +125,13 @@ of ftrace. Here is a list of some of the key files:
   trace:
 
 	This file holds the output of the trace in a human
-	readable format (described below). Note, tracing is temporarily
-	disabled when the file is open for reading. Once all readers
-	are closed, tracing is re-enabled. Opening this file for
+	readable format (described below). Opening this file for
 	writing with the O_TRUNC flag clears the ring buffer content.
+        Note, this file is not a consumer. If tracing is off
+        (no tracer running, or tracing_on is zero), it will produce
+        the same output each time it is read. When tracing is on,
+        it may produce inconsistent results as it tries to read
+        the entire buffer without consuming it.
 
   trace_pipe:
 
@@ -142,9 +145,7 @@ of ftrace. Here is a list of some of the key files:
 	will not be read again with a sequential read. The
 	"trace" file is static, and if the tracer is not
 	adding more data, it will display the same
-	information every time it is read. Unlike the
-	"trace" file, opening this file for reading will not
-	temporarily disable tracing.
+	information every time it is read.
 
   trace_options:
 
-- 
2.25.1




* Re: [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry
  2020-03-19 23:22 ` [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry Steven Rostedt
@ 2020-03-20  2:57   ` Masami Hiramatsu
       [not found]   ` <20200326091256.GR11705@shao2-debian>
  1 sibling, 0 replies; 27+ messages in thread
From: Masami Hiramatsu @ 2020-03-20  2:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

On Thu, 19 Mar 2020 19:22:21 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> 
> In order to have the iterator read the buffer even when it's still updating,
> it requires that the ring buffer iterator saves each event in a separate
> location outside the ring buffer such that its use is immutable.
> 
> There's one use case that saves off the event returned from the ring buffer
> iterator and calls it again to look at the next event, before going back to
> use the first event. As the ring buffer iterator will only have a single
> copy, this use case will no longer be supported.
> 
> Instead, have the one use case create its own buffer to store the first
> event when looking at the next event. This way, when looking at the first
> event again, it won't be corrupted by the second read.

OK, this looks good to me.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you,



-- 
Masami Hiramatsu <mhiramat@kernel.org>


* RE: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
                   ` (11 preceding siblings ...)
  2020-03-19 23:22 ` [PATCH 12/12 v2] tracing: Have the document reflect that the trace file keeps tracing enabled Steven Rostedt
@ 2020-03-21 19:13 ` David Laight
  2020-03-22 18:07   ` Steven Rostedt
  2020-03-27  1:46   ` Steven Rostedt
  12 siblings, 2 replies; 27+ messages in thread
From: David Laight @ 2020-03-21 19:13 UTC (permalink / raw)
  To: 'Steven Rostedt', linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Peter Wu, Jonathan Corbet, Tom Zanussi,
	Shuah Khan, bpf

From: Steven Rostedt
> Sent: 19 March 2020 23:22
...
> 
> This patch series attempts to satisfy that request, by creating a
> temporary buffer in each of the per cpu iterators to place the
> read event into, such that it can be passed to users without worrying
> about a writer to corrupt the event while it was being written out.
> It also uses the fact that the ring buffer is broken up into pages,
> where each page has its own timestamp that gets updated when a
> writer crosses over to it. By copying it to the temp buffer, and
> doing a "before and after" test of the time stamp with memory barriers,
> can allow the events to be saved.

Does this mean that you will no longer be able to look at a snapshot
of the trace by running 'less trace' (and typically going to the end
to get info for all cpus)?

A lot of the time trace is being written far too fast for it to make
any sense to try to read it continuously.

Also, if BPF start using ftrace, no one will be able to use it for
'normal debugging' on such systems.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-21 19:13 ` [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file David Laight
@ 2020-03-22 18:07   ` Steven Rostedt
  2020-03-27  1:46   ` Steven Rostedt
  1 sibling, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-03-22 18:07 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

On Sat, 21 Mar 2020 19:13:51 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 19 March 2020 23:22  
> ...
> > 
> > This patch series attempts to satisfy that request, by creating a
> > temporary buffer in each of the per cpu iterators to place the
> > read event into, such that it can be passed to users without worrying
> > about a writer to corrupt the event while it was being written out.
> > It also uses the fact that the ring buffer is broken up into pages,
> > where each page has its own timestamp that gets updated when a
> > writer crosses over to it. By copying it to the temp buffer, and
> > doing a "before and after" test of the time stamp with memory barriers,
> > can allow the events to be saved.  
> 
> Does this mean that you will no longer be able to look at a snapshot
> of the trace by running 'less trace' (and typically going to the end
> to get info for all cpus)?

If there's a use case for this, it will be trivial to add an option to
bring back the old behavior. If you want that, I can do that, and even add
a config that makes it the default.

> 
> A lot of the time the trace is being written far too fast for it to
> make any sense to try to read it continuously.
> 
> Also, if BPF starts using ftrace, no one will be able to use it for
> 'normal debugging' on such systems.

I believe it's used for debugging BPF, not for normal tracing. BPF only
uses this when its trace_printk() is in use, which gives that nasty
"THIS IS A DEBUG KERNEL" message ;-)  Thus, I don't think you need to
worry about BPF having this in production.

-- Steve

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-21 19:13 ` [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file David Laight
  2020-03-22 18:07   ` Steven Rostedt
@ 2020-03-27  1:46   ` Steven Rostedt
  2020-03-27 10:07     ` David Laight
  1 sibling, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2020-03-27  1:46 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

On Sat, 21 Mar 2020 19:13:51 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 19 March 2020 23:22  
> ...
> > 
> > This patch series attempts to satisfy that request, by creating a
> > temporary buffer in each of the per cpu iterators to place the
> > read event into, such that it can be passed to users without worrying
> > about a writer to corrupt the event while it was being written out.
> > It also uses the fact that the ring buffer is broken up into pages,
> > where each page has its own timestamp that gets updated when a
> > writer crosses over to it. By copying it to the temp buffer, and
> > doing a "before and after" test of the time stamp with memory barriers,
> > can allow the events to be saved.  
> 
> Does this mean that you will no longer be able to look at a snapshot
> of the trace by running 'less trace' (and typically going to the end
> to get info for all cpus)?

I changed patch 9 to be this:

It adds an option "pause-on-trace" that, when set, will bring back the
old behavior of pausing recording to the ring buffer while the trace
file is open.

If needed, I can add a kernel command line option and a Kconfig option
that makes this the default.
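
For reference, with this applied the option can be toggled at run time
like any other trace option (assuming the usual tracefs mount point at
/sys/kernel/tracing; adjust the path if tracefs is mounted elsewhere):

	# cat /sys/kernel/tracing/trace        # tracing keeps running
	# echo 1 > /sys/kernel/tracing/options/pause-on-trace
	# cat /sys/kernel/tracing/trace        # old behavior: recording pauses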

-- Steve

From 71f44d604e5b16cc239d6276b447a515448f582f Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Date: Tue, 17 Mar 2020 17:32:31 -0400
Subject: [PATCH] tracing: Do not disable tracing when reading the trace file

When opening the "trace" file, it is no longer necessary to disable tracing.

Note, a new option called "pause-on-trace" is created; when set, it will
cause the trace file to emulate its original behavior.

Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 Documentation/trace/ftrace.rst | 6 ++++++
 kernel/trace/trace.c           | 9 ++++++---
 kernel/trace/trace.h           | 1 +
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 99a0890e20ec..c33950a35d65 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1125,6 +1125,12 @@ Here are the available options:
 	the trace displays additional information about the
 	latency, as described in "Latency trace format".
 
+  pause-on-trace
+	When set, opening the trace file for read will pause
+	writing to the ring buffer (as if tracing_on were set to zero).
+	This simulates the original behavior of the trace file.
+	When the file is closed, tracing will be enabled again.
+
   record-cmd
 	When any event or tracer is enabled, a hook is enabled
 	in the sched_switch trace point to fill comm cache
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 47889123be7f..650fa81fffe8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4273,8 +4273,11 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
 	if (trace_clocks[tr->clock_id].in_ns)
 		iter->iter_flags |= TRACE_FILE_TIME_IN_NS;
 
-	/* stop the trace while dumping if we are not opening "snapshot" */
-	if (!iter->snapshot)
+	/*
+	 * If pause-on-trace is enabled, then stop the trace while
+	 * dumping, unless this is the "snapshot" file
+	 */
+	if (!iter->snapshot && (tr->trace_flags & TRACE_ITER_PAUSE_ON_TRACE))
 		tracing_stop_tr(tr);
 
 	if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
@@ -4371,7 +4374,7 @@ static int tracing_release(struct inode *inode, struct file *file)
 	if (iter->trace && iter->trace->close)
 		iter->trace->close(iter);
 
-	if (!iter->snapshot)
+	if (!iter->snapshot && tr->stop_count)
 		/* reenable tracing if it was previously enabled */
 		tracing_start_tr(tr);
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c61e1b1c85a6..f37e05135986 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1302,6 +1302,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(IRQ_INFO,		"irq-info"),		\
 		C(MARKERS,		"markers"),		\
 		C(EVENT_FORK,		"event-fork"),		\
+		C(PAUSE_ON_TRACE,	"pause-on-trace"),	\
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* RE: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-27  1:46   ` Steven Rostedt
@ 2020-03-27 10:07     ` David Laight
  2020-03-27 14:30       ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: David Laight @ 2020-03-27 10:07 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

From: Steven Rostedt
> Sent: 27 March 2020 01:46
> On Sat, 21 Mar 2020 19:13:51 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > From: Steven Rostedt
> > > Sent: 19 March 2020 23:22
> > ...
> > >
> > > This patch series attempts to satisfy that request, by creating a
> > > temporary buffer in each of the per cpu iterators to place the
> > > read event into, such that it can be passed to users without worrying
> > > about a writer to corrupt the event while it was being written out.
> > > It also uses the fact that the ring buffer is broken up into pages,
> > > where each page has its own timestamp that gets updated when a
> > > writer crosses over to it. By copying it to the temp buffer, and
> > > doing a "before and after" test of the time stamp with memory barriers,
> > > can allow the events to be saved.
> >
> > Does this mean the you will no longer be able to look at a snapshot
> > of the trace by running 'less trace' (and typically going to the end
> > to get info for all cpus).
> 
> I changed patch 9 to be this:
> 
> It adds an option "pause-on-trace" that, when set, will bring back the
> old behavior of pausing recording to the ring buffer while the trace
> file is open.
> 
> If needed, I can add a kernel command line option and a Kconfig option
> that makes this the default.

Maybe a different file, 'trace_no_pause'?
Along with the one that lets you read the raw trace and get EOF.

	David



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-27 10:07     ` David Laight
@ 2020-03-27 14:30       ` Steven Rostedt
  2020-03-27 14:56         ` David Laight
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2020-03-27 14:30 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

On Fri, 27 Mar 2020 10:07:00 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > If needed, I can add a kernel command line option and a Kconfig option
> > that makes this the default.
> 
> Maybe a different file 'trace_no_pause' ?

I'd rather not add another file; it adds more complexity and confuses
the interface even more. I'll leave this as is.

> Along with the one that lets you read the raw trace and get EOF.

Can you explain this more? I think we talked about this before, but I don't
remember the details.

-- Steve

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file
  2020-03-27 14:30       ` Steven Rostedt
@ 2020-03-27 14:56         ` David Laight
  0 siblings, 0 replies; 27+ messages in thread
From: David Laight @ 2020-03-27 14:56 UTC (permalink / raw)
  To: 'Steven Rostedt'
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Masami Hiramatsu, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf

From: Steven Rostedt
> Sent: 27 March 2020 14:31
..
> > Along with the one that lets you read the raw trace and get EOF.
> 
> Can you explain this more? I think we talked about this before, but I don't
> remember the details.

I was trying to use schedviz (on GitHub, from Google).
It reads out the raw trace (for some scheduler events)
and then postprocesses it to generate nice pictures.
However, the shell script it uses for the captures has to run the
copies in the background, sleep a random time, and then kill all
the copies, hoping it has waited long enough - truly horrid.

There is also some confusion (IIRC between a header and your library
code) about the 'time-delta' on 'pad' entries.

	David



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
       [not found]   ` <20200326091256.GR11705@shao2-debian>
@ 2020-04-01 14:07     ` Masami Hiramatsu
  2020-04-01 14:21       ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2020-04-01 14:07 UTC (permalink / raw)
  To: kernel test robot
  Cc: Steven Rostedt, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Masami Hiramatsu, Alexei Starovoitov, Peter Wu,
	Jonathan Corbet, Tom Zanussi, Shuah Khan, bpf, lkp

Hi Steve,

On Thu, 26 Mar 2020 17:12:56 +0800
kernel test robot <rong.a.chen@intel.com> wrote:

> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: cd8f62b481530fafbeacee0d3283b60bcf42d854 ("[PATCH 02/12 v2] tracing: Save off entry when peeking at next entry")
> url: https://github.com/0day-ci/linux/commits/Steven-Rostedt/ring-buffer-tracing-Remove-disabling-of-ring-buffer-while-reading-trace-file/20200320-073240
> 

Hmm, it seems that we cannot call kmalloc() in ftrace_dump().
Maybe we can fix it by either:
- Using GFP_ATOMIC (see the sketch below), or
- Not using iter->temp if it is NULL (this is safe since ftrace_dump()
  stops tracing).
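
For illustration, the GFP_ATOMIC variant would be a one-line change at
the allocation site in trace_find_next_entry(). A rough sketch, untested:

	kfree(iter->temp);
	iter->temp = kmalloc(iter->ent_size, GFP_ATOMIC);
	if (!iter->temp)
		return NULL;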

What do you think?

Thank you,

> 
> in testcase: rcuperf
> with following parameters:
> 
> 	runtime: 300s
> 	perf_type: rcu
> 
> 
> 
> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> 
> 
> +----------------------------------------------------------------+------------+------------+
> |                                                                | 2e2bf17ca0 | cd8f62b481 |
> +----------------------------------------------------------------+------------+------------+
> | boot_successes                                                 | 13         | 0          |
> | boot_failures                                                  | 0          | 8          |
> | BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h | 0          | 8          |
> +----------------------------------------------------------------+------------+------------+
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> 
> 
> [   16.821280] BUG: sleeping function called from invalid context at mm/slab.h:565
> [   16.822211] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 158, name: rcu_perf_writer
> [   16.823164] no locks held by rcu_perf_writer/158.
> [   16.823650] CPU: 0 PID: 158 Comm: rcu_perf_writer Not tainted 5.6.0-rc6-00081-gcd8f62b481530f #1
> [   16.824856] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   16.826220] Call Trace:
> [   16.826667]  dump_stack+0x16/0x18
> [   16.827265]  ___might_sleep+0x104/0x118
> [   16.827947]  __might_sleep+0x69/0x70
> [   16.828600]  ? __fs_reclaim_release+0x17/0x19
> [   16.829374]  slab_pre_alloc_hook+0x34/0x6e
> [   16.830107]  __kmalloc+0x48/0xe8
> [   16.830710]  ? rb_reader_unlock+0x2b/0x3c
> [   16.831205]  ? trace_find_next_entry+0x84/0x133
> [   16.831723]  trace_find_next_entry+0x84/0x133
> [   16.832217]  trace_print_lat_context+0x4b/0x437
> [   16.832730]  ? rb_event_length+0x56/0x67
> [   16.833186]  ? ring_buffer_event_length+0x34/0x79
> [   16.833728]  ? __find_next_entry+0xd3/0x1c2
> [   16.834190]  print_trace_line+0x69d/0x767
> [   16.834666]  ftrace_dump+0x285/0x437
> [   16.835055]  rcu_perf_writer+0x25f/0x34e
> [   16.835476]  kthread+0xdf/0xe1
> [   16.835823]  ? rcu_perf_reader+0x9c/0x9c
> [   16.836264]  ? kthread_create_worker_on_cpu+0x1c/0x1c
> [   16.836834]  ret_from_fork+0x2e/0x38
> [   16.837270] rb_produ-160     0.... 6528532us : ring_buffer_producer: Starting ring buffer hammer
> [   16.838318] rb_produ-160     0.... 16529141us : ring_buffer_producer: End ring buffer hammer
> [   16.839349] rb_produ-160     0.... 16558157us : ring_buffer_producer: Running Consumer at nice: 19
> [   16.840382] rb_produ-160     0.... 16558162us : ring_buffer_producer: Running Producer at nice: 19
> [   16.841401] rb_produ-160     0.... 16558163us : ring_buffer_producer: WARNING!!! This test is running at lowest priority.
> [   16.854781] rb_produ-160     0.... 16558164us : ring_buffer_producer: Time:     10000604 (usecs)
> [   16.857354] rb_produ-160     0.... 16558165us : ring_buffer_producer: Overruns: 6117960
> [   16.858322] rb_produ-160     0.... 16558166us : ring_buffer_producer: Read:     4356190  (by events)
> [   16.859404] rb_produ-160     0.... 16558167us : ring_buffer_producer: Entries:  44650
> [   16.860329] rb_produ-160     0.... 16558167us : ring_buffer_producer: Total:    10518800
> [   16.861226] rb_produ-160     0.... 16558168us : ring_buffer_producer: Missed:   0
> [   16.862124] rb_produ-160     0.... 16558169us : ring_buffer_producer: Hit:      10518800
> [   16.863087] rb_produ-160     0.... 16558170us : ring_buffer_producer: Entries per millisec: 1051
> [   16.864123] rb_produ-160     0.... 16558171us : ring_buffer_producer: 951 ns per entry
> [   16.865074] rb_produ-160     0.... 16558172us : ring_buffer_producer_thread: Sleeping for 10 secs
> [   16.866105] ---------------------------------
> [   16.866725] rcu-perf: Test complete
> [   16.878198] random: fast init done
> 
> Elapsed time: 60
> 
> qemu-img create -f qcow2 disk-vm-snb-i386-9-0 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-1 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-2 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-3 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-4 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-5 256G
> qemu-img create -f qcow2 disk-vm-snb-i386-9-6 256G
> 
> kvm=(
> 	qemu-system-i386
> 	-enable-kvm
> 	-cpu SandyBridge
> 	-kernel $kernel
> 	-initrd initrd-vm-snb-i386-9.cgz
> 	-m 8192
> 	-smp 2
> 	-device e1000,netdev=net0
> 	-netdev user,id=net0,hostfwd=tcp::32032-:22
> 	-boot order=nc
> 	-no-reboot
> 	-watchdog i6300esb
> 	-watchdog-action debug
> 	-rtc base=localtime
> 	-drive file=disk-vm-snb-i386-9-0,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-1,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-2,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-3,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-4,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-5,media=disk,if=virtio
> 	-drive file=disk-vm-snb-i386-9-6,media=disk,if=virtio
> 	-serial stdio
> 	-display none
> 	-monitor null
> )
> 
> append=(
> 	ip=::::vm-snb-i386-9::dhcp
> 	root=/dev/ram0
> 	user=lkp
> 	job=/job-script
> 	ARCH=i386
> 	kconfig=i386-randconfig-g003-20200324
> 	branch=linux-devel/devel-hourly-2020032505
> 	commit=cd8f62b481530fafbeacee0d3283b60bcf42d854
> 	BOOT_IMAGE=/pkg/linux/i386-randconfig-g003-20200324/gcc-7/cd8f62b481530fafbeacee0d3283b60bcf42d854/vmlinuz-5.6.0-rc6-00081-gcd8f62b481530f
> 	max_uptime=1500
> 	RESULT_ROOT=/result/rcuperf/rcu-300s/vm-snb-i386/yocto-i386-minimal-20190520.cgz/i386-randconfig-g003-20200324/gcc-7/cd8f62b481530fafbeacee0d3283b60bcf42d854/3
> 	result_service=tmpfs
> 	selinux=0
> 	debug
> 	apic=debug
> 	sysrq_always_enabled
> 	rcupdate.rcu_cpu_stall_timeout=100
> 	net.ifnames=0
> 
> 
> To reproduce:
> 
>         # build kernel
> 	cd linux
> 	cp config-5.6.0-rc6-00081-gcd8f62b481530f .config
> 	make HOSTCC=gcc-7 CC=gcc-7 ARCH=i386 olddefconfig prepare modules_prepare bzImage
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
> 
> 
> 
> Thanks,
> Rong Chen
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-01 14:07     ` [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h Masami Hiramatsu
@ 2020-04-01 14:21       ` Steven Rostedt
  2020-04-01 15:04         ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2020-04-01 14:21 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Wed, 1 Apr 2020 23:07:00 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> Hi Steve,
> 
> On Thu, 26 Mar 2020 17:12:56 +0800
> kernel test robot <rong.a.chen@intel.com> wrote:
> 
> > FYI, we noticed the following commit (built with gcc-7):
> > 
> > commit: cd8f62b481530fafbeacee0d3283b60bcf42d854 ("[PATCH 02/12 v2] tracing: Save off entry when peeking at next entry")
> > url: https://github.com/0day-ci/linux/commits/Steven-Rostedt/ring-buffer-tracing-Remove-disabling-of-ring-buffer-while-reading-trace-file/20200320-073240
> >   
> 
> Hmm, it seems that we cannot call kmalloc() in ftrace_dump().
> Maybe we can fix it by either:
> - Using GFP_ATOMIC, or
> - Not using iter->temp if it is NULL (this is safe since ftrace_dump()
>   stops tracing).
> 
> What do you think?
> 

Thanks for the reminder; I knew there was something I had to deal with
and forgot to mark this report!

I already looked at it, and yes, this is an issue, but on PREEMPT_RT even
GFP_ATOMIC will fail. As it's not critical to record it, we can just check
for being in atomic context and not bother with the allocation.

-- Steve

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6519b7afc499..7f1466253ca8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3487,6 +3487,14 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 	 */
 	if (iter->ent && iter->ent != iter->temp) {
 		if (!iter->temp || iter->temp_size < iter->ent_size) {
+			/*
+			 * This function is only used to add markers between
+			 * events that are far apart (see trace_print_lat_context()),
+			 * but if this is called in an atomic context (like NMIs)
+			 * we can't call kmalloc(), thus just return NULL.
+			 */
+			if (in_atomic() || irqs_disabled())
+				return NULL;
 			kfree(iter->temp);
 			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
 			if (!iter->temp)

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-01 14:21       ` Steven Rostedt
@ 2020-04-01 15:04         ` Steven Rostedt
  2020-04-02  7:19           ` Masami Hiramatsu
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2020-04-01 15:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Wed, 1 Apr 2020 10:21:12 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 6519b7afc499..7f1466253ca8 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3487,6 +3487,14 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
>  	 */
>  	if (iter->ent && iter->ent != iter->temp) {
>  		if (!iter->temp || iter->temp_size < iter->ent_size) {
> +			/*
> +			 * This function is only used to add markers between
> +			 * events that are far apart (see trace_print_lat_context()),
> +			 * but if this is called in an atomic context (like NMIs)
> +			 * we can't call kmalloc(), thus just return NULL.
> +			 */
> +			if (in_atomic() || irqs_disabled())
> +				return NULL;
>  			kfree(iter->temp);
>  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
>  			if (!iter->temp)

Peter informed me on IRC not to use in_atomic() as it doesn't catch
spin_locks when CONFIG_PREEMPT is not defined.

As the issue is just with ftrace_dump(), I'll have it use a static buffer
of 128 bytes instead. That should be big enough for most events, and if
not, it will just miss the markers.

-- Steve

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6519b7afc499..8c9d6a75abbf 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3472,6 +3472,8 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
 	return next;
 }
 
+#define IGNORE_TEMP		((struct trace_iterator *)-1L)
+
 /* Find the next real entry, without updating the iterator itself */
 struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 					  int *ent_cpu, u64 *ent_ts)
@@ -3480,6 +3482,17 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 	int ent_size = iter->ent_size;
 	struct trace_entry *entry;
 
+	/*
+	 * This function is only used to add markers between
+	 * events that are far apart (see trace_print_lat_context()),
+	 * but if this is called in an atomic context (like NMIs)
+	 * kmalloc() can't be called.
+	 * That happens via ftrace_dump() which will initialize
+	 * iter->temp to IGNORE_TEMP. In such a case, just return NULL.
+	 */
+	if (iter->temp == IGNORE_TEMP)
+		return NULL;
+
 	/*
 	 * The __find_next_entry() may call peek_next_entry(), which may
 	 * call ring_buffer_peek() that may make the contents of iter->ent
@@ -9203,6 +9216,8 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
 
 	/* Simulate the iterator */
 	trace_init_global_iter(&iter);
+	/* Force not using the temp buffer */
+	iter.temp = IGNORE_TEMP;
 
 	for_each_tracing_cpu(cpu) {
 		atomic_inc(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled);

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-01 15:04         ` Steven Rostedt
@ 2020-04-02  7:19           ` Masami Hiramatsu
  2020-04-02 18:14             ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2020-04-02  7:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Wed, 1 Apr 2020 11:04:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 1 Apr 2020 10:21:12 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index 6519b7afc499..7f1466253ca8 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -3487,6 +3487,14 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
> >  	 */
> >  	if (iter->ent && iter->ent != iter->temp) {
> >  		if (!iter->temp || iter->temp_size < iter->ent_size) {
> > +			/*
> > +			 * This function is only used to add markers between
> > +			 * events that are far apart (see trace_print_lat_context()),
> > +			 * but if this is called in an atomic context (like NMIs)
> > +			 * we can't call kmalloc(), thus just return NULL.
> > +			 */
> > +			if (in_atomic() || irqs_disabled())
> > +				return NULL;
> >  			kfree(iter->temp);
> >  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
> >  			if (!iter->temp)
> 
> Peter informed me on IRC not to use in_atomic() as it doesn't catch
> spin_locks when CONFIG_PREEMPT is not defined.
> 
> As the issue is just with ftrace_dump(), I'll have it use a static buffer
> of 128 bytes instead. That should be big enough for most events, and if
> not, it will just miss the markers.


That sounds good, but the patch below seems to do a different thing.
Does it just make trace_find_next_entry() always fail if it is
called from ftrace_dump()?

Thank you,

> 
> -- Steve
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 6519b7afc499..8c9d6a75abbf 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3472,6 +3472,8 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
>  	return next;
>  }
>  
> +#define IGNORE_TEMP		((struct trace_iterator *)-1L)
> +
>  /* Find the next real entry, without updating the iterator itself */
>  struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
>  					  int *ent_cpu, u64 *ent_ts)
> @@ -3480,6 +3482,17 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
>  	int ent_size = iter->ent_size;
>  	struct trace_entry *entry;
>  
> +	/*
> +	 * This function is only used to add markers between
> +	 * events that are far apart (see trace_print_lat_context()),
> +	 * but if this is called in an atomic context (like NMIs)
> +	 * kmalloc() can't be called.
> +	 * That happens via ftrace_dump() which will initialize
> +	 * iter->temp to IGNORE_TEMP. In such a case, just return NULL.
> +	 */
> +	if (iter->temp == IGNORE_TEMP)
> +		return NULL;
> +
>  	/*
>  	 * The __find_next_entry() may call peek_next_entry(), which may
>  	 * call ring_buffer_peek() that may make the contents of iter->ent
> @@ -9203,6 +9216,8 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
>  
>  	/* Simulate the iterator */
>  	trace_init_global_iter(&iter);
> +	/* Force not using the temp buffer */
> +	iter.temp = IGNORE_TEMP;
>  
>  	for_each_tracing_cpu(cpu) {
>  		atomic_inc(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled);


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-02  7:19           ` Masami Hiramatsu
@ 2020-04-02 18:14             ` Steven Rostedt
  2020-04-03  6:47               ` Masami Hiramatsu
  0 siblings, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2020-04-02 18:14 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Thu, 2 Apr 2020 16:19:20 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Wed, 1 Apr 2020 11:04:01 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Wed, 1 Apr 2020 10:21:12 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >   
> > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > > index 6519b7afc499..7f1466253ca8 100644
> > > --- a/kernel/trace/trace.c
> > > +++ b/kernel/trace/trace.c
> > > @@ -3487,6 +3487,14 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
> > >  	 */
> > >  	if (iter->ent && iter->ent != iter->temp) {
> > >  		if (!iter->temp || iter->temp_size < iter->ent_size) {
> > > +			/*
> > > +			 * This function is only used to add markers between
> > > +			 * events that are far apart (see trace_print_lat_context()),
> > > +			 * but if this is called in an atomic context (like NMIs)
> > > +			 * we can't call kmalloc(), thus just return NULL.
> > > +			 */
> > > +			if (in_atomic() || irqs_disabled())
> > > +				return NULL;
> > >  			kfree(iter->temp);
> > >  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
> > >  			if (!iter->temp)  
> > 
> > Peter informed me on IRC not to use in_atomic() as it doesn't catch
> > spin_locks when CONFIG_PREEMPT is not defined.
> > 
> > As the issue is just with ftrace_dump(), I'll have it use a static buffer
> > of 128 bytes instead. That should be big enough for most events, and if
> > not, it will just miss the markers.
> 
> 
> That sounds good, but the patch below seems to do a different thing.
> Does it just make trace_find_next_entry() always fail if it is
> called from ftrace_dump()?

Bah! I send my emails on a different machine than I create the patches on.
Below was my first iteration, then I decided to at least try to print some,
changed it, but never copied the new version over, and ended up sending the
last one.

Here's the actual patch!

-- Steve

From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Subject: [PATCH] tracing: Do not allocate buffer in trace_find_next_entry() in
 atomic

When dumping out the trace data in latency format, a check is made to peek
at the next event to compare its timestamp to the current one, and if the
delta is large enough, a marker is added to show so. But to do this, the
current event needs to be saved, otherwise peeking at the next event will
clobber the current event. To save the event, a temp buffer is used, and if
the event is bigger than the temp buffer, the temp buffer is freed and a
bigger buffer is allocated.

This allocation is a problem when called in atomic context. The only way
this gets called in atomic context is via ftrace_dump(). Thus, use a static
buffer of 128 bytes (which covers most events), and if the event is bigger
than that, simply return NULL. The callers of trace_find_next_entry() need
to handle a NULL case, as that's what would happen if the allocation failed.

Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian

Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 kernel/trace/trace.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 6519b7afc499..4b7bbfe7f809 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3472,6 +3472,9 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
 	return next;
 }
 
+#define STATIC_TEMP_BUF_SIZE	128
+static char static_temp_buf[STATIC_TEMP_BUF_SIZE];
+
 /* Find the next real entry, without updating the iterator itself */
 struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 					  int *ent_cpu, u64 *ent_ts)
@@ -3480,13 +3483,26 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
 	int ent_size = iter->ent_size;
 	struct trace_entry *entry;
 
+	/*
+	 * If called from ftrace_dump(), then the iter->temp buffer
+	 * will be the static_temp_buf and not created from kmalloc.
+	 * If the entry size is greater than the buffer, we can
+	 * not save it. Just return NULL in that case. This is only
+	 * used to add markers when two consecutive events' time
+	 * stamps have a large delta. See trace_print_lat_context()
+	 */
+	if (iter->temp == static_temp_buf &&
+	    STATIC_TEMP_BUF_SIZE < ent_size)
+		return NULL;
+
 	/*
 	 * The __find_next_entry() may call peek_next_entry(), which may
 	 * call ring_buffer_peek() that may make the contents of iter->ent
 	 * undefined. Need to copy iter->ent now.
 	 */
 	if (iter->ent && iter->ent != iter->temp) {
-		if (!iter->temp || iter->temp_size < iter->ent_size) {
+		if ((!iter->temp || iter->temp_size < iter->ent_size) &&
+		    !WARN_ON_ONCE(iter->temp == static_temp_buf)) {
 			kfree(iter->temp);
 			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
 			if (!iter->temp)
@@ -9203,6 +9219,8 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
 
 	/* Simulate the iterator */
 	trace_init_global_iter(&iter);
+	/* Can not use kmalloc for iter.temp */
+	iter.temp = static_temp_buf;
 
 	for_each_tracing_cpu(cpu) {
 		atomic_inc(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-02 18:14             ` Steven Rostedt
@ 2020-04-03  6:47               ` Masami Hiramatsu
  2020-04-03 13:16                 ` Steven Rostedt
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2020-04-03  6:47 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Thu, 2 Apr 2020 14:14:40 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 2 Apr 2020 16:19:20 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
> > On Wed, 1 Apr 2020 11:04:01 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > On Wed, 1 Apr 2020 10:21:12 -0400
> > > Steven Rostedt <rostedt@goodmis.org> wrote:
> > >   
> > > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > > > index 6519b7afc499..7f1466253ca8 100644
> > > > --- a/kernel/trace/trace.c
> > > > +++ b/kernel/trace/trace.c
> > > > @@ -3487,6 +3487,14 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
> > > >  	 */
> > > >  	if (iter->ent && iter->ent != iter->temp) {
> > > >  		if (!iter->temp || iter->temp_size < iter->ent_size) {
> > > > +			/*
> > > > +			 * This function is only used to add markers between
> > > > +			 * events that are far apart (see trace_print_lat_context()),
> > > > +			 * but if this is called in an atomic context (like NMIs)
> > > > +			 * we can't call kmalloc(), thus just return NULL.
> > > > +			 */
> > > > +			if (in_atomic() || irqs_disabled())
> > > > +				return NULL;
> > > >  			kfree(iter->temp);
> > > >  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
> > > >  			if (!iter->temp)  
> > > 
> > > Peter informed me on IRC not to use in_atomic() as it doesn't catch
> > > spin_locks when CONFIG_PREEMPT is not defined.
> > > 
> > > As the issue is just with ftrace_dump(), I'll have it use a static buffer
> > > of 128 bytes instead. That should be big enough for most events, and if
> > > not, it will just miss the markers.
> > 
> > 
> > That sounds good, but the patch below seems to do a different thing.
> > Does it just make trace_find_next_entry() always fail if it is
> > called from ftrace_dump()?
> 
> Bah! I send my emails on a different machine than I create the patches on.
> Below was my first iteration, then I decided to at least try to print some,
> changed it, but never copied the new version over, and ended up sending the
> last one.
> 

Ah, got it :)

> Here's the actual patch!
> 
> -- Steve
> 
> From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> Subject: [PATCH] tracing: Do not allocate buffer in trace_find_next_entry() in
>  atomic
> 
> When dumping out the trace data in latency format, a check is made to peek
> at the next event to compare its timestamp to the current one, and if the
> delta is large enough, a marker is added to show so. But to do this, the
> current event needs to be saved, otherwise peeking at the next event will
> clobber the current event. To save the event, a temp buffer is used, and if
> the event is bigger than the temp buffer, the temp buffer is freed and a
> bigger buffer is allocated.
> 
> This allocation is a problem when called in atomic context. The only way
> this gets called in atomic context is via ftrace_dump(). Thus, use a static
> buffer of 128 bytes (which covers most events), and if the event is bigger
> than that, simply return NULL. The callers of trace_find_next_entry() need
> to handle a NULL case, as that's what would happen if the allocation failed.
> 
> Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian
> 
> Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
> ---
>  kernel/trace/trace.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 6519b7afc499..4b7bbfe7f809 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3472,6 +3472,9 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
>  	return next;
>  }
>  
> +#define STATIC_TEMP_BUF_SIZE	128
> +static char static_temp_buf[STATIC_TEMP_BUF_SIZE];
> +
>  /* Find the next real entry, without updating the iterator itself */
>  struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
>  					  int *ent_cpu, u64 *ent_ts)
> @@ -3480,13 +3483,26 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
>  	int ent_size = iter->ent_size;
>  	struct trace_entry *entry;
>  
> +	/*
> +	 * If called from ftrace_dump(), then the iter->temp buffer
> +	 * will be the static_temp_buf and not created from kmalloc.
> +	 * If the entry size is greater than the buffer, we can
> +	 * not save it. Just return NULL in that case. This is only
> +	 * used to add markers when two consecutive events' time
> +	 * stamps have a large delta. See trace_print_lat_context()
> +	 */
> +	if (iter->temp == static_temp_buf &&
> +	    STATIC_TEMP_BUF_SIZE < ent_size)
> +		return NULL;
> +
>  	/*
>  	 * The __find_next_entry() may call peek_next_entry(), which may
>  	 * call ring_buffer_peek() that may make the contents of iter->ent
>  	 * undefined. Need to copy iter->ent now.
>  	 */
>  	if (iter->ent && iter->ent != iter->temp) {
> -		if (!iter->temp || iter->temp_size < iter->ent_size) {
> +		if ((!iter->temp || iter->temp_size < iter->ent_size) &&
> +		    !WARN_ON_ONCE(iter->temp == static_temp_buf)) {

This must not happen because ent_size == iter->ent_size.
If it happens, it should return NULL without any attempt at kfree() and
kmalloc(), because that would illegally free memory and cause a memory
leak. (Note that iter->temp is never freed in the ftrace_dump() path.)

Anyway, this condition is completely the same as the return check above.
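
To illustrate the hazard, a sketch (not the actual code):

	/* if iter->temp == static_temp_buf and we fell through: */
	kfree(iter->temp);	/* frees memory that was never kmalloc()ed */
	iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
				/* never freed in the ftrace_dump() path: leak */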

>  			kfree(iter->temp);
>  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
>  			if (!iter->temp)
> @@ -9203,6 +9219,8 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
>  
>  	/* Simulate the iterator */
>  	trace_init_global_iter(&iter);
> +	/* Can not use kmalloc for iter.temp */
> +	iter.temp = static_temp_buf;
>  

You may have missed initializing temp_size here:

	iter.temp_size = STATIC_TEMP_BUF_SIZE;

BTW, as I pointed out, if iter->temp is for avoiding the data being
overwritten by the ring buffer writer, do we need to use it for
ftrace_dump() too? It seems that ftrace_dump() stops tracing.

void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
{
[...]
        /* Only allow one dump user at a time. */
        if (atomic_inc_return(&dump_running) != 1) {
                atomic_dec(&dump_running);
                return;
        }

        /*
         * Always turn off tracing when we dump.
         * We don't need to show trace output of what happens
         * between multiple crashes.
         *
         * If the user does a sysrq-z, then they can re-enable
         * tracing with echo 1 > tracing_on.
         */
        tracing_off();



Thank you,


>  	for_each_tracing_cpu(cpu) {
>  		atomic_inc(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled);
> -- 
> 2.20.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h
  2020-04-03  6:47               ` Masami Hiramatsu
@ 2020-04-03 13:16                 ` Steven Rostedt
  0 siblings, 0 replies; 27+ messages in thread
From: Steven Rostedt @ 2020-04-03 13:16 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: kernel test robot, linux-kernel, Ingo Molnar, Andrew Morton,
	Peter Zijlstra, Alexei Starovoitov, Peter Wu, Jonathan Corbet,
	Tom Zanussi, Shuah Khan, bpf, lkp

On Fri, 3 Apr 2020 15:47:02 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> > +#define STATIC_TEMP_BUF_SIZE	128
> > +static char static_temp_buf[STATIC_TEMP_BUF_SIZE];
> > +
> >  /* Find the next real entry, without updating the iterator itself */
> >  struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
> >  					  int *ent_cpu, u64 *ent_ts)
> > @@ -3480,13 +3483,26 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter,
> >  	int ent_size = iter->ent_size;
> >  	struct trace_entry *entry;
> >  
> > +	/*
> > +	 * If called from ftrace_dump(), then the iter->temp buffer
> > +	 * will be the static_temp_buf and not created from kmalloc.
> > +	 * If the entry size is greater than the buffer, we can
> > +	 * not save it. Just return NULL in that case. This is only
> > +	 * used to add markers when two consecutive events' time
> > +	 * stamps have a large delta. See trace_print_lat_context()
> > +	 */
> > +	if (iter->temp == static_temp_buf &&
> > +	    STATIC_TEMP_BUF_SIZE < ent_size)
> > +		return NULL;
> > +
> >  	/*
> >  	 * The __find_next_entry() may call peek_next_entry(), which may
> >  	 * call ring_buffer_peek() that may make the contents of iter->ent
> >  	 * undefined. Need to copy iter->ent now.
> >  	 */
> >  	if (iter->ent && iter->ent != iter->temp) {
> > -		if (!iter->temp || iter->temp_size < iter->ent_size) {
> > +		if ((!iter->temp || iter->temp_size < iter->ent_size) &&
> > +		    !WARN_ON_ONCE(iter->temp == static_temp_buf)) {  
> 
> This must not happen because ent_size == iter->ent_size.
> If it happens, it should return NULL without any attempt at kfree() and
> kmalloc(), because that would illegally free memory and cause a memory
> leak. (Note that iter->temp is never freed in the ftrace_dump() path.)

Correct, which is why there's a ! in there. It's a paranoid check that
should never trigger, hence the WARN_ON_ONCE(). But as the "!" is not
easy to see, the above is the same logic as:

	if ((!iter->temp || iter->temp_size < iter->ent_size) &&
	    (iter->temp != static_temp_buf)) {

Thus, if we get to that test against static_temp_buf, and it's true, then
we will trigger the WARN_ON, but it won't call the kfree().

> 
> Anyway, this condition is completely the same as the return check above.
> 
> >  			kfree(iter->temp);
> >  			iter->temp = kmalloc(iter->ent_size, GFP_KERNEL);
> >  			if (!iter->temp)
> > @@ -9203,6 +9219,8 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
> >  
> >  	/* Simulate the iterator */
> >  	trace_init_global_iter(&iter);
> > +	/* Can not use kmalloc for iter.temp */
> > +	iter.temp = static_temp_buf;
> >    
> 
> You may have missed initializing temp_size here:
> 
> 	iter.temp_size = STATIC_TEMP_BUF_SIZE;

Oh, damn! You're right.

> 
> BTW, as I pointed out, if iter->temp is for avoiding the data being
> overwritten by the ring buffer writer, do we need to use it for
> ftrace_dump() too? It seems that ftrace_dump() stops tracing.

Yes, it is still needed. That's because the old way used to just leave
iter->ent pointing into the ring buffer itself. With the new way, the ring
buffer makes a copy of the event and passes that back. When you do another
read, it overwrites the copy. It doesn't matter if the ring buffer is
stopped or not.
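
As a sketch of the hazard (illustrative only, not the actual code):

	entry = iter->ent;			/* points into the iterator's event copy */
	trace_find_next_entry(iter, NULL, &ts);	/* peeking refills that same copy */
	/* 'entry' is now stale unless it was first saved into iter->temp */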

-- Steve

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2020-03-19 23:22 [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file Steven Rostedt
2020-03-19 23:22 ` [PATCH 01/12 v2] selftest/ftrace: Fix function trigger test to handle trace not disabling the tracer Steven Rostedt
2020-03-19 23:22 ` [PATCH 02/12 v2] tracing: Save off entry when peeking at next entry Steven Rostedt
2020-03-20  2:57   ` Masami Hiramatsu
     [not found]   ` <20200326091256.GR11705@shao2-debian>
2020-04-01 14:07     ` [tracing] cd8f62b481: BUG:sleeping_function_called_from_invalid_context_at_mm/slab.h Masami Hiramatsu
2020-04-01 14:21       ` Steven Rostedt
2020-04-01 15:04         ` Steven Rostedt
2020-04-02  7:19           ` Masami Hiramatsu
2020-04-02 18:14             ` Steven Rostedt
2020-04-03  6:47               ` Masami Hiramatsu
2020-04-03 13:16                 ` Steven Rostedt
2020-03-19 23:22 ` [PATCH 03/12 v2] ring-buffer: Have ring_buffer_empty() not depend on tracing stopped Steven Rostedt
2020-03-19 23:22 ` [PATCH 04/12 v2] ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance() Steven Rostedt
2020-03-19 23:22 ` [PATCH 05/12 v2] ring-buffer: Add page_stamp to iterator for synchronization Steven Rostedt
2020-03-19 23:22 ` [PATCH 06/12 v2] ring-buffer: Have rb_iter_head_event() handle concurrent writer Steven Rostedt
2020-03-19 23:22 ` [PATCH 07/12 v2] ring-buffer: Do not die if rb_iter_peek() fails more than thrice Steven Rostedt
2020-03-19 23:22 ` [PATCH 08/12 v2] ring-buffer: Optimize rb_iter_head_event() Steven Rostedt
2020-03-19 23:22 ` [PATCH 09/12 v2] ring-buffer: Do not disable recording when there is an iterator Steven Rostedt
2020-03-19 23:22 ` [PATCH 10/12 v2] tracing: Do not disable tracing when reading the trace file Steven Rostedt
2020-03-19 23:22 ` [PATCH 11/12 v2] ring-buffer/tracing: Have iterator acknowledge dropped events Steven Rostedt
2020-03-19 23:22 ` [PATCH 12/12 v2] tracing: Have the document reflect that the trace file keeps tracing enabled Steven Rostedt
2020-03-21 19:13 ` [PATCH 00/12 v2] ring-buffer/tracing: Remove disabling of ring buffer while reading trace file David Laight
2020-03-22 18:07   ` Steven Rostedt
2020-03-27  1:46   ` Steven Rostedt
2020-03-27 10:07     ` David Laight
2020-03-27 14:30       ` Steven Rostedt
2020-03-27 14:56         ` David Laight
