linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf record: mmap output file - v2
@ 2013-10-15  2:55 David Ahern
  2013-10-15  6:02 ` Ingo Molnar
  2013-10-15  7:31 ` Namhyung Kim
  0 siblings, 2 replies; 15+ messages in thread
From: David Ahern @ 2013-10-15  2:55 UTC (permalink / raw)
  To: acme, linux-kernel
  Cc: David Ahern, Ingo Molnar, Frederic Weisbecker, Peter Zijlstra,
	Jiri Olsa, Namhyung Kim, Mike Galbraith, Stephane Eranian

When recording raw_syscalls for the entire system, e.g.,
    perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1

you end up with a negative feedback loop as perf itself calls
write() fairly often. This patch handles the problem by mmap'ing the
file in chunks of 64M at a time and copies events from the event buffers
to the file avoiding write system calls.

Before (with write syscall):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]

After (using mmap):

perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

In addition to perf-trace benefits using mmap lowers the overhead of
perf-record. For example,

  perf stat -i -- perf record -g -o /tmp/perf.data openssl speed aes

showsi a drop in time, CPU cycles, and instructions all drop by more than a
factor of 3. Jiri also ran a test that showed a big improvement.

v2: removed msync call before munmap per Jiri's suggestion

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Stephane Eranian <eranian@google.com>
---
 tools/perf/builtin-record.c |   85 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 92ca5419073b..a1cf20e65089 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -29,6 +29,9 @@
 #include <sched.h>
 #include <sys/mman.h>
 
+/* mmap file big chunks at a time */
+#define MMAP_OUTPUT_SIZE   (64*1024*1024)
+
 #ifndef HAVE_ON_EXIT_SUPPORT
 #ifndef ATEXIT_MAX
 #define ATEXIT_MAX 32
@@ -64,6 +67,14 @@ static void __handle_on_exit_funcs(void)
 struct perf_record {
 	struct perf_tool	tool;
 	struct perf_record_opts	opts;
+
+	/* for MMAP based file writes */
+	void			*mmap_addr;
+	u64			bytes_at_mmap_start; /* bytes in file when mmap use starts */
+	u64			mmap_offset;    /* current location within mmap */
+	size_t			mmap_size;      /* size of mmap segments */
+	bool			use_mmap;
+
 	u64			bytes_written;
 	const char		*output_name;
 	struct perf_evlist	*evlist;
@@ -82,8 +93,65 @@ static void advance_output(struct perf_record *rec, size_t size)
 	rec->bytes_written += size;
 }
 
+static int do_mmap_output(struct perf_record *rec, void *buf, size_t size)
+{
+	u64 remaining;
+	off_t offset;
+
+	if (rec->mmap_addr == NULL) {
+do_mmap:
+		offset = rec->bytes_at_mmap_start + rec->bytes_written;
+		if (offset < (ssize_t) rec->mmap_size) {
+			rec->mmap_offset = offset;
+			offset = 0;
+		} else
+			rec->mmap_offset = 0;
+
+		rec->mmap_addr = mmap(NULL, rec->mmap_size,
+				     PROT_WRITE | PROT_READ,
+				     MAP_SHARED,
+				     rec->output,
+				     offset);
+
+		if (rec->mmap_addr == MAP_FAILED) {
+			pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
+			return -1;
+		}
+
+		/* expand file to include this mmap segment */
+		if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
+			pr_err("ftruncate failed\n");
+			return -1;
+		}
+	}
+
+	remaining = rec->mmap_size - rec->mmap_offset;
+
+	if (size > remaining) {
+		memcpy(rec->mmap_addr + rec->mmap_offset, buf, remaining);
+		rec->bytes_written += remaining;
+
+		size -= remaining;
+		buf  += remaining;
+
+		munmap(rec->mmap_addr, rec->mmap_size);
+		goto do_mmap;
+	}
+
+	if (size) {
+		memcpy(rec->mmap_addr + rec->mmap_offset, buf, size);
+		rec->bytes_written += size;
+		rec->mmap_offset += size;
+	}
+
+	return 0;
+}
+
 static int write_output(struct perf_record *rec, void *buf, size_t size)
 {
+	if (rec->use_mmap)
+		return do_mmap_output(rec, buf, size);
+
 	while (size) {
 		int ret = write(rec->output, buf, size);
 
@@ -546,6 +614,11 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
 	if (forks)
 		perf_evlist__start_workload(evsel_list);
 
+	if (!rec->opts.pipe_output && stat(output_name, &st) == 0) {
+		rec->use_mmap = true;
+		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
+	}
+
 	for (;;) {
 		int hits = rec->samples;
 
@@ -572,6 +645,17 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
 		}
 	}
 
+	if (rec->use_mmap) {
+		off_t len = rec->bytes_at_mmap_start + rec->bytes_written;
+
+		rec->use_mmap = false;
+		munmap(rec->mmap_addr, rec->mmap_size);
+		rec->mmap_addr = NULL;
+
+		if (ftruncate(rec->output, len) != 0)
+			pr_err("ftruncate failed\n");
+	}
+
 	if (quiet || signr == SIGUSR1)
 		return 0;
 
@@ -807,6 +891,7 @@ static struct perf_record record = {
 			.uses_mmap   = true,
 		},
 	},
+	.mmap_size = MMAP_OUTPUT_SIZE,
 };
 
 #define CALLCHAIN_HELP "do call-graph (stack chain/backtrace) recording: "
-- 
1.7.10.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  2:55 [PATCH] perf record: mmap output file - v2 David Ahern
@ 2013-10-15  6:02 ` Ingo Molnar
  2013-10-15  7:09   ` Namhyung Kim
  2013-10-15  7:31 ` Namhyung Kim
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2013-10-15  6:02 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Frederic Weisbecker, Peter Zijlstra,
	Jiri Olsa, Namhyung Kim, Mike Galbraith, Stephane Eranian


* David Ahern <dsahern@gmail.com> wrote:

> +	/* for MMAP based file writes */
> +	void			*mmap_addr;
> +	u64			bytes_at_mmap_start; /* bytes in file when mmap use starts */
> +	u64			mmap_offset;    /* current location within mmap */
> +	size_t			mmap_size;      /* size of mmap segments */
> +	bool			use_mmap;

> +	if (!rec->opts.pipe_output && stat(output_name, &st) == 0) {
> +		rec->use_mmap = true;
> +		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
> +	}

1)

I think __cmd_record() has become way too large, nearly 300 lines of code. 
It would be nice to split it into 2-3 helpers that operate on 'struct 
perf_record' or so.

2)

The stat() seems superfluous, here in __cmd_record() we've just checked 
the output_name and made sure it exists. Can that stat() call ever fail?

3)

The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
correctly, in every 'perf record' invocation, rec->bytes_written starts at 
0 - i.e. we don't have repeat invocations of cmd_record().

That means that this:

		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;

is really:

		rec->bytes_at_mmap_start = st.st_size;

furthermore, since we don't allow appends anymore, st.st_size ought to be 
zero as well.

Which means that ->bytes_at_mmap_start is always zero - and could be 
eliminated altogether.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  6:02 ` Ingo Molnar
@ 2013-10-15  7:09   ` Namhyung Kim
  2013-10-15  7:25     ` Ingo Molnar
  2013-10-15 13:25     ` David Ahern
  0 siblings, 2 replies; 15+ messages in thread
From: Namhyung Kim @ 2013-10-15  7:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Ahern, acme, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

Hi Ingo,

On Tue, 15 Oct 2013 08:02:00 +0200, Ingo Molnar wrote:
> * David Ahern <dsahern@gmail.com> wrote:
>
>> +	/* for MMAP based file writes */
>> +	void			*mmap_addr;
>> +	u64			bytes_at_mmap_start; /* bytes in file when mmap use starts */
>> +	u64			mmap_offset;    /* current location within mmap */
>> +	size_t			mmap_size;      /* size of mmap segments */
>> +	bool			use_mmap;
>
>> +	if (!rec->opts.pipe_output && stat(output_name, &st) == 0) {
>> +		rec->use_mmap = true;
>> +		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
>> +	}
>
> 1)
>
> I think __cmd_record() has become way too large, nearly 300 lines of code. 
> It would be nice to split it into 2-3 helpers that operate on 'struct 
> perf_record' or so.

Agreed.

>
> 2)
>
> The stat() seems superfluous, here in __cmd_record() we've just checked 
> the output_name and made sure it exists. Can that stat() call ever fail?

AFAICS it's needed to check current file size.  But I think it's better
to use fstat().

>
> 3)
>
> The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> 0 - i.e. we don't have repeat invocations of cmd_record().

rec->bytes_written is updated when it writes to the output file for
synthesizing COMM/MMAP events (this mmap output is not used at that time).

>
> That means that this:
>
> 		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
>
> is really:
>
> 		rec->bytes_at_mmap_start = st.st_size;
>
> furthermore, since we don't allow appends anymore, st.st_size ought to be 
> zero as well.
>
> Which means that ->bytes_at_mmap_start is always zero - and could be 
> eliminated altogether.

No, st_size is bigger than rec->bytes_written due to the
perf_file_header which is written without updating rec->bytes_written.

Actually I worried about the mmap offset not being aligned to page
size.  But it seems that's not a problem.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:09   ` Namhyung Kim
@ 2013-10-15  7:25     ` Ingo Molnar
  2013-10-15  8:17       ` Namhyung Kim
  2013-10-15 12:22       ` Jiri Olsa
  2013-10-15 13:25     ` David Ahern
  1 sibling, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:25 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: David Ahern, acme, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian


* Namhyung Kim <namhyung@kernel.org> wrote:

> > 3)
> >
> > The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> > correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> > 0 - i.e. we don't have repeat invocations of cmd_record().
> 
> rec->bytes_written is updated when it writes to the output file for 
> synthesizing COMM/MMAP events (this mmap output is not used at that 
> time).

Btw., while looking into it, I think advance_output() needlessly 
obfuscates as well:

static void advance_output(struct perf_record *rec, size_t size)
{
        rec->bytes_written += size;
}

that code should just be written open coded.

So I think all this needs a few good rounds of cleanups, before we can 
complicate it with a new feature. (the cleanups can be on top of the 
feature, if they go in at the same time.)

> > That means that this:
> >
> > 		rec->bytes_at_mmap_start = st.st_size - rec->bytes_written;
> >
> > is really:
> >
> > 		rec->bytes_at_mmap_start = st.st_size;
> >
> > furthermore, since we don't allow appends anymore, st.st_size ought to be 
> > zero as well.
> >
> > Which means that ->bytes_at_mmap_start is always zero - and could be 
> > eliminated altogether.
> 
> No, st_size is bigger than rec->bytes_written due to the 
> perf_file_header which is written without updating rec->bytes_written.

Since all this is code that executes once during __cmd_record(), is this 
all about the header writeout?

That is what confused me about the stat() call and that's messy really: we 
are the ones who write the file header, we have a very good idea about how 
many bytes we wrote to the file! It should be entirely unnecessary to lose 
that information and then execute a system call to recover that 
information...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  2:55 [PATCH] perf record: mmap output file - v2 David Ahern
  2013-10-15  6:02 ` Ingo Molnar
@ 2013-10-15  7:31 ` Namhyung Kim
  2013-10-15  7:44   ` Ingo Molnar
  2013-10-15 13:35   ` David Ahern
  1 sibling, 2 replies; 15+ messages in thread
From: Namhyung Kim @ 2013-10-15  7:31 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

Hi David,

On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
> When recording raw_syscalls for the entire system, e.g.,
>     perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>
> you end up with a negative feedback loop as perf itself calls
> write() fairly often. This patch handles the problem by mmap'ing the
> file in chunks of 64M at a time and copies events from the event buffers
> to the file avoiding write system calls.
>
> Before (with write syscall):
>
> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]
>
> After (using mmap):
>
> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
> [ perf record: Woken up 31 times to write data ]
> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]

Why do they have that different size?

>
[SNIP]
> +/* mmap file big chunks at a time */
> +#define MMAP_OUTPUT_SIZE   (64*1024*1024)

Why did you choose 64MB for the size?  Did you also test other sizes?


[SNIP]
> +
> +		rec->mmap_addr = mmap(NULL, rec->mmap_size,
> +				     PROT_WRITE | PROT_READ,
> +				     MAP_SHARED,
> +				     rec->output,
> +				     offset);
> +
> +		if (rec->mmap_addr == MAP_FAILED) {
> +			pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
> +			return -1;
> +		}
> +
> +		/* expand file to include this mmap segment */
> +		if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
> +			pr_err("ftruncate failed\n");
> +			return -1;
> +		}

I think this mmap + ftruncate should be reordered.  Although it looks
work without problems the mmap man pages says it's unspecified behavior.

       A file is mapped in multiples of the page size.  For a file that is not
       a multiple of the page  size,  the  remaining  memory  is  zeroed  when
       mapped, and writes to that region are not written out to the file.  The
       effect of changing the size of the underlying file of a mapping on  the
       pages  that  correspond  to  added  or  removed  regions of the file is
       unspecified.


Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:31 ` Namhyung Kim
@ 2013-10-15  7:44   ` Ingo Molnar
  2013-10-15 13:45     ` David Ahern
  2013-10-15 13:35   ` David Ahern
  1 sibling, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:44 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: David Ahern, acme, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian


* Namhyung Kim <namhyung@kernel.org> wrote:

> [SNIP]
> > +/* mmap file big chunks at a time */
> > +#define MMAP_OUTPUT_SIZE   (64*1024*1024)
> 
> Why did you choose 64MB for the size?  Did you also test other sizes?

Btw., should this value go up if the ring buffer (mmap_pages) is larger 
than 64MB?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:25     ` Ingo Molnar
@ 2013-10-15  8:17       ` Namhyung Kim
  2013-10-15 12:22       ` Jiri Olsa
  1 sibling, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2013-10-15  8:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Ahern, acme, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

On Tue, 15 Oct 2013 09:25:40 +0200, Ingo Molnar wrote:
> * Namhyung Kim <namhyung@kernel.org> wrote:
>> No, st_size is bigger than rec->bytes_written due to the 
>> perf_file_header which is written without updating rec->bytes_written.
>
> Since all this is code that executes once during __cmd_record(), is this 
> all about the header writeout?
>
> That is what confused me about the stat() call and that's messy really: we 
> are the ones who write the file header, we have a very good idea about how 
> many bytes we wrote to the file! It should be entirely unnecessary to lose 
> that information and then execute a system call to recover that 
> information...

Right, I think session->header.data_offset has the info already.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:25     ` Ingo Molnar
  2013-10-15  8:17       ` Namhyung Kim
@ 2013-10-15 12:22       ` Jiri Olsa
  2013-10-15 13:20         ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2013-10-15 12:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Namhyung Kim, David Ahern, acme, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Mike Galbraith,
	Stephane Eranian

On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
> 
> * Namhyung Kim <namhyung@kernel.org> wrote:
> 
> > > 3)
> > >
> > > The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> > > correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> > > 0 - i.e. we don't have repeat invocations of cmd_record().
> > 
> > rec->bytes_written is updated when it writes to the output file for 
> > synthesizing COMM/MMAP events (this mmap output is not used at that 
> > time).
> 
> Btw., while looking into it, I think advance_output() needlessly 
> obfuscates as well:
> 
> static void advance_output(struct perf_record *rec, size_t size)
> {
>         rec->bytes_written += size;
> }
> 
> that code should just be written open coded.
> 
> So I think all this needs a few good rounds of cleanups, before we can 
> complicate it with a new feature. (the cleanups can be on top of the 
> feature, if they go in at the same time.)

I sent some file code cleanup week ago, I'll rebase
and resend it soon:
http://marc.info/?l=linux-kernel&m=138113836428425&w=2

it's mostly about centralizing the code into file
object.. got no comments so far

jirka

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15 12:22       ` Jiri Olsa
@ 2013-10-15 13:20         ` Ingo Molnar
  0 siblings, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2013-10-15 13:20 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Namhyung Kim, David Ahern, acme, linux-kernel,
	Frederic Weisbecker, Peter Zijlstra, Mike Galbraith,
	Stephane Eranian


* Jiri Olsa <jolsa@redhat.com> wrote:

> On Tue, Oct 15, 2013 at 09:25:40AM +0200, Ingo Molnar wrote:
> > 
> > * Namhyung Kim <namhyung@kernel.org> wrote:
> > 
> > > > 3)
> > > >
> > > > The rec->bytes_at_mmap_start field feels a bit weird. If I read the code 
> > > > correctly, in every 'perf record' invocation, rec->bytes_written starts at 
> > > > 0 - i.e. we don't have repeat invocations of cmd_record().
> > > 
> > > rec->bytes_written is updated when it writes to the output file for 
> > > synthesizing COMM/MMAP events (this mmap output is not used at that 
> > > time).
> > 
> > Btw., while looking into it, I think advance_output() needlessly 
> > obfuscates as well:
> > 
> > static void advance_output(struct perf_record *rec, size_t size)
> > {
> >         rec->bytes_written += size;
> > }
> > 
> > that code should just be written open coded.
> > 
> > So I think all this needs a few good rounds of cleanups, before we can 
> > complicate it with a new feature. (the cleanups can be on top of the 
> > feature, if they go in at the same time.)
> 
> I sent some file code cleanup week ago, I'll rebase and resend it soon:
>
> http://marc.info/?l=linux-kernel&m=138113836428425&w=2

Nice!

> it's mostly about centralizing the code into file object.. got no 
> comments so far

Too much development going on I suspect :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:09   ` Namhyung Kim
  2013-10-15  7:25     ` Ingo Molnar
@ 2013-10-15 13:25     ` David Ahern
  2013-10-16  1:24       ` Namhyung Kim
  1 sibling, 1 reply; 15+ messages in thread
From: David Ahern @ 2013-10-15 13:25 UTC (permalink / raw)
  To: Namhyung Kim, Ingo Molnar
  Cc: acme, linux-kernel, Frederic Weisbecker, Peter Zijlstra,
	Jiri Olsa, Mike Galbraith, Stephane Eranian

On 10/15/13 1:09 AM, Namhyung Kim wrote:
>> The stat() seems superfluous, here in __cmd_record() we've just checked
>> the output_name and made sure it exists. Can that stat() call ever fail?
>
> AFAICS it's needed to check current file size.  But I think it's better
> to use fstat().

Sure fstat could be used over stat -- if it ends up staying.


>>
>> 3)
>>
>> The rec->bytes_at_mmap_start field feels a bit weird. If I read the code
>> correctly, in every 'perf record' invocation, rec->bytes_written starts at
>> 0 - i.e. we don't have repeat invocations of cmd_record().
>
> rec->bytes_written is updated when it writes to the output file for
> synthesizing COMM/MMAP events (this mmap output is not used at that time).

Ingo: I went through a number of itereations before using the 
bytes_at_mmap_start. One of those was to use the bytes_written counter. 
All failed. Header + synthesized events are written to the file before 
we start farming the ring buffers.

Perhaps a good code cleanup will help figure out why. I needed the 
functionality ASAP for use with perf-trace -a so I stuck with the new 
variable. Since this change is working out well, I will look at a code 
clean up on the next round.

I am traveling to LinuxCon / KVM Forum / Tracing Forum on Friday. 
Perhaps the clean up and followup patch can be done on the long plane 
ride; more likely when I return which means 3.14 material.

> Actually I worried about the mmap offset not being aligned to page
> size.  But it seems that's not a problem.

This code snippet makes sure the mmap offset is a multiple of 64M 
(rec->mmap_size). offset is the argument to mmap; mmap_offset is the 
where we are within the mmap for the next copy:

+		offset = rec->bytes_at_mmap_start + rec->bytes_written;
+		if (offset < (ssize_t) rec->mmap_size) {
+			rec->mmap_offset = offset;
+			offset = 0;
+		} else
+			rec->mmap_offset = 0;


David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:31 ` Namhyung Kim
  2013-10-15  7:44   ` Ingo Molnar
@ 2013-10-15 13:35   ` David Ahern
  2013-10-16  1:52     ` Namhyung Kim
  1 sibling, 1 reply; 15+ messages in thread
From: David Ahern @ 2013-10-15 13:35 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

On 10/15/13 1:31 AM, Namhyung Kim wrote:
> Hi David,
>
> On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
>> When recording raw_syscalls for the entire system, e.g.,
>>      perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>>
>> you end up with a negative feedback loop as perf itself calls
>> write() fairly often. This patch handles the problem by mmap'ing the
>> file in chunks of 64M at a time and copies events from the event buffers
>> to the file avoiding write system calls.
>>
>> Before (with write syscall):
>>
>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>> [ perf record: Woken up 0 times to write data ]
>> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]
>>
>> After (using mmap):
>>
>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>> [ perf record: Woken up 31 times to write data ]
>> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]
>
> Why do they have that different size?

perf calls write() for each mmap, each time through the loop. Each write 
generates 2 events (syscall entry + exit) -- ie., generates more events. 
That's the negative feedback loop.


> [SNIP]
>> +
>> +		rec->mmap_addr = mmap(NULL, rec->mmap_size,
>> +				     PROT_WRITE | PROT_READ,
>> +				     MAP_SHARED,
>> +				     rec->output,
>> +				     offset);
>> +
>> +		if (rec->mmap_addr == MAP_FAILED) {
>> +			pr_err("mmap failed: %d: %s\n", errno, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		/* expand file to include this mmap segment */
>> +		if (ftruncate(rec->output, offset + rec->mmap_size) != 0) {
>> +			pr_err("ftruncate failed\n");
>> +			return -1;
>> +		}
>
> I think this mmap + ftruncate should be reordered.  Although it looks
> work without problems the mmap man pages says it's unspecified behavior.
>
>         A file is mapped in multiples of the page size.  For a file that is not
>         a multiple of the page  size,  the  remaining  memory  is  zeroed  when
>         mapped, and writes to that region are not written out to the file.  The
>         effect of changing the size of the underlying file of a mapping on  the
>         pages  that  correspond  to  added  or  removed  regions of the file is
>         unspecified.

The mmap only expands the address range; the ftruncate expands the file 
behind the mmap. Both are needed and must succeed to function properly, 
and I don't see how the order matters. ie.,

This order has an extra call on the failure path:
   ftruncate
   mmap
   - on failure call ftruncate to reset file size

The order I have does not have that problem:
   mmap
   ftruncate

Here on failure just return -1 and we end the session.

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15  7:44   ` Ingo Molnar
@ 2013-10-15 13:45     ` David Ahern
  0 siblings, 0 replies; 15+ messages in thread
From: David Ahern @ 2013-10-15 13:45 UTC (permalink / raw)
  To: Ingo Molnar, Namhyung Kim
  Cc: acme, linux-kernel, Frederic Weisbecker, Peter Zijlstra,
	Jiri Olsa, Mike Galbraith, Stephane Eranian

On 10/15/13 1:44 AM, Ingo Molnar wrote:
>
> * Namhyung Kim <namhyung@kernel.org> wrote:
>
>> [SNIP]
>>> +/* mmap file big chunks at a time */
>>> +#define MMAP_OUTPUT_SIZE   (64*1024*1024)
>>
>> Why did you choose 64MB for the size?  Did you also test other sizes?
>
> Btw., should this value go up if the ring buffer (mmap_pages) is larger
> than 64MB?
>

I made mmap_size a variable:
+	size_t			mmap_size;      /* size of mmap segments */

with the above initial value. I was planning to make it in an option and 
just forgot to complete it.

Why 64M? mmap / munmap are also system calls and I was looking to trade 
off huge file size jumps versus the frequency of adjusting the maps. 64M 
Was just a nice round number between 1 and 100. 8, 16 are too small. 128 
seems to big for a default. That left only 32 and 64. 64M seems the 
better trade off of the two.

Making it a user knob would help with smaller deployments. Could also 
have mmap_size = 0 mean turn it off (use write over mmap).

Perhaps something that adjust automatically would be useful too. e.g., 
For the case that motivates the change I have 16 cpus each with a 4M 
buffer (1024 mmap pages). Should we generically set the size:

mmap_size = ncpus_online * mmap_pages * page_size?

Do that only for system wide profiling?

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15 13:25     ` David Ahern
@ 2013-10-16  1:24       ` Namhyung Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Namhyung Kim @ 2013-10-16  1:24 UTC (permalink / raw)
  To: David Ahern
  Cc: Ingo Molnar, acme, linux-kernel, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

Hi David,

On Tue, 15 Oct 2013 07:25:04 -0600, David Ahern wrote:
> On 10/15/13 1:09 AM, Namhyung Kim wrote:
>>> The stat() seems superfluous, here in __cmd_record() we've just checked
>>> the output_name and made sure it exists. Can that stat() call ever fail?
>>
>> AFAICS it's needed to check current file size.  But I think it's better
>> to use fstat().
>
> Sure fstat could be used over stat -- if it ends up staying.
>
>
>>>
>>> 3)
>>>
>>> The rec->bytes_at_mmap_start field feels a bit weird. If I read the code
>>> correctly, in every 'perf record' invocation, rec->bytes_written starts at
>>> 0 - i.e. we don't have repeat invocations of cmd_record().
>>
>> rec->bytes_written is updated when it writes to the output file for
>> synthesizing COMM/MMAP events (this mmap output is not used at that time).
>
> Ingo: I went through a number of itereations before using the
> bytes_at_mmap_start. One of those was to use the bytes_written
> counter. All failed. Header + synthesized events are written to the
> file before we start farming the ring buffers.
>
> Perhaps a good code cleanup will help figure out why. I needed the
> functionality ASAP for use with perf-trace -a so I stuck with the new
> variable. Since this change is working out well, I will look at a code
> clean up on the next round.

session->header.data_offset ?

>
> I am traveling to LinuxCon / KVM Forum / Tracing Forum on
> Friday. Perhaps the clean up and followup patch can be done on the
> long plane ride; more likely when I return which means 3.14 material.

See you there :)

>
>> Actually I worried about the mmap offset not being aligned to page
>> size.  But it seems that's not a problem.
>
> This code snippet makes sure the mmap offset is a multiple of 64M
> (rec->mmap_size). offset is the argument to mmap; mmap_offset is the
> where we are within the mmap for the next copy:
>
> +		offset = rec->bytes_at_mmap_start + rec->bytes_written;
> +		if (offset < (ssize_t) rec->mmap_size) {
> +			rec->mmap_offset = offset;
> +			offset = 0;
> +		} else
> +			rec->mmap_offset = 0;
>

Oh, I overlooked this code, it actually aligned offset.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-15 13:35   ` David Ahern
@ 2013-10-16  1:52     ` Namhyung Kim
  2013-10-16  1:58       ` David Ahern
  0 siblings, 1 reply; 15+ messages in thread
From: Namhyung Kim @ 2013-10-16  1:52 UTC (permalink / raw)
  To: David Ahern
  Cc: acme, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

On Tue, 15 Oct 2013 07:35:53 -0600, David Ahern wrote:
> On 10/15/13 1:31 AM, Namhyung Kim wrote:
>> Hi David,
>>
>> On Mon, 14 Oct 2013 20:55:31 -0600, David Ahern wrote:
>>> When recording raw_syscalls for the entire system, e.g.,
>>>      perf record -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>>>
>>> you end up with a negative feedback loop as perf itself calls
>>> write() fairly often. This patch handles the problem by mmap'ing the
>>> file in chunks of 64M at a time and copies events from the event buffers
>>> to the file avoiding write system calls.
>>>
>>> Before (with write syscall):
>>>
>>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>>> [ perf record: Woken up 0 times to write data ]
>>> [ perf record: Captured and wrote 81.843 MB /tmp/perf.data (~3575786 samples) ]
>>>
>>> After (using mmap):
>>>
>>> perf record -o /tmp/perf.data -e raw_syscalls:*,sched:sched_switch -a -- sleep 1
>>> [ perf record: Woken up 31 times to write data ]
>>> [ perf record: Captured and wrote 8.203 MB /tmp/perf.data (~358388 samples) ]
>>
>> Why do they have that different size?
>
> perf calls write() for each mmap, each time through the loop. Each
> write generates 2 events (syscall entry + exit) -- ie., generates more
> events. That's the negative feedback loop.

Aha, okay.  So it mostly matters to syscall tracing, right?  For a
normal record session, it seems that the effect is not that large:

Before:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20420 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.544 MB perf.data (~23750 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.482 MB perf.data (~21073 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
         5.174476094 seconds time elapsed                                          ( +-  0.07% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 1000000 pipe operations between two processes
  
       Total time: 21.271 [sec]
  
        21.271357 usecs/op
            47011 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.643 MB perf.data (~246524 samples) ]


After:

  $ perf stat -r3 --null --sync -- perf record -a -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.459 MB perf.data (~20055 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.463 MB perf.data (~20230 samples) ]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.467 MB perf.data (~20401 samples) ]
  
   Performance counter stats for 'perf record -a -- sleep 5' (3 runs):
  
         5.085910919 seconds time elapsed                                          ( +-  0.06% )

  $ perf record -- perf bench sched pipe
  # Running sched/pipe benchmark...
  # Executed 1000000 pipe operations between two processes
  
       Total time: 21.175 [sec]
  
        21.175406 usecs/op
            47224 ops/sec
  [ perf record: Woken up 21 times to write data ]
  [ perf record: Captured and wrote 5.612 MB perf.data (~245206 samples) ]


Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] perf record: mmap output file - v2
  2013-10-16  1:52     ` Namhyung Kim
@ 2013-10-16  1:58       ` David Ahern
  0 siblings, 0 replies; 15+ messages in thread
From: David Ahern @ 2013-10-16  1:58 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, linux-kernel, Ingo Molnar, Frederic Weisbecker,
	Peter Zijlstra, Jiri Olsa, Mike Galbraith, Stephane Eranian

On 10/15/13 7:52 PM, Namhyung Kim wrote:
> Aha, okay.  So it mostly matters to syscall tracing, right?  For a
> normal record session, it seems that the effect is not that large:

Yes, that's in the description "When recording raw_syscalls for the 
entire system"

There is a small benefit to all record sessions -- mmap+memcpy has less 
overhead than write(). I still need to look at Ingo's suggestion to use 
non-temporal stores which might reduce the overhead of the memcpy.

Try a workload that generates a HUGE data file -- say a full kernel 
build (e.g., perf record -g -- make O=/tmp/kbuild -j 16). You should see 
a much larger benefit from the mmap route. Be sure to use your callchain 
enhancements to look at that 1+G file. ;-)

David

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2013-10-16  1:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-15  2:55 [PATCH] perf record: mmap output file - v2 David Ahern
2013-10-15  6:02 ` Ingo Molnar
2013-10-15  7:09   ` Namhyung Kim
2013-10-15  7:25     ` Ingo Molnar
2013-10-15  8:17       ` Namhyung Kim
2013-10-15 12:22       ` Jiri Olsa
2013-10-15 13:20         ` Ingo Molnar
2013-10-15 13:25     ` David Ahern
2013-10-16  1:24       ` Namhyung Kim
2013-10-15  7:31 ` Namhyung Kim
2013-10-15  7:44   ` Ingo Molnar
2013-10-15 13:45     ` David Ahern
2013-10-15 13:35   ` David Ahern
2013-10-16  1:52     ` Namhyung Kim
2013-10-16  1:58       ` David Ahern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).