* [PATCH v3 1/2]: perf util: map data buffer for preserving collected data
       [not found] <bdd669cd-a6dc-d592-4655-33b7bfc46cbb@linux.intel.com>
@ 2018-08-27 18:16 ` Alexey Budankov
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  2018-08-28  8:59 ` [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Jiri Olsa
  2 siblings, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-27 18:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


The data buffer and the accompanying AIO control block are allocated
in the perf_mmap object, and the mapped data buffer size is equal to
that of the kernel buffer.

The buffer is then used to preserve profiling data that is ready for
dumping and to queue it for asynchronous writing into the perf trace
through the implemented record__aio_write() function.

The mmap_aio control structure, sized to the number of per-cpu kernel
buffers, keeps pointers to the enqueued AIO control blocks so that
completed AIO operations can be monitored.
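
For illustration only, a rough stand-in for the affected structures,
trimmed to the fields this patch touches (not the real tools/perf
definitions):

  #include <aio.h>

  struct perf_mmap {
  	void		*base;		/* kernel ring buffer mapping */
  	void		*data;		/* user space buffer of the same size */
  	struct aiocb	 cblock;	/* AIO control block for this map */
  };

  struct perf_evlist {
  	int		  nr_mmaps;
  	struct perf_mmap *mmap;
  	struct aiocb	**mmap_aio;	/* one slot per per-cpu kernel buffer */
  };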

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v2:
 - converted zalloc() to calloc() for allocation of mmap_aio array,
 - fixed a typo and adjusted the fallback branch code;
---
 tools/perf/util/evlist.c |  7 +++++++
 tools/perf/util/evlist.h |  2 ++
 tools/perf/util/mmap.c   | 12 ++++++++++++
 tools/perf/util/mmap.h   |  3 +++
 4 files changed, 24 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e7a4b31a84fb..04596f74766c 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -718,6 +718,8 @@ static void perf_evlist__munmap_nofree(struct perf_evlist *evlist)
 void perf_evlist__munmap(struct perf_evlist *evlist)
 {
 	perf_evlist__munmap_nofree(evlist);
+	if (evlist->mmap_aio)
+		zfree(&evlist->mmap_aio);
 	zfree(&evlist->mmap);
 	zfree(&evlist->overwrite_mmap);
 }
@@ -749,6 +751,11 @@ static struct perf_mmap *perf_evlist__alloc_mmap(struct perf_evlist *evlist,
 		 */
 		refcount_set(&map[i].refcnt, 0);
 	}
+
+	evlist->mmap_aio = calloc(evlist->nr_mmaps, sizeof(struct aiocb *));
+	if (!evlist->mmap_aio)
+		zfree(&map);
+
 	return map;
 }
 
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..f98b949561fd 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -15,6 +15,7 @@
 #include "util.h"
 #include <signal.h>
 #include <unistd.h>
+#include <aio.h>
 
 struct pollfd;
 struct thread_map;
@@ -43,6 +44,7 @@ struct perf_evlist {
 	} workload;
 	struct fdarray	 pollfd;
 	struct perf_mmap *mmap;
+	struct aiocb	 **mmap_aio;
 	struct perf_mmap *overwrite_mmap;
 	struct thread_map *threads;
 	struct cpu_map	  *cpus;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index fc832676a798..e71d46cb01cc 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+	if (map->data != NULL) {
+		munmap(map->data, perf_mmap__mmap_len(map));
+		map->data = NULL;
+	}
 	if (map->base != NULL) {
 		munmap(map->base, perf_mmap__mmap_len(map));
 		map->base = NULL;
@@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 		map->base = NULL;
 		return -1;
 	}
+	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (map->data == MAP_FAILED) {
+		pr_debug2("failed to mmap perf event data buffer, error %d\n",
+				errno);
+		map->data = NULL;
+		return -1;
+	}
 	map->fd = fd;
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index d82294db1295..1974e621e36b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <asm/barrier.h>
 #include <stdbool.h>
+#include <aio.h>
 #include "auxtrace.h"
 #include "event.h"
 
@@ -25,6 +26,8 @@ struct perf_mmap {
 	bool		 overwrite;
 	struct auxtrace_mmap auxtrace_mmap;
 	char		 event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
+	void 		 *data;
+	struct aiocb	 cblock;
 };
 
 /*
 


* [PATCH v3 2/2]: perf record: enable asynchronous trace writing
       [not found] <bdd669cd-a6dc-d592-4655-33b7bfc46cbb@linux.intel.com>
  2018-08-27 18:16 ` [PATCH v3 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
@ 2018-08-27 18:16 ` Alexey Budankov
  2018-08-28  8:50   ` Jiri Olsa
                     ` (3 more replies)
  2018-08-28  8:59 ` [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Jiri Olsa
  2 siblings, 4 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-27 18:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


Trace file offsets are calculated linearly by the perf_mmap__push()
code for every next write operation, but the file position is updated
by the kernel only in the second lseek() syscall after the loop.
The first lseek() syscall reads the current file position for
the following loop iterations.

record__mmap_read_sync() implements a sort of barrier for spilling
ready profiling data to disk.
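
For anyone unfamiliar with the POSIX AIO calls used here, a small
standalone example (not perf code; file name and payload are made up)
showing why offsets have to be tracked in user space and the file
position updated with a single lseek() afterwards:

  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
  	char first[] = "first chunk\n", second[] = "second chunk\n";
  	char *chunks[] = { first, second };
  	struct aiocb cb[2];
  	int i, fd = open("demo.data", O_CREAT | O_TRUNC | O_WRONLY, 0644);
  	off_t off = lseek(fd, 0, SEEK_CUR);	/* read the position once */

  	for (i = 0; i < 2; i++) {
  		memset(&cb[i], 0, sizeof(cb[i]));
  		cb[i].aio_fildes = fd;
  		cb[i].aio_buf    = chunks[i];
  		cb[i].aio_nbytes = strlen(chunks[i]);
  		cb[i].aio_offset = off;		/* linearly calculated offset */
  		cb[i].aio_sigevent.sigev_notify = SIGEV_NONE;
  		aio_write(&cb[i]);		/* does not move the file position */
  		off += cb[i].aio_nbytes;
  	}

  	for (i = 0; i < 2; i++)
  		while (aio_error(&cb[i]) == EINPROGRESS)
  			;			/* busy-wait, fine for a demo */

  	lseek(fd, off, SEEK_SET);		/* update the position once */
  	close(fd);
  	return 0;
  }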

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v3:
- added comments about the nanosleep(0.5ms) call prior to aio_suspend()
  to cope with the intrusiveness of its implementation in glibc;
- added comments about the rationale behind copying profiling data
  into the mmap->data buffer;
---
 tools/perf/builtin-record.c | 125 +++++++++++++++++++++++++++++++++++++++++---
 tools/perf/util/mmap.c      |  36 ++++++++-----
 tools/perf/util/mmap.h      |   2 +-
 3 files changed, 143 insertions(+), 20 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..4ac61399a09a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -53,6 +53,7 @@
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <linux/time64.h>
+#include <aio.h>
 
 struct switch_output {
 	bool		 enabled;
@@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int record__aio_write(int trace_fd, struct aiocb *cblock,
+		void *buf, size_t size, off_t off)
+{
+	cblock->aio_fildes = trace_fd;
+	cblock->aio_buf    = buf;
+	cblock->aio_nbytes = size;
+	cblock->aio_offset = off;
+	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+	if (aio_write(cblock) == -1) {
+		pr_err("failed to queue perf data, error: %m\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
@@ -130,12 +148,14 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
-static int record__pushfn(void *to, void *bf, size_t size)
+static int record__pushfn(void *to, void *bf, size_t size, off_t off)
 {
 	struct record *rec = to;
+	struct perf_mmap *map = bf;
 
 	rec->samples++;
-	return record__write(rec, bf, size);
+	return record__aio_write(rec->session->data->file.fd, &map->cblock,
+			map->data, size, off);
 }
 
 static volatile int done;
@@ -510,13 +530,97 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
+		int cblocks_size, struct record *rec)
+{
+	size_t rem;
+	ssize_t size;
+	off_t rem_off;
+	int i, aio_ret, aio_errno, do_suspend;
+	struct perf_mmap *md;
+	struct timespec timeout0 = { 0, 0 };
+	struct timespec timeoutS = { 0, 1000 * 500  * 1 }; // 0.5ms
+
+	if (!cblocks_size)
+		return 0;
+
+	do {
+		do_suspend = 0;
+		/* The aio_suspend() implementation inside glibc (as of v2.27)
+		 * is intrusive: it does not just block waiting for io request
+		 * completion but polls the request queue, inducing context
+		 * switches in the perf tool process. When profiling in system
+		 * wide mode with context switch tracing, the trace may be
+		 * polluted by context switches from the perf process, and the
+		 * trace size becomes about 3-5 times bigger than when writing
+		 * the trace serially. To limit the volume of context switches
+		 * from the perf tool process, the nanosleep() call below is
+		 * added prior to the aio_suspend() call, sleeping for half of
+		 * the kernel timer tick, which is usually 1ms (depends on the
+		 * CONFIG_HZ value).
+		 */
+		nanosleep(&timeoutS, NULL);
+		if (aio_suspend((const struct aiocb**)cblocks, cblocks_size, &timeout0)) {
+			if (errno == EAGAIN || errno == EINTR) {
+				do_suspend = 1;
+				continue;
+			} else {
+				pr_err("failed to sync perf data, error: %m\n");
+				break;
+			}
+		}
+		for (i = 0; i < cblocks_size; i++) {
+			if (cblocks[i] == NULL) {
+				continue;
+			}
+			aio_errno = aio_error(cblocks[i]);
+			if (aio_errno == EINPROGRESS) {
+				do_suspend = 1;
+				continue;
+			}
+			size = aio_ret = aio_return((struct aiocb*)cblocks[i]);
+			if (aio_ret < 0) {
+				if (aio_errno == EINTR) {
+					size = 0;
+				} else {
+					pr_err("failed to write perf data, error: %m\n");
+					cblocks[i] = NULL;
+					continue;
+				}
+			}
+			rec->bytes_written += size;
+			md = (struct perf_mmap*)((char*)cblocks[i] -
+				offsetof(struct perf_mmap, cblock));
+			rem = cblocks[i]->aio_nbytes - (size_t)size;
+			if (rem == 0) {
+				perf_mmap__put(md);
+				cblocks[i] = NULL;
+				if (switch_output_size(rec))
+					trigger_hit(&switch_output_trigger);
+			} else {
+				rem_off = cblocks[i]->aio_offset +
+					cblocks[i]->aio_nbytes - rem;
+				if (!record__aio_write(trace_fd,
+					(struct aiocb *)cblocks[i],
+					md->data + cblocks[i]->aio_nbytes - rem,
+					rem, rem_off))
+					do_suspend = 1;
+			}
+		}
+	} while (do_suspend);
+
+	return 0;
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
 	u64 bytes_written = rec->bytes_written;
-	int i;
-	int rc = 0;
+	int i, rc = 0;
 	struct perf_mmap *maps;
+	int trace_fd = rec->session->data->file.fd;
+	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
+	int mmap_aio_size = 0;
+	off_t off;
 
 	if (!evlist)
 		return 0;
@@ -528,14 +632,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
 		return 0;
 
+	off = lseek(trace_fd, 0, SEEK_CUR);
+
 	for (i = 0; i < evlist->nr_mmaps; i++) {
 		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
 
 		if (maps[i].base) {
-			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
-				rc = -1;
+			rc = perf_mmap__push(&maps[i], rec, record__pushfn, &off);
+			if (rc < 0)
 				goto out;
-			}
+			else if (rc > 0)
+				mmap_aio[mmap_aio_size++] = &maps[i].cblock;
 		}
 
 		if (mm->base && !rec->opts.auxtrace_snapshot_mode &&
@@ -545,6 +652,10 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		}
 	}
 
+	record__mmap_read_sync(trace_fd, mmap_aio, mmap_aio_size, rec);
+
+	lseek(trace_fd, off, SEEK_SET);
+
 	/*
 	 * Mark the round finished in case we wrote
 	 * at least one event.
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index e71d46cb01cc..d6eb1f17bead 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -292,11 +292,11 @@ int perf_mmap__read_init(struct perf_mmap *map)
 }
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size))
+		    int push(void *to, void *buf, size_t size, off_t), off_t *off)
 {
 	u64 head = perf_mmap__read_head(md);
 	unsigned char *data = md->base + page_size;
-	unsigned long size;
+	unsigned long size, size0 = 0;
 	void *buf;
 	int rc = 0;
 
@@ -306,25 +306,37 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 
 	size = md->end - md->start;
 
+	/* Data is copied into the md->data buffer to release space in
+	 * the kernel buffer as fast as possible, through perf_mmap__consume().
+	 * That lets the kernel proceed with storing more profiling
+	 * data into the kernel buffer earlier than the other per-cpu kernel
+	 * buffers are handled.
+	 *
+	 * Copying can be done in two steps in case the chunk of profiling
+	 * data is split by the upper bound of the kernel buffer. In this
+	 * case we first move the upper part of the data from md->start to
+	 * the upper bound, and then the remainder from the beginning of
+	 * the kernel buffer to the end of the data chunk.
+	 */
+
 	if ((md->start & md->mask) + size != (md->end & md->mask)) {
 		buf = &data[md->start & md->mask];
-		size = md->mask + 1 - (md->start & md->mask);
-		md->start += size;
-
-		if (push(to, buf, size) < 0) {
-			rc = -1;
-			goto out;
-		}
+		size0 = md->mask + 1 - (md->start & md->mask);
+		md->start += size0;
+		memcpy(md->data, buf, size0);
 	}
 
 	buf = &data[md->start & md->mask];
 	size = md->end - md->start;
 	md->start += size;
+	memcpy(md->data + size0, buf, size);
 
-	if (push(to, buf, size) < 0) {
-		rc = -1;
+	rc = push(to, md, size0 + size, *off) < 0 ? -1 : 1;
+	if (rc == -1)
 		goto out;
-	}
+
+	perf_mmap__get(md);
+	*off += size0 + size;
 
 	md->prev = head;
 	perf_mmap__consume(md);
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 1974e621e36b..6211a3a0c4c3 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -95,7 +95,7 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
 union perf_event *perf_mmap__read_event(struct perf_mmap *map);
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size));
+		    int push(void *to, void *buf, size_t size, off_t off), off_t *off);
 
 size_t perf_mmap__mmap_len(struct perf_mmap *map);
 


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
@ 2018-08-28  8:50   ` Jiri Olsa
  2018-08-28  9:39     ` Alexey Budankov
  2018-08-28  8:53   ` Jiri Olsa
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28  8:50 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
> 
> Trace file offsets are calculated linearly by the perf_mmap__push()
> code for every next write operation, but the file position is updated
> by the kernel only in the second lseek() syscall after the loop.
> The first lseek() syscall reads the current file position for
> the following loop iterations.
> 
> record__mmap_read_sync() implements a sort of barrier for spilling
> ready profiling data to disk.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
> Changes in v3:
> - added comments about the nanosleep(0.5ms) call prior to aio_suspend()
>   to cope with the intrusiveness of its implementation in glibc;
> - added comments about the rationale behind copying profiling data
>   into the mmap->data buffer;
> ---
>  tools/perf/builtin-record.c | 125 +++++++++++++++++++++++++++++++++++++++++---
>  tools/perf/util/mmap.c      |  36 ++++++++-----
>  tools/perf/util/mmap.h      |   2 +-
>  3 files changed, 143 insertions(+), 20 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 22ebeb92ac51..4ac61399a09a 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -53,6 +53,7 @@
>  #include <sys/mman.h>
>  #include <sys/wait.h>
>  #include <linux/time64.h>
> +#include <aio.h>
>  
>  struct switch_output {
>  	bool		 enabled;
> @@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
>  	return 0;
>  }
>  
> +static int record__aio_write(int trace_fd, struct aiocb *cblock,
> +		void *buf, size_t size, off_t off)
> +{
> +	cblock->aio_fildes = trace_fd;
> +	cblock->aio_buf    = buf;
> +	cblock->aio_nbytes = size;
> +	cblock->aio_offset = off;
> +	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
> +
> +	if (aio_write(cblock) == -1) {
> +		pr_err("failed to queue perf data, error: %m\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
>  static int process_synthesized_event(struct perf_tool *tool,
>  				     union perf_event *event,
>  				     struct perf_sample *sample __maybe_unused,
> @@ -130,12 +148,14 @@ static int process_synthesized_event(struct perf_tool *tool,
>  	return record__write(rec, event, event->header.size);
>  }
>  
> -static int record__pushfn(void *to, void *bf, size_t size)
> +static int record__pushfn(void *to, void *bf, size_t size, off_t off)
>  {
>  	struct record *rec = to;
> +	struct perf_mmap *map = bf;

the argument needs to change for record__pushfn,
now with your changes, it's no longer 'void *bf',
but 'struct perf_mmap *map'

also I'm a little confused why we have '*to' and cast
it back to 'struct record', but so be it ;-)

thanks,
jirka


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  2018-08-28  8:50   ` Jiri Olsa
@ 2018-08-28  8:53   ` Jiri Olsa
  2018-08-28 10:28     ` Alexey Budankov
  2018-08-28  8:57   ` Jiri Olsa
  2018-08-28  9:01   ` Jiri Olsa
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28  8:53 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:

SNIP

>  	if ((md->start & md->mask) + size != (md->end & md->mask)) {
>  		buf = &data[md->start & md->mask];
> -		size = md->mask + 1 - (md->start & md->mask);
> -		md->start += size;
> -
> -		if (push(to, buf, size) < 0) {
> -			rc = -1;
> -			goto out;
> -		}
> +		size0 = md->mask + 1 - (md->start & md->mask);
> +		md->start += size0;
> +		memcpy(md->data, buf, size0);
>  	}
>  
>  	buf = &data[md->start & md->mask];
>  	size = md->end - md->start;
>  	md->start += size;
> +	memcpy(md->data + size0, buf, size);
>  
> -	if (push(to, buf, size) < 0) {
> -		rc = -1;
> +	rc = push(to, md, size0 + size, *off) < 0 ? -1 : 1;
> +	if (rc == -1)
>  		goto out;
> -	}
> +
> +	perf_mmap__get(md);
> +	*off += size0 + size;

this get is for the perf_mmap pointer storage in the mmap_aio array right?

I see it's released in record__mmap_read_sync, which might also return
without releasing it.. this needs to be fixed and explained in here,
why we take the reference in the first place

thanks,
jirka


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  2018-08-28  8:50   ` Jiri Olsa
  2018-08-28  8:53   ` Jiri Olsa
@ 2018-08-28  8:57   ` Jiri Olsa
  2018-08-28 11:31     ` Alexey Budankov
  2018-08-28  9:01   ` Jiri Olsa
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28  8:57 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:

SNIP

> +	int trace_fd = rec->session->data->file.fd;
> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
> +	int mmap_aio_size = 0;
> +	off_t off;
>  
>  	if (!evlist)
>  		return 0;
> @@ -528,14 +632,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>  		return 0;
>  
> +	off = lseek(trace_fd, 0, SEEK_CUR);
> +
>  	for (i = 0; i < evlist->nr_mmaps; i++) {
>  		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
>  
>  		if (maps[i].base) {
> -			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
> -				rc = -1;
> +			rc = perf_mmap__push(&maps[i], rec, record__pushfn, &off);
> +			if (rc < 0)
>  				goto out;
> -			}
> +			else if (rc > 0)
> +				mmap_aio[mmap_aio_size++] = &maps[i].cblock;

I understand the purpose of the mmap_aio array, but I don't see a reason
to fill it in every time we call record__mmap_read_evlist

the way I see it, when 'pushing the data' it's either all or nothing,

if there's an error in pushing one map, we bail out completely..
so the mmap_aio array could be preallocated (it is now) and
pre-filled with cblock pointers

that would probably ease up the reference counting I mentioned
in the previous email

thanks,
jirka


* Re: [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads
       [not found] <bdd669cd-a6dc-d592-4655-33b7bfc46cbb@linux.intel.com>
  2018-08-27 18:16 ` [PATCH v3 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
@ 2018-08-28  8:59 ` Jiri Olsa
  2018-08-28 11:58   ` Alexey Budankov
  2018-08-28 14:08   ` Alexey Budankov
  2 siblings, 2 replies; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28  8:59 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Aug 27, 2018 at 08:03:21PM +0300, Alexey Budankov wrote:
> 
> Currently in record mode the tool implements trace writing serially.
> The algorithm loops over the mapped per-cpu data buffers and stores
> ready data chunks into a trace file using the write() system call.
> 
> Under some circumstances the kernel may lack free space in a buffer
> because the other half of the buffer has not yet been written to disk
> while the tool is busy writing another buffer's data at the moment.
> 
> Thus the serial trace writing implementation may cause the kernel
> to lose profiling data, and that is what is observed when profiling
> highly parallel CPU bound workloads on machines with a large number
> of cores.
> 
> An experiment profiling matrix multiplication code executing 128
> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
> demonstrates a data loss metric value of 98%:
> 
> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>     matrix.gcc
> 
> The data loss metric is the ratio lost_time/elapsed_time, where
> lost_time is the sum of time intervals containing PERF_RECORD_LOST
> records and elapsed_time is the elapsed application run time
> under profiling.

I like the idea and I think it's a good direction to go, but could
you please share some numbers from perf stat or whatever you used to
measure the new performance?

thanks,
jirka


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 18:16 ` [PATCH v3 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
                     ` (2 preceding siblings ...)
  2018-08-28  8:57   ` Jiri Olsa
@ 2018-08-28  9:01   ` Jiri Olsa
  2018-08-28 12:37     ` Alexey Budankov
  3 siblings, 1 reply; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28  9:01 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:

SNIP

> +static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
> +		int cblocks_size, struct record *rec)
> +{
> +	size_t rem;
> +	ssize_t size;
> +	off_t rem_off;
> +	int i, aio_ret, aio_errno, do_suspend;
> +	struct perf_mmap *md;
> +	struct timespec timeout0 = { 0, 0 };
> +	struct timespec timeoutS = { 0, 1000 * 500  * 1 }; // 0.5ms
> +
> +	if (!cblocks_size)
> +		return 0;
> +
> +	do {
> +		do_suspend = 0;
> +		/* The aio_suspend() implementation inside glibc (as of v2.27)
> +		 * is intrusive: it does not just block waiting for io request
> +		 * completion but polls the request queue, inducing context
> +		 * switches in the perf tool process. When profiling in system
> +		 * wide mode with context switch tracing, the trace may be
> +		 * polluted by context switches from the perf process, and the
> +		 * trace size becomes about 3-5 times bigger than when writing
> +		 * the trace serially. To limit the volume of context switches
> +		 * from the perf tool process, the nanosleep() call below is
> +		 * added prior to the aio_suspend() call, sleeping for half of
> +		 * the kernel timer tick, which is usually 1ms (depends on the
> +		 * CONFIG_HZ value).

nice, any idea whether this is intentional for some reason?

was there some impact on the overall performance, or
is this purely for the sanity of the trace size?

thanks,
jirka


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-28  8:50   ` Jiri Olsa
@ 2018-08-28  9:39     ` Alexey Budankov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28  9:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 28.08.2018 11:50, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
>>
>> Trace file offsets are calculated linearly by the perf_mmap__push()
>> code for every next write operation, but the file position is updated
>> by the kernel only in the second lseek() syscall after the loop.
>> The first lseek() syscall reads the current file position for
>> the following loop iterations.
>>
>> record__mmap_read_sync() implements a sort of barrier for spilling
>> ready profiling data to disk.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>> Changes in v3:
>> - added comments about the nanosleep(0.5ms) call prior to aio_suspend()
>>   to cope with the intrusiveness of its implementation in glibc;
>> - added comments about the rationale behind copying profiling data
>>   into the mmap->data buffer;
>> ---
>>  tools/perf/builtin-record.c | 125 +++++++++++++++++++++++++++++++++++++++++---
>>  tools/perf/util/mmap.c      |  36 ++++++++-----
>>  tools/perf/util/mmap.h      |   2 +-
>>  3 files changed, 143 insertions(+), 20 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 22ebeb92ac51..4ac61399a09a 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -53,6 +53,7 @@
>>  #include <sys/mman.h>
>>  #include <sys/wait.h>
>>  #include <linux/time64.h>
>> +#include <aio.h>
>>  
>>  struct switch_output {
>>  	bool		 enabled;
>> @@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
>>  	return 0;
>>  }
>>  
>> +static int record__aio_write(int trace_fd, struct aiocb *cblock,
>> +		void *buf, size_t size, off_t off)
>> +{
>> +	cblock->aio_fildes = trace_fd;
>> +	cblock->aio_buf    = buf;
>> +	cblock->aio_nbytes = size;
>> +	cblock->aio_offset = off;
>> +	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
>> +
>> +	if (aio_write(cblock) == -1) {
>> +		pr_err("failed to queue perf data, error: %m\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  static int process_synthesized_event(struct perf_tool *tool,
>>  				     union perf_event *event,
>>  				     struct perf_sample *sample __maybe_unused,
>> @@ -130,12 +148,14 @@ static int process_synthesized_event(struct perf_tool *tool,
>>  	return record__write(rec, event, event->header.size);
>>  }
>>  
>> -static int record__pushfn(void *to, void *bf, size_t size)
>> +static int record__pushfn(void *to, void *bf, size_t size, off_t off)
>>  {
>>  	struct record *rec = to;
>> +	struct perf_mmap *map = bf;
> 
> the argument needs to change for record__pushfn,
> now with your changes, it's no longer 'void *bf',
> but 'struct perf_mmap *map'

Ok. Included in [PATCH v4 2/2].

> 
> also I'm little confused why we have '*to' and cast
> it back to 'struct record', but so be it ;-)

Supported. :)

> 
> thanks,
> jirka
> 


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-28  8:53   ` Jiri Olsa
@ 2018-08-28 10:28     ` Alexey Budankov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 10:28 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 28.08.2018 11:53, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>>  	if ((md->start & md->mask) + size != (md->end & md->mask)) {
>>  		buf = &data[md->start & md->mask];
>> -		size = md->mask + 1 - (md->start & md->mask);
>> -		md->start += size;
>> -
>> -		if (push(to, buf, size) < 0) {
>> -			rc = -1;
>> -			goto out;
>> -		}
>> +		size0 = md->mask + 1 - (md->start & md->mask);
>> +		md->start += size0;
>> +		memcpy(md->data, buf, size0);
>>  	}
>>  
>>  	buf = &data[md->start & md->mask];
>>  	size = md->end - md->start;
>>  	md->start += size;
>> +	memcpy(md->data + size0, buf, size);
>>  
>> -	if (push(to, buf, size) < 0) {
>> -		rc = -1;
>> +	rc = push(to, md, size0 + size, *off) < 0 ? -1 : 1;
>> +	if (rc == -1)
>>  		goto out;
>> -	}
>> +
>> +	perf_mmap__get(md);
>> +	*off += size0 + size;
> 
> this get is for the perf_mmap pointer storage in the mmap_aio array right?

Right. perf_mmap__get() here guards the whole perf_mmap object and the memory
referenced through it from premature deallocation, because the mmap->base kernel
buffer can be released earlier than the aio requests started on mmap->data
complete, and both buffers are referenced through the same perf_mmap object.

> 
> I see it's released in record__mmap_read_sync, which might also return
> without releasing it.. this needs to be fixed and explained in here,
> why we take the reference in the first place

So we increment after a successful push() from map->base to map->data
with the following aio_write(), and decrement when the aio_write() is fully
complete, because it may require a restart if the kernel didn't write
the whole chunk at once.
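
Not the v4 change itself, just a tiny standalone sketch of that
restart-on-short-write idea with the same POSIX AIO calls (file name
and payload are made up):

  #include <aio.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  static int write_all(int fd, char *buf, size_t size, off_t off)
  {
  	struct aiocb cb;
  	ssize_t written;

  	while (size) {
  		memset(&cb, 0, sizeof(cb));
  		cb.aio_fildes = fd;
  		cb.aio_buf    = buf;
  		cb.aio_nbytes = size;
  		cb.aio_offset = off;
  		cb.aio_sigevent.sigev_notify = SIGEV_NONE;
  		if (aio_write(&cb) == -1)
  			return -1;
  		while (aio_error(&cb) == EINPROGRESS)
  			;			/* busy-wait, demo only */
  		written = aio_return(&cb);
  		if (written <= 0)
  			return -1;
  		buf  += written;		/* restart with the remainder */
  		off  += written;
  		size -= written;
  	}
  	return 0;			/* only here is the write fully complete */
  }

  int main(void)
  {
  	char payload[] = "sample payload\n";
  	int fd = open("demo.data", O_CREAT | O_TRUNC | O_WRONLY, 0644);
  	int ret = write_all(fd, payload, sizeof(payload) - 1, 0);

  	close(fd);
  	return ret ? 1 : 0;
  }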

Probably we are still missing one more perf_mmap__put() after:
	pr_err("failed to write perf data, error: %m\n"); 
prior to nullifying the appropriate cblock.

Updated [PATCH v4 2/2]:
- added a comment in perf_mmap__push() just before perf_mmap__get();
- added a comment in record__mmap_read_sync() on the possible restarting
  of the aio_write() operation and releasing the perf_mmap object at the end;
- added perf_mmap__put() for the cases of a failed aio_write();

Thanks!

> 
> thanks,
> jirka
> 


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-28  8:57   ` Jiri Olsa
@ 2018-08-28 11:31     ` Alexey Budankov
  2018-08-28 12:21       ` Jiri Olsa
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 11:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 28.08.2018 11:57, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> +	int trace_fd = rec->session->data->file.fd;
>> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
>> +	int mmap_aio_size = 0;
>> +	off_t off;
>>  
>>  	if (!evlist)
>>  		return 0;
>> @@ -528,14 +632,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>>  		return 0;
>>  
>> +	off = lseek(trace_fd, 0, SEEK_CUR);
>> +
>>  	for (i = 0; i < evlist->nr_mmaps; i++) {
>>  		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
>>  
>>  		if (maps[i].base) {
>> -			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
>> -				rc = -1;
>> +			rc = perf_mmap__push(&maps[i], rec, record__pushfn, &off);
>> +			if (rc < 0)
>>  				goto out;
>> -			}
>> +			else if (rc > 0)
>> +				mmap_aio[mmap_aio_size++] = &maps[i].cblock;
> 
> I understand the purpose of mmap_aio array, but I don't see a reason
> to fill it in every time we call record__mmap_read_evlist

The loop iterates over the kernel buffers the same number of times
for every call of record__mmap_read_evlist(). The called perf_mmap__push()
checks whether there is data ready to spill in the corresponding buffer,
and if there is no such data it returns 0. So every time we execute
the loop we get a different set of buffers to spill, and under these
circumstances dynamic filling of mmap_aio looks preferable.
Lifetime management of the perf_mmap object and the memory it references
is a separate, unrelated matter.
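
In other words, roughly (a sketch only, not the patch code; the stub
names are made up):

  #include <aio.h>
  #include <stddef.h>

  #define NR_BUFS 8

  struct buf_stub {
  	size_t		pending;	/* bytes ready in this round, 0 if none */
  	struct aiocb	cblock;
  };

  /* collect control blocks of just the buffers that had data this
   * round, so the later wait covers only those requests */
  static int collect_pending(struct buf_stub *bufs, struct aiocb **list)
  {
  	int i, n = 0;

  	for (i = 0; i < NR_BUFS; i++)
  		if (bufs[i].pending)	/* stands in for perf_mmap__push() > 0 */
  			list[n++] = &bufs[i].cblock;
  	return n;
  }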

> 
> the way I see it, when 'pushing the data' it's either all or nothing,
> 
> if there's an error in pushing one map, we bail out completely..
> so the mmap_aio array could be preallocated (it is now) and
> pre-filled with cblock pointers
> 
> that would probably ease up the reference counting I mentioned
> in the previous email
> 
> thanks,
> jirka
> 


* Re: [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-08-28  8:59 ` [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Jiri Olsa
@ 2018-08-28 11:58   ` Alexey Budankov
  2018-08-28 14:17     ` Alexey Budankov
  2018-08-28 14:08   ` Alexey Budankov
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 11:58 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
	Andi Kleen, linux-kernel

Hi Andi,

On 28.08.2018 11:59, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 08:03:21PM +0300, Alexey Budankov wrote:
>>
>> Currently in record mode the tool implements trace writing serially.
>> The algorithm loops over the mapped per-cpu data buffers and stores
>> ready data chunks into a trace file using the write() system call.
>>
>> Under some circumstances the kernel may lack free space in a buffer
>> because the other half of the buffer has not yet been written to disk
>> while the tool is busy writing another buffer's data at the moment.
>>
>> Thus the serial trace writing implementation may cause the kernel
>> to lose profiling data, and that is what is observed when profiling
>> highly parallel CPU bound workloads on machines with a large number
>> of cores.
>>
>> An experiment profiling matrix multiplication code executing 128
>> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
>> demonstrates a data loss metric value of 98%:
>>
>> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>>     matrix.gcc
>>
>> The data loss metric is the ratio lost_time/elapsed_time, where
>> lost_time is the sum of time intervals containing PERF_RECORD_LOST
>> records and elapsed_time is the elapsed application run time
>> under profiling.
> 
> I like the idea and I think it's a good direction to go, but could
> you please share some numbers from perf stat or whatever you used to
> measure the new performance?

Is it ok to share the VTune GUI screenshots I sent you last time
to demonstrate the advantage of AIO trace streaming?

Thanks,
Alexey


> 
> thanks,
> jirka
> 


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-28 11:31     ` Alexey Budankov
@ 2018-08-28 12:21       ` Jiri Olsa
  0 siblings, 0 replies; 15+ messages in thread
From: Jiri Olsa @ 2018-08-28 12:21 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Tue, Aug 28, 2018 at 02:31:04PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 28.08.2018 11:57, Jiri Olsa wrote:
> > On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
> > 
> > SNIP
> > 
> >> +	int trace_fd = rec->session->data->file.fd;
> >> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
> >> +	int mmap_aio_size = 0;
> >> +	off_t off;
> >>  
> >>  	if (!evlist)
> >>  		return 0;
> >> @@ -528,14 +632,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> >>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
> >>  		return 0;
> >>  
> >> +	off = lseek(trace_fd, 0, SEEK_CUR);
> >> +
> >>  	for (i = 0; i < evlist->nr_mmaps; i++) {
> >>  		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
> >>  
> >>  		if (maps[i].base) {
> >> -			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
> >> -				rc = -1;
> >> +			rc = perf_mmap__push(&maps[i], rec, record__pushfn, &off);
> >> +			if (rc < 0)
> >>  				goto out;
> >> -			}
> >> +			else if (rc > 0)
> >> +				mmap_aio[mmap_aio_size++] = &maps[i].cblock;
> > 
> > I understand the purpose of mmap_aio array, but I don't see a reason
> > to fill it in every time we call record__mmap_read_evlist
> 
> The loop iterates over the kernel buffers the same number of times
> for every call of record__mmap_read_evlist(). The called perf_mmap__push()
> checks whether there is data ready to spill in the corresponding buffer,
> and if there is no such data it returns 0. So every time we execute
> the loop we get a different set of buffers to spill, and under these
> circumstances dynamic filling of mmap_aio looks preferable.
> Lifetime management of the perf_mmap object and the memory it references
> is a separate, unrelated matter.

argh.. the perf_mmap__push returns 0 for empty buffer, ok

jirka


* Re: [PATCH v3 2/2]: perf record: enable asynchronous trace writing
  2018-08-28  9:01   ` Jiri Olsa
@ 2018-08-28 12:37     ` Alexey Budankov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 12:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 28.08.2018 12:01, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 09:16:55PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> +static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
>> +		int cblocks_size, struct record *rec)
>> +{
>> +	size_t rem;
>> +	ssize_t size;
>> +	off_t rem_off;
>> +	int i, aio_ret, aio_errno, do_suspend;
>> +	struct perf_mmap *md;
>> +	struct timespec timeout0 = { 0, 0 };
>> +	struct timespec timeoutS = { 0, 1000 * 500  * 1 }; // 0.5ms
>> +
>> +	if (!cblocks_size)
>> +		return 0;
>> +
>> +	do {
>> +		do_suspend = 0;
>> +		/* The aio_suspend() implementation inside glibc (as of v2.27)
>> +		 * is intrusive: it does not just block waiting for io request
>> +		 * completion but polls the request queue, inducing context
>> +		 * switches in the perf tool process. When profiling in system
>> +		 * wide mode with context switch tracing, the trace may be
>> +		 * polluted by context switches from the perf process, and the
>> +		 * trace size becomes about 3-5 times bigger than when writing
>> +		 * the trace serially. To limit the volume of context switches
>> +		 * from the perf tool process, the nanosleep() call below is
>> +		 * added prior to the aio_suspend() call, sleeping for half of
>> +		 * the kernel timer tick, which is usually 1ms (depends on the
>> +		 * CONFIG_HZ value).
> 
> nice, any idea whether this is intentional for some reason?
> 
> was there some impact on the overall performance, or
> is this purely for the sanity of the trace size?

Initially we noticed only the trace size increase. However, if you look at
the data using perf report, you see that the perf tool process gets a
considerable volume of new context switches, and that is caused by switching
to AIO streaming, specifically the frequent calls of aio_suspend().
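
For reference, a standalone sketch (not the patch code) of the wait
pattern that keeps those wakeups down: sleep for roughly half a kernel
tick, then probe the queued requests with a zero timeout instead of
letting aio_suspend() spin:

  #include <aio.h>
  #include <errno.h>
  #include <time.h>

  static void wait_for_aio(struct aiocb **list, int n)
  {
  	struct timespec half_tick = { 0, 500 * 1000 };	/* 0.5ms */
  	struct timespec zero      = { 0, 0 };
  	int i, pending;

  	do {
  		pending = 0;
  		nanosleep(&half_tick, NULL);		/* cheap, no polling */
  		if (aio_suspend((const struct aiocb * const *)list, n, &zero) &&
  		    (errno == EAGAIN || errno == EINTR)) {
  			pending = 1;
  			continue;
  		}
  		for (i = 0; i < n; i++)
  			if (list[i] && aio_error(list[i]) == EINPROGRESS)
  				pending = 1;
  	} while (pending);
  }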

> 
> thanks,
> jirka
> 


* Re: [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-08-28  8:59 ` [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Jiri Olsa
  2018-08-28 11:58   ` Alexey Budankov
@ 2018-08-28 14:08   ` Alexey Budankov
  1 sibling, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 14:08 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 28.08.2018 11:59, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 08:03:21PM +0300, Alexey Budankov wrote:
>>
>> Currently in record mode the tool implements trace writing serially.
>> The algorithm loops over the mapped per-cpu data buffers and stores
>> ready data chunks into a trace file using the write() system call.
>>
>> Under some circumstances the kernel may lack free space in a buffer
>> because the other half of the buffer has not yet been written to disk
>> while the tool is busy writing another buffer's data at the moment.
>>
>> Thus the serial trace writing implementation may cause the kernel
>> to lose profiling data, and that is what is observed when profiling
>> highly parallel CPU bound workloads on machines with a large number
>> of cores.
>>
>> An experiment profiling matrix multiplication code executing 128
>> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
>> demonstrates a data loss metric value of 98%:
>>
>> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>>     matrix.gcc
>>
>> The data loss metric is the ratio lost_time/elapsed_time, where
>> lost_time is the sum of time intervals containing PERF_RECORD_LOST
>> records and elapsed_time is the elapsed application run time
>> under profiling.
> 
> I like the idea and I think it's a good direction to go, but could
> you please share some numbers from perf stat or whatever you used to
> measure the new performance?

I am not sure it would be representative in perf stat data; however,
that could be visible there as well. I could share screenshots of
the VTune GUI that demonstrate the advantage of perf implementing AIO
streaming in comparison with the serial streaming. The data loss metric
can be easily understood from them. Is it ok to post them here?

> 
> thanks,
> jirka
> 


* Re: [PATCH v3 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-08-28 11:58   ` Alexey Budankov
@ 2018-08-28 14:17     ` Alexey Budankov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Budankov @ 2018-08-28 14:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim, linux-kernel

Hi,

On 28.08.2018 14:58, Alexey Budankov wrote:
> Hi Andi,
> 
> On 28.08.2018 11:59, Jiri Olsa wrote:
>> On Mon, Aug 27, 2018 at 08:03:21PM +0300, Alexey Budankov wrote:
>>>
>>> Currently in record mode the tool implements trace writing serially.
>>> The algorithm loops over the mapped per-cpu data buffers and stores
>>> ready data chunks into a trace file using the write() system call.
>>>
>>> Under some circumstances the kernel may lack free space in a buffer
>>> because the other half of the buffer has not yet been written to disk
>>> while the tool is busy writing another buffer's data at the moment.
>>>
>>> Thus the serial trace writing implementation may cause the kernel
>>> to lose profiling data, and that is what is observed when profiling
>>> highly parallel CPU bound workloads on machines with a large number
>>> of cores.
>>>
>>> An experiment profiling matrix multiplication code executing 128
>>> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
>>> demonstrates a data loss metric value of 98%:
>>>
>>> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>>>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>>>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>>>     matrix.gcc
>>>
>>> The data loss metric is the ratio lost_time/elapsed_time, where
>>> lost_time is the sum of time intervals containing PERF_RECORD_LOST
>>> records and elapsed_time is the elapsed application run time
>>> under profiling.
>>
>> I like the idea and I think it's a good direction to go, but could
>> you please share some numbers from perf stat or whatever you used to
>> measure the new performance?
> 
> Is it ok to share the VTune GUI screenshots I sent you last time
> to demonstrate the advantage of AIO trace streaming?

The VTune release manager permitted sharing it; well, sorry for bothering.

> 
> Thanks,
> Alexey
> 
> 
>>
>> thanks,
>> jirka
>>
> 

