* [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
@ 2018-09-07  7:07 Alexey Budankov
  2018-09-07  7:11 ` [PATCH v8 1/3]: perf util: map data buffer for preserving collected data Alexey Budankov
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-07  7:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


Currently in record mode the tool implements trace writing serially.
The algorithm loops over mapped per-cpu data buffers and stores
ready data chunks into a trace file using the write() system call.

Under some circumstances the kernel may lack free space in a buffer
because one half of that buffer has not yet been written to disk, as
the tool is busy writing out some other buffer's data at the moment.

Thus the serial trace writing implementation may cause the kernel
to lose profiling data, and that is what is observed when profiling
highly parallel CPU bound workloads on machines with a large number
of cores.

An experiment profiling matrix multiplication code executing 128
threads on an Intel Xeon Phi (KNM) with 272 cores, like the one below,
demonstrates a data loss metric value of 98%:

/usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
    --call-graph dwarf,1024 --user-regs=IP,SP,BP \
    --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
    matrix.gcc

The data loss metric is the ratio lost_time/elapsed_time, where
lost_time is the sum of time intervals containing PERF_RECORD_LOST
records and elapsed_time is the elapsed application run time
under profiling.
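
For example (illustrative numbers only): if the intervals covered by
PERF_RECORD_LOST records sum to 9.8s over a 10s profiled run, the
metric is 9.8s / 10s = 98%, matching the experiment above.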

Applying asynchronous trace streaming through the POSIX AIO API
(http://man7.org/linux/man-pages/man7/aio.7.html)
lowers the data loss metric value, reducing the observed 98% loss
to almost 0%.
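
For reference, below is a minimal sketch of the POSIX AIO write pattern
the series builds on: queue a write with aio_write() and later check its
state with aio_error()/aio_return(). It is only an illustration, not code
from the patches; the patches wait in aio_suspend() instead of busy-polling
and also handle partial writes:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Queue one asynchronous write at a fixed offset and wait for it. */
static int aio_push_sketch(int fd, void *buf, size_t size, off_t off)
{
	struct aiocb cb;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = buf;
	cb.aio_nbytes = size;
	cb.aio_offset = off;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;

	if (aio_write(&cb))			/* enqueue; returns immediately */
		return -1;

	while (aio_error(&cb) == EINPROGRESS)
		;				/* the patches use aio_suspend() here */

	return aio_return(&cb) == (ssize_t)size ? 0 : -1;
}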

---
 Alexey Budankov (3):
        perf util: map data buffer for preserving collected data
        perf record: enable asynchronous trace writing
        perf record: extend trace writing to multi AIO
 
 tools/perf/builtin-record.c | 166 ++++++++++++++++++++++++++++++++++++++++++--
 tools/perf/perf.h           |   1 +
 tools/perf/util/evlist.c    |   7 +-
 tools/perf/util/evlist.h    |   3 +-
 tools/perf/util/mmap.c      | 114 ++++++++++++++++++++++++++----
 tools/perf/util/mmap.h      |  11 ++-
 6 files changed, 277 insertions(+), 25 deletions(-)

---
 Changes in v8:
 - ran the whole thing through checkpatch.pl and corrected the issues found,
   except lines longer than 80 characters
 - corrected comments alignment and formatting
 - moved multi AIO implementation into 3rd patch in the series
 - implemented explicit cblocks array allocation
 - split AIO completion check into separate record__aio_complete()
 - set nr_cblocks default to 1 and max allowed value to 4
 Changes in v7:
 - implemented handling record.aio setting from perfconfig file
 Changes in v6:
 - adjusted setting of priorities for cblocks;
 - handled errno == EAGAIN case from aio_write() return;
 Changes in v5:
 - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=100000
 - data loss metrics decreased from 25% to 2x in trialed configuration;
 - reshaped layout of data structures;
 - implemented --aio option;
 - avoided nanosleep() prior to calling aio_suspend();
 - switched to per-cpu multi-buffer aio in record__aio_sync();
 - record_mmap_read_sync() now does global sync just before 
   switching trace file or collection stop;
 Changes in v4:
 - converted mmap()/munmap() to malloc()/free() for mmap->data buffer management
 - converted void *bf to struct perf_mmap *md in signatures
 - written comment in perf_mmap__push() just before perf_mmap__get();
 - written comment in record__mmap_read_sync() on possible restarting 
   of aio_write() operation and releasing perf_mmap object after all;
 - added perf_mmap__put() for the cases of failed aio_write();
 Changes in v3:
 - written comments about the nanosleep(0.5ms) call prior to aio_suspend()
   to cope with the intrusiveness of its implementation in glibc;
 - written comments about the rationale behind copying profiling data
   into the mmap->data buffer;
 Changes in v2:
 - converted zalloc() to calloc() for allocation of mmap_aio array,
 - fixed a typo and adjusted the fallback branch code;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v8 1/3]: perf util: map data buffer for preserving collected data
  2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
@ 2018-09-07  7:11 ` Alexey Budankov
  2018-09-07  7:19 ` [PATCH v8 2/3]: perf record: enable asynchronous trace writing Alexey Budankov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-07  7:11 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


The map->data buffer is used to preserve map->base profiling data
for writing to disk. The AIO control block map->cblock is used to queue
the corresponding map->data buffer for asynchronous writing.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 Changes in v7:
  - implemented handling record.aio setting from perfconfig file
 Changes in v6:
  - adjusted setting of priorities for cblocks;
 Changes in v5:
  - reshaped layout of data structures;
  - implemented --aio option;
 Changes in v4:
  - converted mmap()/munmap() to malloc()/free() for mmap->data buffer management 
 Changes in v2:
  - converted zalloc() to calloc() for allocation of mmap_aio array,
  - fixed a typo and adjusted the fallback branch code;
---
 tools/perf/util/mmap.c | 25 +++++++++++++++++++++++++
 tools/perf/util/mmap.h |  3 +++
 2 files changed, 28 insertions(+)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index fc832676a798..e53038d76445 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,8 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+	if (map->data)
+		zfree(&map->data);
 	if (map->base != NULL) {
 		munmap(map->base, perf_mmap__mmap_len(map));
 		map->base = NULL;
@@ -166,6 +168,7 @@ void perf_mmap__munmap(struct perf_mmap *map)
 
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 {
+	int delta_max;
 	/*
 	 * The last one will be done at perf_mmap__consume(), so that we
 	 * make sure we don't prevent tools from consuming every last event in
@@ -190,6 +193,28 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 		map->base = NULL;
 		return -1;
 	}
+	delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
+	map->data = malloc(perf_mmap__mmap_len(map));
+	if (!map->data) {
+		pr_debug2("failed to allocate data buffer, error %d\n",
+				errno);
+		return -1;
+	}
+	/*
+	 * Use cblock.aio_fildes value different from -1
+	 * to denote started aio write operation on the
+	 * cblock so it requires explicit record__aio_sync()
+	 * call prior the cblock may be reused again.
+	 */
+	map->cblock.aio_fildes = -1;
+	/*
+	 * Allocate cblock with max priority delta to
+	 * have faster aio_write() calls because queued
+	 * requests are kept in separate per-prio queues
+	 * and adding a new request iterates thru shorter
+	 * per-prio list.
+	 */
+	map->cblock.aio_reqprio = delta_max;
 	map->fd = fd;
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index d82294db1295..1974e621e36b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <asm/barrier.h>
 #include <stdbool.h>
+#include <aio.h>
 #include "auxtrace.h"
 #include "event.h"
 
@@ -25,6 +26,8 @@ struct perf_mmap {
 	bool		 overwrite;
 	struct auxtrace_mmap auxtrace_mmap;
 	char		 event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
+	void 		 *data;
+	struct aiocb	 cblock;
 };
 
 /*


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v8 2/3]: perf record: enable asynchronous trace writing
  2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
  2018-09-07  7:11 ` [PATCH v8 1/3]: perf util: map data buffer for preserving collected data Alexey Budankov
@ 2018-09-07  7:19 ` Alexey Budankov
  2018-09-07  7:39 ` [PATCH v8 3/3]: perf record: extend trace writing to multi AIO Alexey Budankov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-07  7:19 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


The trace file offset is calculated and updated linearly prior to
enqueuing an aio write at record__pushfn().

record__aio_sync() blocks until the started AIO operation completes
and then proceeds.

record__mmap_read_sync() implements a barrier for all incomplete
aio write requests.
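
In other words, each chunk's place in the trace file is reserved up front,
so aio writes that complete out of order cannot overlap in the file. A
simplified sketch of that reservation step follows (hypothetical helper;
the real code is record__pushfn() in the diff below, which additionally
updates accounting and the switch-output trigger):

static int reserve_and_queue(struct aiocb *cblock, int trace_fd,
			     void *data, size_t size)
{
	off_t off = lseek(trace_fd, 0, SEEK_CUR);  /* current logical end */

	lseek(trace_fd, off + size, SEEK_SET);     /* reserve [off, off + size) */
	if (record__aio_write(cblock, trace_fd, data, size, off)) {
		lseek(trace_fd, off, SEEK_SET);    /* roll back the reservation */
		return -1;
	}
	return 0;
}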

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 Changes in v8:
 -  split AIO completion check into separate record__aio_complete()
 Changes in v6:
 - handled errno == EAGAIN case from aio_write();
 Changes in v5:
 - data loss metrics decreased from 25% to 2x in trialed configuration;
 - avoided nanosleep() prior to calling aio_suspend();
 - switched to per-cpu multi-buffer aio in record__aio_sync();
 - record_mmap_read_sync() now does global barrier just before 
   switching trace file or collection stop;
 - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=100000
 Changes in v4:
 - converted void *bf to struct perf_mmap *md in signatures
 - written comment in perf_mmap__push() just before perf_mmap__get();
 - written comment in record__mmap_read_sync() on possible restarting 
   of aio_write() operation and releasing perf_mmap object after all;
 - added perf_mmap__put() for the cases of failed aio_write();
 Changes in v3:
 - written comments about the nanosleep(0.5ms) call prior to aio_suspend()
   to cope with the intrusiveness of its implementation in glibc;
 - written comments about the rationale behind copying profiling data
   into the mmap->data buffer;
---
 tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++++++++++++-
 tools/perf/util/mmap.c      |  54 ++++++++++++++-----
 tools/perf/util/mmap.h      |   2 +-
 3 files changed, 169 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..d4857572cf33 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,93 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+		void *buf, size_t size, off_t off)
+{
+	int rc;
+
+	cblock->aio_fildes = trace_fd;
+	cblock->aio_buf    = buf;
+	cblock->aio_nbytes = size;
+	cblock->aio_offset = off;
+	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+	do {
+		rc = aio_write(cblock);
+		if (rc == 0) {
+			break;
+		} else if (errno != EAGAIN) {
+			cblock->aio_fildes = -1;
+			pr_err("failed to queue perf data, error: %m\n");
+			break;
+		}
+	} while (1);
+
+	return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+	void *rem_buf;
+	off_t rem_off;
+	size_t rem_size;
+	int rc, aio_errno;
+	ssize_t aio_ret, written;
+
+	aio_errno = aio_error(cblock);
+	if (aio_errno == EINPROGRESS)
+		return 0;
+
+	written = aio_ret = aio_return(cblock);
+	if (aio_ret < 0) {
+		if (!(aio_errno == EINTR))
+			pr_err("failed to write perf data, error: %m\n");
+		written = 0;
+	}
+
+	rem_size = cblock->aio_nbytes - written;
+
+	if (rem_size == 0) {
+		cblock->aio_fildes = -1;
+		/*
+		 * md->refcount is incremented in perf_mmap__push() for
+		 * every enqueued aio write request so decrement it because
+		 * the request is now complete.
+		 */
+		perf_mmap__put(md);
+		rc = 1;
+	} else {
+		/*
+		 * aio write request may require restart with the
+		 * reminder if the kernel didn't write whole
+		 * chunk at once.
+		 */
+		rem_off = cblock->aio_offset + written;
+		rem_buf = (void *)(cblock->aio_buf + written);
+		record__aio_write(cblock, cblock->aio_fildes,
+				rem_buf, rem_size, rem_off);
+		rc = 0;
+	}
+
+	return rc;
+}
+
+static void record__aio_sync(struct perf_mmap *md)
+{
+	struct aiocb *cblock = &md->cblock;
+	struct timespec timeout = { 0, 1000 * 1000  * 1 }; // 1ms
+
+	do {
+		if (cblock->aio_fildes == -1 || record__aio_complete(md, cblock))
+			return;
+
+		while (aio_suspend((const struct aiocb**)&cblock, 1, &timeout)) {
+			if (!(errno == EAGAIN || errno == EINTR))
+				pr_err("failed to sync perf data, error: %m\n");
+		}
+	} while (1);
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
@@ -130,12 +217,27 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
-static int record__pushfn(void *to, void *bf, size_t size)
+static int record__pushfn(void *to, struct aiocb *cblock, void *data, size_t size)
 {
+	off_t off;
 	struct record *rec = to;
+	int ret, trace_fd = rec->session->data->file.fd;
 
 	rec->samples++;
-	return record__write(rec, bf, size);
+
+	off = lseek(trace_fd, 0, SEEK_CUR);
+	lseek(trace_fd, off + size, SEEK_SET);
+	ret = record__aio_write(cblock, trace_fd, data, size, off);
+	if (!ret) {
+		rec->bytes_written += size;
+
+		if (switch_output_size(rec))
+			trigger_hit(&switch_output_trigger);
+	} else {
+		lseek(trace_fd, off, SEEK_SET);
+	}
+
+	return ret;
 }
 
 static volatile int done;
@@ -510,6 +612,19 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static void record__mmap_read_sync(struct record *rec)
+{
+	int i;
+	struct perf_evlist *evlist = rec->evlist;
+	struct perf_mmap *maps = evlist->mmap;
+
+	for (i = 0; i < evlist->nr_mmaps; i++) {
+		struct perf_mmap *map = &maps[i];
+		if (map->base)
+			record__aio_sync(map);
+	}
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
@@ -532,6 +647,11 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
 
 		if (maps[i].base) {
+			/*
+			 * Call record__aio_sync() to wait till map->data buffer
+			 * becomes available after previous aio write request.
+			 */
+			record__aio_sync(&maps[i]);
 			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
 				rc = -1;
 				goto out;
@@ -1054,6 +1174,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
 
 		if (record__mmap_read_all(rec) < 0) {
+			record__mmap_read_sync(rec);
 			trigger_error(&auxtrace_snapshot_trigger);
 			trigger_error(&switch_output_trigger);
 			err = -1;
@@ -1065,6 +1186,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			if (!trigger_is_error(&auxtrace_snapshot_trigger))
 				record__read_auxtrace_snapshot(rec);
 			if (trigger_is_error(&auxtrace_snapshot_trigger)) {
+				record__mmap_read_sync(rec);
 				pr_err("AUX area tracing snapshot failed\n");
 				err = -1;
 				goto out_child;
@@ -1083,6 +1205,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			 */
 			if (rec->evlist->bkw_mmap_state == BKW_MMAP_RUNNING)
 				continue;
+			record__mmap_read_sync(rec);
 			trigger_ready(&switch_output_trigger);
 
 			/*
@@ -1136,6 +1259,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			disabled = true;
 		}
 	}
+	record__mmap_read_sync(rec);
 	trigger_off(&auxtrace_snapshot_trigger);
 	trigger_off(&switch_output_trigger);
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index e53038d76445..71c5628df3db 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -305,11 +305,11 @@ int perf_mmap__read_init(struct perf_mmap *map)
 }
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size))
+		    int push(void *to, struct aiocb *cblock, void *data, size_t size))
 {
 	u64 head = perf_mmap__read_head(md);
 	unsigned char *data = md->base + page_size;
-	unsigned long size;
+	unsigned long size, size0 = 0;
 	void *buf;
 	int rc = 0;
 
@@ -317,31 +317,61 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 	if (rc < 0)
 		return (rc == -EAGAIN) ? 0 : -1;
 
+	/*
+	 * md->base data is copied into md->data buffer to
+	 * release space in the kernel buffer as fast as possible,
+	 * thru perf_mmap__consume() below.
+	 *
+	 * That lets the kernel to proceed with storing more
+	 * profiling data into the kernel buffer earlier than other
+	 * per-cpu kernel buffers are handled.
+	 *
+	 * Coping can be done in two steps in case the chunk of
+	 * profiling data crosses the upper bound of the kernel buffer.
+	 * In this case we first move part of data from md->start
+	 * till the upper bound and then the reminder from the
+	 * beginning of the kernel buffer till the end of
+	 * the data chunk.
+	 */
+
 	size = md->end - md->start;
 
 	if ((md->start & md->mask) + size != (md->end & md->mask)) {
 		buf = &data[md->start & md->mask];
 		size = md->mask + 1 - (md->start & md->mask);
 		md->start += size;
-
-		if (push(to, buf, size) < 0) {
-			rc = -1;
-			goto out;
-		}
+		memcpy(md->data, buf, size);
+		size0 = size;
 	}
 
 	buf = &data[md->start & md->mask];
 	size = md->end - md->start;
 	md->start += size;
+	memcpy(md->data + size0, buf, size);
 
-	if (push(to, buf, size) < 0) {
-		rc = -1;
-		goto out;
-	}
+	/*
+	 * Increment md->refcount to guard md->data buffer
+	 * from premature deallocation because md object can be
+	 * released earlier than aio write request started
+	 * on mmap->data is complete.
+	 *
+	 * perf_mmap__put() is done at record__aio_complete()
+	 * after started request completion.
+	 */
+	perf_mmap__get(md);
 
 	md->prev = head;
 	perf_mmap__consume(md);
-out:
+
+	rc = push(to, &md->cblock, md->data, size0 + size);
+	if (rc) {
+		/*
+		 * Decrement md->refcount back if aio write
+		 * operation failed to start.
+		 */
+		perf_mmap__put(md);
+	}
+
 	return rc;
 }
 
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 1974e621e36b..a9795a5fe200 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -95,7 +95,7 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
 union perf_event *perf_mmap__read_event(struct perf_mmap *map);
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size));
+		    int push(void *to, struct aiocb *cblock, void *data, size_t size));
 
 size_t perf_mmap__mmap_len(struct perf_mmap *map);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v8 3/3]: perf record: extend trace writing to multi AIO
  2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
  2018-09-07  7:11 ` [PATCH v8 1/3]: perf util: map data buffer for preserving collected data Alexey Budankov
  2018-09-07  7:19 ` [PATCH v8 2/3]: perf record: enable asynchronous trace writing Alexey Budankov
@ 2018-09-07  7:39 ` Alexey Budankov
  2018-09-07  9:34 ` [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
  2018-09-10  9:18 ` Ingo Molnar
  4 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-07  7:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


Multi AIO trace writing allows caching more kernel data in userspace
memory, postponing trace writing for the sake of an overall profiling
data throughput increase. It can be seen as an extension of the kernel
data buffers into userspace memory.

With an aio-cblocks option value larger than 1 (the current default),
the tool is able to cache more and more data in user space while
delegating the spill to disk to AIO.

That avoids suspending at record__aio_sync() between calls of
record__mmap_read_evlist() and increases profiling data throughput at
the cost of userspace memory.
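
For instance (illustrative command lines, assuming only the --aio-cblocks
option and the record.aio-cblocks config key added by this patch), more AIO
data buffers per mmap can be requested on the command line:

/usr/bin/time perf record --aio-cblocks 4 -o /tmp/perf-aio.data -a -N -B -T -R -g \
    --call-graph dwarf,1024 --user-regs=IP,SP,BP \
    --switch-events -e cycles,instructions -- matrix.gcc

or persistently via perfconfig:

[record]
	aio-cblocks = 4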

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 55 +++++++++++++++++++++++-------
 tools/perf/perf.h           |  1 +
 tools/perf/util/evlist.c    |  7 ++--
 tools/perf/util/evlist.h    |  3 +-
 tools/perf/util/mmap.c      | 83 +++++++++++++++++++++++++++++++--------------
 tools/perf/util/mmap.h      | 10 +++---
 6 files changed, 114 insertions(+), 45 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index d4857572cf33..6361098a5898 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -192,16 +192,35 @@ static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
 	return rc;
 }
 
-static void record__aio_sync(struct perf_mmap *md)
+static int record__aio_sync(struct perf_mmap *md, bool sync_all)
 {
-	struct aiocb *cblock = &md->cblock;
+	struct aiocb **aiocb = md->aiocb;
+	struct aiocb *cblocks = md->cblocks;
 	struct timespec timeout = { 0, 1000 * 1000  * 1 }; // 1ms
+	int i, do_suspend;
 
 	do {
-		if (cblock->aio_fildes == -1 || record__aio_complete(md, cblock))
-			return;
+		do_suspend = 0;
+		for (i = 0; i < md->nr_cblocks; ++i) {
+			if (cblocks[i].aio_fildes == -1 || record__aio_complete(md, &cblocks[i])) {
+				if (sync_all)
+					aiocb[i] = NULL;
+				else
+					return i;
+			} else {
+				/*
+				 * Started aio write is not complete yet
+				 * so it has to be waited before the
+				 * next allocation.
+				 */
+				aiocb[i] = &cblocks[i];
+				do_suspend = 1;
+			}
+		}
+		if (!do_suspend)
+			return -1;
 
-		while (aio_suspend((const struct aiocb**)&cblock, 1, &timeout)) {
+		while (aio_suspend((const struct aiocb **)aiocb, md->nr_cblocks, &timeout)) {
 			if (!(errno == EAGAIN || errno == EINTR))
 				pr_err("failed to sync perf data, error: %m\n");
 		}
@@ -428,7 +447,8 @@ static int record__mmap_evlist(struct record *rec,
 
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
-				 opts->auxtrace_snapshot_mode) < 0) {
+				 opts->auxtrace_snapshot_mode,
+				 opts->nr_cblocks) < 0) {
 		if (errno == EPERM) {
 			pr_err("Permission error mapping pages.\n"
 			       "Consider increasing "
@@ -621,7 +641,7 @@ static void record__mmap_read_sync(struct record *rec)
 	for (i = 0; i < evlist->nr_mmaps; i++) {
 		struct perf_mmap *map = &maps[i];
 		if (map->base)
-			record__aio_sync(map);
+			record__aio_sync(map, true);
 	}
 }
 
@@ -629,7 +649,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 				    bool overwrite)
 {
 	u64 bytes_written = rec->bytes_written;
-	int i;
+	int i, idx;
 	int rc = 0;
 	struct perf_mmap *maps;
 
@@ -648,11 +668,12 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 
 		if (maps[i].base) {
 			/*
-			 * Call record__aio_sync() to wait till map->data buffer
-			 * becomes available after previous aio write request.
+			 * Call record__aio_sync() to get some free map->data
+			 * buffer or wait if all of previously started aio
+			 * writes are still incomplete.
 			 */
-			record__aio_sync(&maps[i]);
-			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
+			idx = record__aio_sync(&maps[i], false);
+			if (perf_mmap__push(&maps[i], rec, idx, record__pushfn) != 0) {
 				rc = -1;
 				goto out;
 			}
@@ -1411,6 +1432,8 @@ static int perf_record_config(const char *var, const char *value, void *cb)
 		var = "call-graph.record-mode";
 		return perf_default_config(var, value, cb);
 	}
+	if (!strcmp(var, "record.aio-cblocks"))
+		rec->opts.nr_cblocks = strtol(value, NULL, 0);
 
 	return 0;
 }
@@ -1643,6 +1666,7 @@ static struct record record = {
 			.default_per_cpu = true,
 		},
 		.proc_map_timeout     = 500,
+		.nr_cblocks	      = 1
 	},
 	.tool = {
 		.sample		= process_sample_event,
@@ -1802,6 +1826,8 @@ static struct option __record_options[] = {
 			  "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_INTEGER(0, "aio-cblocks", &record.opts.nr_cblocks,
+		    "Max number of simultaneous per-mmap AIO trace writes (min: 1, max: 4, default: 1)"),
 	OPT_END()
 };
 
@@ -1994,6 +2020,11 @@ int cmd_record(int argc, const char **argv)
 		goto out;
 	}
 
+	if (!(rec->opts.nr_cblocks >= 1 && rec->opts.nr_cblocks <= 4))
+		rec->opts.nr_cblocks = 1;
+	if (verbose > 0)
+		pr_info("AIO trace writes: %d\n", rec->opts.nr_cblocks);
+
 	err = __cmd_record(&record, argc, argv);
 out:
 	perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 21bf7f5a3cf5..0a1ae2ae567a 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -82,6 +82,7 @@ struct record_opts {
 	bool         use_clockid;
 	clockid_t    clockid;
 	unsigned int proc_map_timeout;
+	int	     nr_cblocks;
 };
 
 struct option;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e7a4b31a84fb..4be49dc67c0d 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1018,7 +1018,8 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
  */
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite)
+			 bool auxtrace_overwrite,
+			 int nr_cblocks)
 {
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
@@ -1028,7 +1029,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	 * Its value is decided by evsel's write_backward.
 	 * So &mp should not be passed through const pointer.
 	 */
-	struct mmap_params mp;
+	struct mmap_params mp = { .nr_cblocks = nr_cblocks };
 
 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1060,7 +1061,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
 {
-	return perf_evlist__mmap_ex(evlist, pages, 0, false);
+	return perf_evlist__mmap_ex(evlist, pages, 0, false, 1);
 }
 
 int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..a94d3c613254 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -162,7 +162,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
 
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite);
+			 bool auxtrace_overwrite,
+			 int nr_cblocks);
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
 void perf_evlist__munmap(struct perf_evlist *evlist);
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 71c5628df3db..1da3abb63028 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,16 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+	int i;
+
+	if (map->aiocb)
+		zfree(&map->aiocb);
+	if (map->cblocks)
+		zfree(&map->cblocks);
+	for (i = 0; i < map->nr_cblocks; ++i) {
+		if (map->data[i])
+			zfree(&map->data[i]);
+	}
 	if (map->data)
 		zfree(&map->data);
 	if (map->base != NULL) {
@@ -168,7 +178,7 @@ void perf_mmap__munmap(struct perf_mmap *map)
 
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 {
-	int delta_max;
+	int delta_max, i, prio;
 	/*
 	 * The last one will be done at perf_mmap__consume(), so that we
 	 * make sure we don't prevent tools from consuming every last event in
@@ -193,28 +203,51 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 		map->base = NULL;
 		return -1;
 	}
-	delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
-	map->data = malloc(perf_mmap__mmap_len(map));
+	map->nr_cblocks = mp->nr_cblocks;
+	map->aiocb = calloc(map->nr_cblocks, sizeof(struct aiocb *));
+	if (!map->aiocb) {
+		pr_debug2("failed to allocate aiocb for data buffers, error %d\n",
+				errno);
+		return -1;
+	}
+	map->cblocks = calloc(map->nr_cblocks, sizeof(struct aiocb));
+	if (!map->cblocks) {
+		pr_debug2("failed to allocate cblocks for data buffers, error %d\n",
+				errno);
+		return -1;
+	}
+	map->data = calloc(map->nr_cblocks, sizeof(void *));
 	if (!map->data) {
 		pr_debug2("failed to allocate data buffer, error %d\n",
 				errno);
 		return -1;
 	}
-	/*
-	 * Use cblock.aio_fildes value different from -1
-	 * to denote started aio write operation on the
-	 * cblock so it requires explicit record__aio_sync()
-	 * call prior the cblock may be reused again.
-	 */
-	map->cblock.aio_fildes = -1;
-	/*
-	 * Allocate cblock with max priority delta to
-	 * have faster aio_write() calls because queued
-	 * requests are kept in separate per-prio queues
-	 * and adding a new request iterates thru shorter
-	 * per-prio list.
-	 */
-	map->cblock.aio_reqprio = delta_max;
+	delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
+	for (i = 0; i < map->nr_cblocks; ++i) {
+		map->data[i] = malloc(perf_mmap__mmap_len(map));
+		if (!map->data[i]) {
+			pr_debug2("failed to allocate buffer area, error %d\n",
+					errno);
+			return -1;
+		}
+		/*
+		 * Use cblock.aio_fildes value different from -1
+		 * to denote started aio write operation on the
+		 * cblock so it requires explicit record__aio_sync()
+		 * call prior the cblock may be reused again.
+		 */
+		map->cblocks[i].aio_fildes = -1;
+		/*
+		 * Allocate cblocks with priority delta to
+		 * have faster aio_write() calls because queued
+		 * requests are kept in separate per-prio queues
+		 * and adding a new request iterates thru shorter
+		 * per-prio list. Blocks with numbers higher than
+		 * _SC_AIO_PRIO_DELTA_MAX go with priority 0.
+		 */
+		prio = delta_max - i;
+		map->cblocks[i].aio_reqprio = prio >= 0 ? prio : 0;
+	}
 	map->fd = fd;
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
@@ -304,7 +337,7 @@ int perf_mmap__read_init(struct perf_mmap *map)
 	return __perf_mmap__read_init(map);
 }
 
-int perf_mmap__push(struct perf_mmap *md, void *to,
+int perf_mmap__push(struct perf_mmap *md, void *to, int idx,
 		    int push(void *to, struct aiocb *cblock, void *data, size_t size))
 {
 	u64 head = perf_mmap__read_head(md);
@@ -318,7 +351,7 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 		return (rc == -EAGAIN) ? 0 : -1;
 
 	/*
-	 * md->base data is copied into md->data buffer to
+	 * md->base data is copied into md->data[idx] buffer to
 	 * release space in the kernel buffer as fast as possible,
 	 * thru perf_mmap__consume() below.
 	 *
@@ -340,20 +373,20 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 		buf = &data[md->start & md->mask];
 		size = md->mask + 1 - (md->start & md->mask);
 		md->start += size;
-		memcpy(md->data, buf, size);
+		memcpy(md->data[idx], buf, size);
 		size0 = size;
 	}
 
 	buf = &data[md->start & md->mask];
 	size = md->end - md->start;
 	md->start += size;
-	memcpy(md->data + size0, buf, size);
+	memcpy(md->data[idx] + size0, buf, size);
 
 	/*
-	 * Increment md->refcount to guard md->data buffer
+	 * Increment md->refcount to guard md->data[idx] buffer
 	 * from premature deallocation because md object can be
 	 * released earlier than aio write request started
-	 * on mmap->data is complete.
+	 * on mmap->data[idx] is complete.
 	 *
 	 * perf_mmap__put() is done at record__aio_complete()
 	 * after started request completion.
@@ -363,7 +396,7 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 	md->prev = head;
 	perf_mmap__consume(md);
 
-	rc = push(to, &md->cblock, md->data, size0 + size);
+	rc = push(to, &md->cblocks[idx], md->data[idx], size0 + size);
 	if (rc) {
 		/*
 		 * Decrement md->refcount back if aio write
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index a9795a5fe200..04b9ea59b4b6 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -26,8 +26,10 @@ struct perf_mmap {
 	bool		 overwrite;
 	struct auxtrace_mmap auxtrace_mmap;
 	char		 event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
-	void 		 *data;
-	struct aiocb	 cblock;
+	void		 **data;
+	struct aiocb	 *cblocks;
+	struct aiocb	 **aiocb;
+	int		 nr_cblocks;
 };
 
 /*
@@ -59,7 +61,7 @@ enum bkw_mmap_state {
 };
 
 struct mmap_params {
-	int			    prot, mask;
+	int			    prot, mask, nr_cblocks;
 	struct auxtrace_mmap_params auxtrace_mp;
 };
 
@@ -94,7 +96,7 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
 
 union perf_event *perf_mmap__read_event(struct perf_mmap *map);
 
-int perf_mmap__push(struct perf_mmap *md, void *to,
+int perf_mmap__push(struct perf_mmap *md, void *to, int idx,
 		    int push(void *to, struct aiocb *cblock, void *data, size_t size));
 
 size_t perf_mmap__mmap_len(struct perf_mmap *map);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
                   ` (2 preceding siblings ...)
  2018-09-07  7:39 ` [PATCH v8 3/3]: perf record: extend trace writing to multi AIO Alexey Budankov
@ 2018-09-07  9:34 ` Alexey Budankov
  2018-09-10  9:18 ` Ingo Molnar
  4 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-07  9:34 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel



On 07.09.2018 10:07, Alexey Budankov wrote:
> 
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
> 
> At some circumstances the kernel may lack free space in a buffer 
> because the other buffer's half is not yet written to disk due to 
> some other buffer's data writing by the tool at the moment.
> 
> Thus serial trace writing implementation may cause the kernel 
> to loose profiling data and that is what observed when profiling 
> highly parallel CPU bound workloads on machines with big number 
> of cores.
> 
> Experiment with profiling matrix multiplication code executing 128 
> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
> demonstrates data loss metrics value of 98%:
> 
> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>     matrix.gcc
> 
> Data loss metrics is the ratio lost_time/elapsed_time where 
> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> records and elapsed_time is the elapsed application run time 
> under profiling.
> 
> Applying asynchronous trace streaming thru Posix AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> lowers data loss metrics value providing 2x improvement -
> lowering 98% loss to almost 0%.
> 
> ---
>  Alexey Budankov (3):
>         perf util: map data buffer for preserving collected data
> 	  perf record: enable asynchronous trace writing
>         perf record: extend trace writing to multi AIO
>  
>  tools/perf/builtin-record.c | 166 ++++++++++++++++++++++++++++++++++++++++++--
>  tools/perf/perf.h           |   1 +
>  tools/perf/util/evlist.c    |   7 +-
>  tools/perf/util/evlist.h    |   3 +-
>  tools/perf/util/mmap.c      | 114 ++++++++++++++++++++++++++----
>  tools/perf/util/mmap.h      |  11 ++-
>  6 files changed, 277 insertions(+), 25 deletions(-)

The whole thing for 

	git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux perf/core 

repository follows:

 tools/perf/builtin-record.c | 165 ++++++++++++++++++++++++++++++++++++++++++--
 tools/perf/perf.h           |   1 +
 tools/perf/util/evlist.c    |   7 +-
 tools/perf/util/evlist.h    |   3 +-
 tools/perf/util/mmap.c      | 114 ++++++++++++++++++++++++++----
 tools/perf/util/mmap.h      |  11 ++-
 6 files changed, 276 insertions(+), 25 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9853552bcf16..7bb7947072e5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,112 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+		void *buf, size_t size, off_t off)
+{
+	int rc;
+
+	cblock->aio_fildes = trace_fd;
+	cblock->aio_buf    = buf;
+	cblock->aio_nbytes = size;
+	cblock->aio_offset = off;
+	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+	do {
+		rc = aio_write(cblock);
+		if (rc == 0) {
+			break;
+		} else if (errno != EAGAIN) {
+			cblock->aio_fildes = -1;
+			pr_err("failed to queue perf data, error: %m\n");
+			break;
+		}
+	} while (1);
+
+	return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+	void *rem_buf;
+	off_t rem_off;
+	size_t rem_size;
+	int rc, aio_errno;
+	ssize_t aio_ret, written;
+
+	aio_errno = aio_error(cblock);
+	if (aio_errno == EINPROGRESS)
+		return 0;
+
+	written = aio_ret = aio_return(cblock);
+	if (aio_ret < 0) {
+		if (!(aio_errno == EINTR))
+			pr_err("failed to write perf data, error: %m\n");
+		written = 0;
+	}
+
+	rem_size = cblock->aio_nbytes - written;
+
+	if (rem_size == 0) {
+		cblock->aio_fildes = -1;
+		/*
+		 * md->refcount is incremented in perf_mmap__push() for
+		 * every enqueued aio write request so decrement it because
+		 * the request is now complete.
+		 */
+		perf_mmap__put(md);
+		rc = 1;
+	} else {
+		/*
+		 * aio write request may require restart with the
+		 * reminder if the kernel didn't write whole
+		 * chunk at once.
+		 */
+		rem_off = cblock->aio_offset + written;
+		rem_buf = (void *)(cblock->aio_buf + written);
+		record__aio_write(cblock, cblock->aio_fildes,
+				rem_buf, rem_size, rem_off);
+		rc = 0;
+	}
+
+	return rc;
+}
+
+static int record__aio_sync(struct perf_mmap *md, bool sync_all)
+{
+	struct aiocb **aiocb = md->aiocb;
+	struct aiocb *cblocks = md->cblocks;
+	struct timespec timeout = { 0, 1000 * 1000  * 1 }; // 1ms
+	int i, do_suspend;
+
+	do {
+		do_suspend = 0;
+		for (i = 0; i < md->nr_cblocks; ++i) {
+			if (cblocks[i].aio_fildes == -1 || record__aio_complete(md, &cblocks[i])) {
+				if (sync_all)
+					aiocb[i] = NULL;
+				else
+					return i;
+			} else {
+				/*
+				 * Started aio write is not complete yet
+				 * so it has to be waited before the
+				 * next allocation.
+				 */
+				aiocb[i] = &cblocks[i];
+				do_suspend = 1;
+			}
+		}
+		if (!do_suspend)
+			return -1;
+
+		while (aio_suspend((const struct aiocb **)aiocb, md->nr_cblocks, &timeout)) {
+			if (!(errno == EAGAIN || errno == EINTR))
+				pr_err("failed to sync perf data, error: %m\n");
+		}
+	} while (1);
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
@@ -130,12 +236,27 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
-static int record__pushfn(void *to, void *bf, size_t size)
+static int record__pushfn(void *to, struct aiocb *cblock, void *data, size_t size)
 {
+	off_t off;
 	struct record *rec = to;
+	int ret, trace_fd = rec->session->data->file.fd;
 
 	rec->samples++;
-	return record__write(rec, bf, size);
+
+	off = lseek(trace_fd, 0, SEEK_CUR);
+	lseek(trace_fd, off + size, SEEK_SET);
+	ret = record__aio_write(cblock, trace_fd, data, size, off);
+	if (!ret) {
+		rec->bytes_written += size;
+
+		if (switch_output_size(rec))
+			trigger_hit(&switch_output_trigger);
+	} else {
+		lseek(trace_fd, off, SEEK_SET);
+	}
+
+	return ret;
 }
 
 static volatile int done;
@@ -326,7 +447,8 @@ static int record__mmap_evlist(struct record *rec,
 
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
-				 opts->auxtrace_snapshot_mode) < 0) {
+				 opts->auxtrace_snapshot_mode,
+				 opts->nr_cblocks) < 0) {
 		if (errno == EPERM) {
 			pr_err("Permission error mapping pages.\n"
 			       "Consider increasing "
@@ -510,11 +632,24 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static void record__mmap_read_sync(struct record *rec)
+{
+	int i;
+	struct perf_evlist *evlist = rec->evlist;
+	struct perf_mmap *maps = evlist->mmap;
+
+	for (i = 0; i < evlist->nr_mmaps; i++) {
+		struct perf_mmap *map = &maps[i];
+		if (map->base)
+			record__aio_sync(map, true);
+	}
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
 	u64 bytes_written = rec->bytes_written;
-	int i;
+	int i, idx;
 	int rc = 0;
 	struct perf_mmap *maps;
 
@@ -532,7 +667,13 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
 
 		if (maps[i].base) {
-			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
+			/*
+			 * Call record__aio_sync() to get some free map->data
+			 * buffer or wait if all of previously started aio
+			 * writes are still incomplete.
+			 */
+			idx = record__aio_sync(&maps[i], false);
+			if (perf_mmap__push(&maps[i], rec, idx, record__pushfn) != 0) {
 				rc = -1;
 				goto out;
 			}
@@ -1054,6 +1195,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			perf_evlist__toggle_bkw_mmap(rec->evlist, BKW_MMAP_DATA_PENDING);
 
 		if (record__mmap_read_all(rec) < 0) {
+			record__mmap_read_sync(rec);
 			trigger_error(&auxtrace_snapshot_trigger);
 			trigger_error(&switch_output_trigger);
 			err = -1;
@@ -1065,6 +1207,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			if (!trigger_is_error(&auxtrace_snapshot_trigger))
 				record__read_auxtrace_snapshot(rec);
 			if (trigger_is_error(&auxtrace_snapshot_trigger)) {
+				record__mmap_read_sync(rec);
 				pr_err("AUX area tracing snapshot failed\n");
 				err = -1;
 				goto out_child;
@@ -1083,6 +1226,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			 */
 			if (rec->evlist->bkw_mmap_state == BKW_MMAP_RUNNING)
 				continue;
+			record__mmap_read_sync(rec);
 			trigger_ready(&switch_output_trigger);
 
 			/*
@@ -1136,6 +1280,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 			disabled = true;
 		}
 	}
+	record__mmap_read_sync(rec);
 	trigger_off(&auxtrace_snapshot_trigger);
 	trigger_off(&switch_output_trigger);
 
@@ -1287,6 +1432,8 @@ static int perf_record_config(const char *var, const char *value, void *cb)
 		var = "call-graph.record-mode";
 		return perf_default_config(var, value, cb);
 	}
+	if (!strcmp(var, "record.aio-cblocks"))
+		rec->opts.nr_cblocks = strtol(value, NULL, 0);
 
 	return 0;
 }
@@ -1519,6 +1666,7 @@ static struct record record = {
 			.default_per_cpu = true,
 		},
 		.proc_map_timeout     = 500,
+		.nr_cblocks	      = 1
 	},
 	.tool = {
 		.sample		= process_sample_event,
@@ -1678,6 +1826,8 @@ static struct option __record_options[] = {
 			  "signal"),
 	OPT_BOOLEAN(0, "dry-run", &dry_run,
 		    "Parse options then exit"),
+	OPT_INTEGER(0, "aio-cblocks", &record.opts.nr_cblocks,
+		    "Max number of simultaneous per-mmap AIO trace writes (min: 1, max: 4, default: 1)"),
 	OPT_END()
 };
 
@@ -1870,6 +2020,11 @@ int cmd_record(int argc, const char **argv)
 		goto out;
 	}
 
+	if (!(rec->opts.nr_cblocks >= 1 && rec->opts.nr_cblocks <= 4))
+		rec->opts.nr_cblocks = 1;
+	if (verbose > 0)
+		pr_info("AIO trace writes: %d\n", rec->opts.nr_cblocks);
+
 	err = __cmd_record(&record, argc, argv);
 out:
 	perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 21bf7f5a3cf5..0a1ae2ae567a 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -82,6 +82,7 @@ struct record_opts {
 	bool         use_clockid;
 	clockid_t    clockid;
 	unsigned int proc_map_timeout;
+	int	     nr_cblocks;
 };
 
 struct option;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index be440df29615..cc1381fb4ab3 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1018,7 +1018,8 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
  */
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite)
+			 bool auxtrace_overwrite,
+			 int nr_cblocks)
 {
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
@@ -1028,7 +1029,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	 * Its value is decided by evsel's write_backward.
 	 * So &mp should not be passed through const pointer.
 	 */
-	struct mmap_params mp;
+	struct mmap_params mp = { .nr_cblocks = nr_cblocks };
 
 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1060,7 +1061,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
 {
-	return perf_evlist__mmap_ex(evlist, pages, 0, false);
+	return perf_evlist__mmap_ex(evlist, pages, 0, false, 1);
 }
 
 int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..a94d3c613254 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -162,7 +162,8 @@ unsigned long perf_event_mlock_kb_in_pages(void);
 
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite);
+			 bool auxtrace_overwrite,
+			 int nr_cblocks);
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
 void perf_evlist__munmap(struct perf_evlist *evlist);
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 215f69f41672..46e541cfa3b3 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,18 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+	int i;
+
+	if (map->aiocb)
+		zfree(&map->aiocb);
+	if (map->cblocks)
+		zfree(&map->cblocks);
+	for (i = 0; i < map->nr_cblocks; ++i) {
+		if (map->data[i])
+			zfree(&map->data[i]);
+	}
+	if (map->data)
+		zfree(&map->data);
 	if (map->base != NULL) {
 		munmap(map->base, perf_mmap__mmap_len(map));
 		map->base = NULL;
@@ -166,6 +178,7 @@ void perf_mmap__munmap(struct perf_mmap *map)
 
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int cpu)
 {
+	int delta_max, i, prio;
 	/*
 	 * The last one will be done at perf_mmap__consume(), so that we
 	 * make sure we don't prevent tools from consuming every last event in
@@ -190,6 +203,51 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
 		map->base = NULL;
 		return -1;
 	}
+	map->nr_cblocks = mp->nr_cblocks;
+	map->aiocb = calloc(map->nr_cblocks, sizeof(struct aiocb *));
+	if (!map->aiocb) {
+		pr_debug2("failed to allocate aiocb for data buffer, error %d\n",
+				errno);
+		return -1;
+	}
+	map->cblocks = calloc(map->nr_cblocks, sizeof(struct aiocb));
+	if (!map->cblocks) {
+		pr_debug2("failed to allocate cblocks for data buffer, error %d\n",
+				errno);
+		return -1;
+	}
+	map->data = calloc(map->nr_cblocks, sizeof(void *));
+	if (!map->data) {
+		pr_debug2("failed to allocate data buffer, error %d\n",
+				errno);
+		return -1;
+	}
+	delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
+	for (i = 0; i < map->nr_cblocks; ++i) {
+		map->data[i] = malloc(perf_mmap__mmap_len(map));
+		if (!map->data[i]) {
+			pr_debug2("failed to allocate buffer area, error %d\n",
+					errno);
+			return -1;
+		}
+		/*
+		 * Use cblock.aio_fildes value different from -1
+		 * to denote started aio write operation on the
+		 * cblock so it requires explicit record__aio_sync()
+		 * call prior the cblock may be reused again.
+		 */
+		map->cblocks[i].aio_fildes = -1;
+		/*
+		 * Allocate cblocks with priority delta to
+		 * have faster aio_write() calls because queued
+		 * requests are kept in separate per-prio queues
+		 * and adding a new request iterates thru shorter
+		 * per-prio list. Blocks with numbers higher than
+		 * _SC_AIO_PRIO_DELTA_MAX go with priority 0.
+		 */
+		prio = delta_max - i;
+		map->cblocks[i].aio_reqprio = prio >= 0 ? prio : 0;
+	}
 	map->fd = fd;
 	map->cpu = cpu;
 
@@ -280,12 +338,12 @@ int perf_mmap__read_init(struct perf_mmap *map)
 	return __perf_mmap__read_init(map);
 }
 
-int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size))
+int perf_mmap__push(struct perf_mmap *md, void *to, int idx,
+		    int push(void *to, struct aiocb *cblock, void *data, size_t size))
 {
 	u64 head = perf_mmap__read_head(md);
 	unsigned char *data = md->base + page_size;
-	unsigned long size;
+	unsigned long size, size0 = 0;
 	void *buf;
 	int rc = 0;
 
@@ -293,31 +351,61 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 	if (rc < 0)
 		return (rc == -EAGAIN) ? 0 : -1;
 
+	/*
+	 * md->base data is copied into md->data[idx] buffer to
+	 * release space in the kernel buffer as fast as possible,
+	 * thru perf_mmap__consume() below.
+	 *
+	 * That lets the kernel to proceed with storing more
+	 * profiling data into the kernel buffer earlier than other
+	 * per-cpu kernel buffers are handled.
+	 *
+	 * Coping can be done in two steps in case the chunk of
+	 * profiling data crosses the upper bound of the kernel buffer.
+	 * In this case we first move part of data from md->start
+	 * till the upper bound and then the reminder from the
+	 * beginning of the kernel buffer till the end of
+	 * the data chunk.
+	 */
+
 	size = md->end - md->start;
 
 	if ((md->start & md->mask) + size != (md->end & md->mask)) {
 		buf = &data[md->start & md->mask];
 		size = md->mask + 1 - (md->start & md->mask);
 		md->start += size;
-
-		if (push(to, buf, size) < 0) {
-			rc = -1;
-			goto out;
-		}
+		memcpy(md->data[idx], buf, size);
+		size0 = size;
 	}
 
 	buf = &data[md->start & md->mask];
 	size = md->end - md->start;
 	md->start += size;
+	memcpy(md->data[idx] + size0, buf, size);
 
-	if (push(to, buf, size) < 0) {
-		rc = -1;
-		goto out;
-	}
+	/*
+	 * Increment md->refcount to guard md->data[idx] buffer
+	 * from premature deallocation because md object can be
+	 * released earlier than aio write request started
+	 * on mmap->data[idx] is complete.
+	 *
+	 * perf_mmap__put() is done at record__aio_complete()
+	 * after started request completion.
+	 */
+	perf_mmap__get(md);
 
 	md->prev = head;
 	perf_mmap__consume(md);
-out:
+
+	rc = push(to, &md->cblocks[idx], md->data[idx], size0 + size);
+	if (rc) {
+		/*
+		 * Decrement md->refcount back if aio write
+		 * operation failed to start.
+		 */
+		perf_mmap__put(md);
+	}
+
 	return rc;
 }
 
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47c7956..d279c7429b3f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <asm/barrier.h>
 #include <stdbool.h>
+#include <aio.h>
 #include "auxtrace.h"
 #include "event.h"
 
@@ -26,6 +27,10 @@ struct perf_mmap {
 	bool		 overwrite;
 	struct auxtrace_mmap auxtrace_mmap;
 	char		 event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
+	void		 **data;
+	struct aiocb	 *cblocks;
+	struct aiocb	 **aiocb;
+	int		 nr_cblocks;
 };
 
 /*
@@ -57,7 +62,7 @@ enum bkw_mmap_state {
 };
 
 struct mmap_params {
-	int			    prot, mask;
+	int			    prot, mask, nr_cblocks;
 	struct auxtrace_mmap_params auxtrace_mp;
 };
 
@@ -92,8 +97,8 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
 
 union perf_event *perf_mmap__read_event(struct perf_mmap *map);
 
-int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size));
+int perf_mmap__push(struct perf_mmap *md, void *to, int idx,
+		    int push(void *to, struct aiocb *cblock, void *data, size_t size));
 
 size_t perf_mmap__mmap_len(struct perf_mmap *map);

> 
> ---
>  Changes in v8:
>  - run the whole thing thru checkpatch.pl and corrected found issues except
>    lines longer than 80 symbols
>  - corrected comments alignment and formatting
>  - moved multi AIO implementation into 3rd patch in the series
>  - implemented explicit cblocks array allocation
>  - split AIO completion check into separate record__aio_complete()
>  - set nr_cblocks default to 1 and max allowed value to 4
>  Changes in v7:
>  - implemented handling record.aio setting from perfconfig file
>  Changes in v6:
>  - adjusted setting of priorities for cblocks;
>  - handled errno == EAGAIN case from aio_write() return;
>  Changes in v5:
>  - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=100000
>  - data loss metrics decreased from 25% to 2x in trialed configuration;
>  - reshaped layout of data structures;
>  - implemented --aio option;
>  - avoided nanosleep() prior calling aio_suspend();
>  - switched to per-cpu aio multi buffer record__aio_sync();
>  - record_mmap_read_sync() now does global sync just before 
>    switching trace file or collection stop;
>  Changes in v4:
>  - converted mmap()/munmap() to malloc()/free() for mmap->data buffer management
>  - converted void *bf to struct perf_mmap *md in signatures
>  - written comment in perf_mmap__push() just before perf_mmap__get();
>  - written comment in record__mmap_read_sync() on possible restarting 
>    of aio_write() operation and releasing perf_mmap object after all;
>  - added perf_mmap__put() for the cases of failed aio_write();
>  Changes in v3:
>  - written comments about nanosleep(0.5ms) call prior aio_suspend()
>    to cope with intrusiveness of its implementation in glibc;
>  - written comments about rationale behind coping profiling data 
>    into mmap->data buffer;
>  Changes in v2:
>  - converted zalloc() to calloc() for allocation of mmap_aio array,
>  - cleared typo and adjusted fallback branch code;
> 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
                   ` (3 preceding siblings ...)
  2018-09-07  9:34 ` [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
@ 2018-09-10  9:18 ` Ingo Molnar
  2018-09-10  9:59   ` Jiri Olsa
  2018-09-10 10:40   ` Alexey Budankov
  4 siblings, 2 replies; 24+ messages in thread
From: Ingo Molnar @ 2018-09-10  9:18 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


* Alexey Budankov <alexey.budankov@linux.intel.com> wrote:

> 
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
> 
> At some circumstances the kernel may lack free space in a buffer 
> because the other buffer's half is not yet written to disk due to 
> some other buffer's data writing by the tool at the moment.
> 
> Thus serial trace writing implementation may cause the kernel 
> to loose profiling data and that is what observed when profiling 
> highly parallel CPU bound workloads on machines with big number 
> of cores.

Yay! I saw this frequently on a 120-CPU box (hw is broken now).

> Data loss metrics is the ratio lost_time/elapsed_time where 
> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> records and elapsed_time is the elapsed application run time 
> under profiling.
> 
> Applying asynchronous trace streaming thru Posix AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> lowers data loss metrics value providing 2x improvement -
> lowering 98% loss to almost 0%.

Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
events).
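
A minimal sketch of the POSIX AIO path in question, assuming a single ready
chunk and a simple completion poll (illustrative only, not the code from the
patch series):

  #include <aio.h>
  #include <errno.h>
  #include <string.h>
  #include <unistd.h>

  /* Queue one asynchronous write of a ready data chunk to the trace file
   * and poll for completion.  A real consumer would keep draining the
   * other per-cpu buffers instead of sleeping here. */
  static int write_chunk_async(int trace_fd, void *buf, size_t size, off_t off)
  {
          struct aiocb cb;
          int err;

          memset(&cb, 0, sizeof(cb));
          cb.aio_fildes = trace_fd;
          cb.aio_buf    = buf;
          cb.aio_nbytes = size;
          cb.aio_offset = off;

          if (aio_write(&cb))
                  return -errno;

          while ((err = aio_error(&cb)) == EINPROGRESS)
                  usleep(100);

          return err ? -err : (int)aio_return(&cb);
  }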

Per-CPU threading the record session would have so many other advantages as well (scalability, 
etc.).

Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
moment?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10  9:18 ` Ingo Molnar
@ 2018-09-10  9:59   ` Jiri Olsa
  2018-09-10 10:03     ` Ingo Molnar
  2018-09-10 10:40   ` Alexey Budankov
  1 sibling, 1 reply; 24+ messages in thread
From: Jiri Olsa @ 2018-09-10  9:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Budankov, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Sep 10, 2018 at 11:18:41AM +0200, Ingo Molnar wrote:
> 
> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> 
> > 
> > Currently in record mode the tool implements trace writing serially. 
> > The algorithm loops over mapped per-cpu data buffers and stores 
> > ready data chunks into a trace file using write() system call.
> > 
> > At some circumstances the kernel may lack free space in a buffer 
> > because the other buffer's half is not yet written to disk due to 
> > some other buffer's data writing by the tool at the moment.
> > 
> > Thus serial trace writing implementation may cause the kernel 
> > to loose profiling data and that is what observed when profiling 
> > highly parallel CPU bound workloads on machines with big number 
> > of cores.
> 
> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> 
> > Data loss metrics is the ratio lost_time/elapsed_time where 
> > lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> > records and elapsed_time is the elapsed application run time 
> > under profiling.
> > 
> > Applying asynchronous trace streaming thru Posix AIO API
> > (http://man7.org/linux/man-pages/man7/aio.7.html) 
> > lowers data loss metrics value providing 2x improvement -
> > lowering 98% loss to almost 0%.
> 
> Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
> to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
> events).

this patch adds the AIO for writing to the perf.data
file; reading of events is unchanged

> 
> Per-CPU threading the record session would have so many other advantages as well (scalability, 
> etc.).
> 
> Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> moment?

it's still usable, I can rebase it and post a branch pointer,
the problem is I haven't been able to find a case with a real
performance benefit yet.. ;-)

perhaps because I haven't tried on server with really big cpu
numbers

jirka

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10  9:59   ` Jiri Olsa
@ 2018-09-10 10:03     ` Ingo Molnar
  2018-09-10 10:08       ` Jiri Olsa
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2018-09-10 10:03 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel


* Jiri Olsa <jolsa@redhat.com> wrote:

> > Per-CPU threading the record session would have so many other advantages as well (scalability, 
> > etc.).
> > 
> > Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> > moment?
> 
> it's still usable, I can rebase it and post a branch pointer,
> the problem is I haven't been able to find a case with a real
> performance benefit yet.. ;-)
> 
> perhaps because I haven't tried on server with really big cpu
> numbers

Maybe Alexey could pick up from there? Your concept looked fairly mature to me
and I tried it on a big-CPU box back then and there were real improvements.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 10:03     ` Ingo Molnar
@ 2018-09-10 10:08       ` Jiri Olsa
  2018-09-10 10:13         ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Olsa @ 2018-09-10 10:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Budankov, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > > Per-CPU threading the record session would have so many other advantages as well (scalability, 
> > > etc.).
> > > 
> > > Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> > > moment?
> > 
> > it's still usable, I can rebase it and post a branch pointer,
> > the problem is I haven't been able to find a case with a real
> > performance benefit yet.. ;-)
> > 
> > perhaps because I haven't tried on server with really big cpu
> > numbers
> 
> Maybe Alexey could pick up from there? Your concept looked fairly mature to me
> and I tried it on a big-CPU box back then and there were real improvements.

too bad u did not share your results, it could have been already in ;-)

let me rebase/repost once more and let's see

I think we could benefit from both multiple threads event reading
and AIO writing for perf.data.. it could be merged together

jirka

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 10:08       ` Jiri Olsa
@ 2018-09-10 10:13         ` Ingo Molnar
  2018-09-10 10:23           ` Jiri Olsa
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2018-09-10 10:13 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel


* Jiri Olsa <jolsa@redhat.com> wrote:

> On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> > 
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > > Per-CPU threading the record session would have so many other advantages as well (scalability, 
> > > > etc.).
> > > > 
> > > > Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> > > > moment?
> > > 
> > > it's still usable, I can rebase it and post a branch pointer,
> > > the problem is I haven't been able to find a case with a real
> > > performance benefit yet.. ;-)
> > > 
> > > perhaps because I haven't tried on server with really big cpu
> > > numbers
> > 
> > Maybe Alexey could pick up from there? Your concept looked fairly mature to me
> > and I tried it on a big-CPU box back then and there were real improvements.
> 
> too bad u did not share your results, it could have been already in ;-)

Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have tested it on
broke ...

> let me rebase/repost once more and let's see

Thanks!

> I think we could benefit from both multiple threads event reading
> and AIO writing for perf.data.. it could be merged together

So instead of AIO writing perf.data, why not just turn perf.data into a directory structure 
with per CPU files? That would allow all sorts of neat future performance features such as 
mmap() or splice() based zero-copy.
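
One possible shape of such a layout, with the directory and file names being
assumptions made purely for illustration (this is not an existing perf
format):

  #include <fcntl.h>
  #include <limits.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  /* Hypothetical per-CPU trace layout: perf.data becomes a directory
   * holding one stream file per CPU. */
  static int open_percpu_trace(const char *dir, int cpu)
  {
          char path[PATH_MAX];

          mkdir(dir, 0755);               /* e.g. "perf.data/" */
          snprintf(path, sizeof(path), "%s/data.cpu%d", dir, cpu);
          return open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
  }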

User-space post-processing can then read the files and put them into global order - or use the 
per CPU nature of them, which would be pretty useful too.

Also note how well this works on NUMA as well, as the backing pages would be allocated in a 
NUMA-local fashion.

I.e. the whole per-CPU threading would enable such a separation of the tracing/event streams 
and would allow true scalability.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 10:13         ` Ingo Molnar
@ 2018-09-10 10:23           ` Jiri Olsa
  2018-09-10 10:45             ` Alexey Budankov
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Olsa @ 2018-09-10 10:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Budankov, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> > > 
> > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > > > > Per-CPU threading the record session would have so many other advantages as well (scalability, 
> > > > > etc.).
> > > > > 
> > > > > Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> > > > > moment?
> > > > 
> > > > it's still usable, I can rebase it and post a branch pointer,
> > > > the problem is I haven't been able to find a case with a real
> > > > performance benefit yet.. ;-)
> > > > 
> > > > perhaps because I haven't tried on server with really big cpu
> > > > numbers
> > > 
> > > Maybe Alexey could pick up from there? Your concept looked fairly mature to me
> > > and I tried it on a big-CPU box back then and there were real improvements.
> > 
> > too bad u did not share your results, it could have been already in ;-)
> 
> Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have tested it on
> broke ...
> 
> > let me rebase/repost once more and let's see
> 
> Thanks!
> 
> > I think we could benefit from both multiple threads event reading
> > and AIO writing for perf.data.. it could be merged together
> 
> So instead of AIO writing perf.data, why not just turn perf.data into a directory structure 
> with per CPU files? That would allow all sorts of neat future performance features such as 

that's basically what the multiple-thread record patchset does

jirka

> mmap() or splice() based zero-copy.
> 
> User-space post-processing can then read the files and put them into global order - or use the 
> per CPU nature of them, which would be pretty useful too.
> 
> Also note how well this works on NUMA as well, as the backing pages would be allocated in a 
> NUMA-local fashion.
> 
> I.e. the whole per-CPU threading would enable such a separation of the tracing/event streams 
> and would allow true scalability.
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10  9:18 ` Ingo Molnar
  2018-09-10  9:59   ` Jiri Olsa
@ 2018-09-10 10:40   ` Alexey Budankov
  2018-09-10 12:06     ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Alexey Budankov @ 2018-09-10 10:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel

Hi Ingo,

On 10.09.2018 12:18, Ingo Molnar wrote:
> 
> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> 
>>
>> Currently in record mode the tool implements trace writing serially. 
>> The algorithm loops over mapped per-cpu data buffers and stores 
>> ready data chunks into a trace file using write() system call.
>>
>> At some circumstances the kernel may lack free space in a buffer 
>> because the other buffer's half is not yet written to disk due to 
>> some other buffer's data writing by the tool at the moment.
>>
>> Thus serial trace writing implementation may cause the kernel 
>> to loose profiling data and that is what observed when profiling 
>> highly parallel CPU bound workloads on machines with big number 
>> of cores.
> 
> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> 
>> Data loss metrics is the ratio lost_time/elapsed_time where 
>> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
>> records and elapsed_time is the elapsed application run time 
>> under profiling.
>>
>> Applying asynchronous trace streaming thru Posix AIO API
>> (http://man7.org/linux/man-pages/man7/aio.7.html) 
>> lowers data loss metrics value providing 2x improvement -
>> lowering 98% loss to almost 0%.
> 
> Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
> to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
> events).

Explicit threading is surely an option but having more threads 
in the tool that stream performance data is a considerable 
design complication.

Luckily, glibc AIO implementation is already based on pthreads, 
but having a writing thread for every distinct fd only.

> 
> Per-CPU threading the record session would have so many other advantages as well (scalability, 
> etc.).
> 
> Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
> moment?

Tool threads may contend with application threads (and actually do) 
under heavy load when all CPU cores are utilized, and this may alter 
the performance profile.

Thanks,
Alexey

> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 10:23           ` Jiri Olsa
@ 2018-09-10 10:45             ` Alexey Budankov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-10 10:45 UTC (permalink / raw)
  To: Jiri Olsa, Ingo Molnar
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 10.09.2018 13:23, Jiri Olsa wrote:
> On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote:
>>
>> * Jiri Olsa <jolsa@redhat.com> wrote:
>>
>>> On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
>>>>
>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
>>>>
>>>>>> Per-CPU threading the record session would have so many other advantages as well (scalability, 
>>>>>> etc.).
>>>>>>
>>>>>> Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the 
>>>>>> moment?
>>>>>
>>>>> it's still usable, I can rebase it and post a branch pointer,
>>>>> the problem is I haven't been able to find a case with a real
>>>>> performance benefit yet.. ;-)
>>>>>
>>>>> perhaps because I haven't tried on server with really big cpu
>>>>> numbers
>>>>
>>>> Maybe Alexey could pick up from there? Your concept looked fairly mature to me
>>>> and I tried it on a big-CPU box back then and there were real improvements.
>>>
>>> too bad u did not share your results, it could have been already in ;-)
>>
>> Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have tested it on
>> broke ...
>>
>>> let me rebase/repost once more and let's see
>>
>> Thanks!
>>
>>> I think we could benefit from both multiple threads event reading
>>> and AIO writing for perf.data.. it could be merged together
>>
>> So instead of AIO writing perf.data, why not just turn perf.data into a directory structure 
>> with per CPU files? That would allow all sorts of neat future performance features such as 
> 
> that's basically what the multiple-thread record patchset does

Re-posting part of my answer here...

Please note that tool threads may contend with application threads 
(and actually do) under heavy load when all CPU cores are utilized,
and this may alter the performance profile.

So whichever tool design is chosen, proper system balancing during 
profiling also matters, so that the gathered performance data stays 
representative.

Thanks,
Alexey

> 
> jirka
> 
>> mmap() or splice() based zero-copy.
>>
>> User-space post-processing can then read the files and put them into global order - or use the 
>> per CPU nature of them, which would be pretty useful too.
>>
>> Also note how well this works on NUMA as well, as the backing pages would be allocated in a 
>> NUMA-local fashion.
>>
>> I.e. the whole per-CPU threading would enable such a separation of the tracing/event streams 
>> and would allow true scalability.
>>
>> Thanks,
>>
>> 	Ingo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 10:40   ` Alexey Budankov
@ 2018-09-10 12:06     ` Ingo Molnar
  2018-09-10 13:58       ` Arnaldo Carvalho de Melo
  2018-09-10 14:48       ` Alexey Budankov
  0 siblings, 2 replies; 24+ messages in thread
From: Ingo Molnar @ 2018-09-10 12:06 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


* Alexey Budankov <alexey.budankov@linux.intel.com> wrote:

> Hi Ingo,
> 
> On 10.09.2018 12:18, Ingo Molnar wrote:
> > 
> > * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> > 
> >>
> >> Currently in record mode the tool implements trace writing serially. 
> >> The algorithm loops over mapped per-cpu data buffers and stores 
> >> ready data chunks into a trace file using write() system call.
> >>
> >> At some circumstances the kernel may lack free space in a buffer 
> >> because the other buffer's half is not yet written to disk due to 
> >> some other buffer's data writing by the tool at the moment.
> >>
> >> Thus serial trace writing implementation may cause the kernel 
> >> to loose profiling data and that is what observed when profiling 
> >> highly parallel CPU bound workloads on machines with big number 
> >> of cores.
> > 
> > Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> > 
> >> Data loss metrics is the ratio lost_time/elapsed_time where 
> >> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> >> records and elapsed_time is the elapsed application run time 
> >> under profiling.
> >>
> >> Applying asynchronous trace streaming thru Posix AIO API
> >> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> >> lowers data loss metrics value providing 2x improvement -
> >> lowering 98% loss to almost 0%.
> > 
> > Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
> > to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
> > events).
> 
> Explicit threading is surely an option but having more threads 
> in the tool that stream performance data is a considerable 
> design complication.
> 
> Luckily, glibc AIO implementation is already based on pthreads, 
> but having a writing thread for every distinct fd only.

My argument is, we don't want to rely on glibc's choices here. They might
use a different threading design in the future, or it might differ between
libc versions.

The basic flow of tracing/profiling data is something we should control explicitly,
via explicit threading.

BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not 
inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU 
separation, i.e. per CPU event fds, per CPU threads that read and then write into separate 
per-CPU files.
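
A rough sketch of that per-CPU reader model; the structure, the drain
callback and the thread body below are invented for illustration and are
not taken from Jiri's patches:

  #include <pthread.h>
  #include <stdbool.h>
  #include <unistd.h>

  struct cpu_stream {
          int   cpu;
          int   event_fd;   /* perf_event_open() fd bound to this CPU */
          void  *ring;      /* mmap'ed ring buffer for this CPU */
          int   out_fd;     /* per-CPU trace file */
          int   (*drain)(void *ring, int out_fd); /* copy ready data, return bytes */
          volatile bool done;
  };

  static void *cpu_reader(void *arg)
  {
          struct cpu_stream *s = arg;

          while (!s->done) {
                  if (s->drain(s->ring, s->out_fd) == 0)
                          usleep(1000);   /* nothing ready, back off briefly */
          }
          s->drain(s->ring, s->out_fd);   /* final flush on stop */
          return NULL;
  }

  /* One such thread per online CPU, e.g.
   *   pthread_create(&tid[cpu], NULL, cpu_reader, &streams[cpu]);
   * optionally pinned to its CPU with pthread_setaffinity_np(). */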

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 12:06     ` Ingo Molnar
@ 2018-09-10 13:58       ` Arnaldo Carvalho de Melo
  2018-09-10 15:19         ` Alexey Budankov
  2018-09-10 14:48       ` Alexey Budankov
  1 sibling, 1 reply; 24+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-09-10 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Budankov, Peter Zijlstra, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Andi Kleen, linux-kernel

Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu:
> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> > On 10.09.2018 12:18, Ingo Molnar wrote:
> > > * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> > >> Currently in record mode the tool implements trace writing serially. 
> > >> The algorithm loops over mapped per-cpu data buffers and stores 
> > >> ready data chunks into a trace file using write() system call.
> > >>
> > >> At some circumstances the kernel may lack free space in a buffer 
> > >> because the other buffer's half is not yet written to disk due to 
> > >> some other buffer's data writing by the tool at the moment.
> > >>
> > >> Thus serial trace writing implementation may cause the kernel 
> > >> to loose profiling data and that is what observed when profiling 
> > >> highly parallel CPU bound workloads on machines with big number 
> > >> of cores.
> > > 
> > > Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> > > 
> > >> Data loss metrics is the ratio lost_time/elapsed_time where 
> > >> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> > >> records and elapsed_time is the elapsed application run time 
> > >> under profiling.
> > >>
> > >> Applying asynchronous trace streaming thru Posix AIO API
> > >> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> > >> lowers data loss metrics value providing 2x improvement -
> > >> lowering 98% loss to almost 0%.
> > > 
> > > Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
> > > to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
> > > events).
> > 
> > Explicit threading is surely an option but having more threads 
> > in the tool that stream performance data is a considerable 
> > design complication.
> > 
> > Luckily, glibc AIO implementation is already based on pthreads, 
> > but having a writing thread for every distinct fd only.
> 
> My argument is, we don't want to rely on glibc's choices here. They might
> use a different threading design in the future, or it might differ between
> libc versions.
> 
> The basic flow of tracing/profiling data is something we should control explicitly,
> via explicit threading.
> 
> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not 
> inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU 
> separation, i.e. per CPU event fds, per CPU threads that read and then write into separate 
> per-CPU files.

My main request here is that we think about the 'perf top' and 'perf
trace' workflows as well when working on this, i.e. that we don't take
for granted that we'll have the perf.data files to work with.

I.e. N threads that periodically use that FINISHED_ROUND event to order
events and go on consuming. All of the objects already have refcounts
and locking to allow for things like decaying of samples to take care of
throwing away no longer needed objects (struct map, thread, dso, symbol
tables, etc.) to trim memory usage.
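
A much-simplified illustration of that round-based ordering; the types and
helper below are mine, not perf's internal API:

  #include <stdint.h>
  #include <stdlib.h>

  struct buffered_event {
          uint64_t time;
          void     *data;
  };

  static int cmp_time(const void *a, const void *b)
  {
          const struct buffered_event *ea = a, *eb = b;

          return ea->time < eb->time ? -1 : ea->time > eb->time;
  }

  /* On a FINISHED_ROUND boundary: deliver, in timestamp order, every
   * buffered event that can no longer be reordered (time <= limit) and
   * keep the rest for the next round.  Returns the number kept. */
  static size_t flush_round(struct buffered_event *evs, size_t n,
                            uint64_t limit, void (*deliver)(void *data))
  {
          size_t i, kept = 0;

          qsort(evs, n, sizeof(*evs), cmp_time);
          for (i = 0; i < n; i++) {
                  if (evs[i].time <= limit)
                          deliver(evs[i].data);
                  else
                          evs[kept++] = evs[i];
          }
          return kept;
  }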

- Arnaldo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 12:06     ` Ingo Molnar
  2018-09-10 13:58       ` Arnaldo Carvalho de Melo
@ 2018-09-10 14:48       ` Alexey Budankov
  2018-09-11  6:35         ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Alexey Budankov @ 2018-09-10 14:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel

Hi Ingo,

On 10.09.2018 15:06, Ingo Molnar wrote:
> 
> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> 
>> Hi Ingo,
>>
>> On 10.09.2018 12:18, Ingo Molnar wrote:
>>>
>>> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
>>>
>>>>
>>>> Currently in record mode the tool implements trace writing serially. 
>>>> The algorithm loops over mapped per-cpu data buffers and stores 
>>>> ready data chunks into a trace file using write() system call.
>>>>
>>>> At some circumstances the kernel may lack free space in a buffer 
>>>> because the other buffer's half is not yet written to disk due to 
>>>> some other buffer's data writing by the tool at the moment.
>>>>
>>>> Thus serial trace writing implementation may cause the kernel 
>>>> to loose profiling data and that is what observed when profiling 
>>>> highly parallel CPU bound workloads on machines with big number 
>>>> of cores.
>>>
>>> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
>>>
>>>> Data loss metrics is the ratio lost_time/elapsed_time where 
>>>> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
>>>> records and elapsed_time is the elapsed application run time 
>>>> under profiling.
>>>>
>>>> Applying asynchronous trace streaming thru Posix AIO API
>>>> (http://man7.org/linux/man-pages/man7/aio.7.html) 
>>>> lowers data loss metrics value providing 2x improvement -
>>>> lowering 98% loss to almost 0%.
>>>
>>> Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
>>> to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
>>> events).
>>
>> Explicit threading is surely an option but having more threads 
>> in the tool that stream performance data is a considerable 
>> design complication.
>>
>> Luckily, glibc AIO implementation is already based on pthreads, 
>> but having a writing thread for every distinct fd only.
> 
> My argument is, we don't want to rely on glibc's choices here. They might
> use a different threading design in the future, or it might differ between
> libc versions.
> 
> The basic flow of tracing/profiling data is something we should control explicitly,
> via explicit threading.

It may sound too optimistic but glibc API is expected to be backward compatible 
and for POSIX AIO API part too. Internal implementation also tends to evolve to 
better option overtime, more probably basing on modern kernel capabilities 
mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
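
For reference, the kernel-native interface behind that man page looks roughly
like the sketch below when used through libaio; this is purely illustrative
context and is not what the posted patches use (they use POSIX AIO):

  #include <libaio.h>
  #include <string.h>

  /* Submit one write through the in-kernel AIO path and wait for its
   * completion event. */
  static int kaio_write_once(int fd, void *buf, size_t len, long long off)
  {
          io_context_t ctx;
          struct iocb cb, *cbs[1] = { &cb };
          struct io_event ev;
          int ret;

          memset(&ctx, 0, sizeof(ctx));
          if (io_setup(1, &ctx) < 0)
                  return -1;

          io_prep_pwrite(&cb, fd, buf, len, off);
          ret = io_submit(ctx, 1, cbs);
          if (ret == 1)
                  ret = io_getevents(ctx, 1, 1, &ev, NULL) == 1 ? 0 : -1;
          else
                  ret = -1;

          io_destroy(ctx);
          return ret;
  }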

Well, explicit threading in the tool for AIO, in the simplest case, means 
incorporating some POSIX API implementation into the tool, avoiding 
code reuse in the first place. That tends to be error prone and costly.

Regards,
Alexey

> 
> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not 
> inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU 
> separation, i.e. per CPU event fds, per CPU threads that read and then write into separate 
> per-CPU files.
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 13:58       ` Arnaldo Carvalho de Melo
@ 2018-09-10 15:19         ` Alexey Budankov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-10 15:19 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar
  Cc: Peter Zijlstra, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Andi Kleen, linux-kernel

Hi,

On 10.09.2018 16:58, Arnaldo Carvalho de Melo wrote:
> Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu:
>> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
>>> On 10.09.2018 12:18, Ingo Molnar wrote:
>>>> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
>>>>> Currently in record mode the tool implements trace writing serially. 
>>>>> The algorithm loops over mapped per-cpu data buffers and stores 
>>>>> ready data chunks into a trace file using write() system call.
>>>>>
>>>>> At some circumstances the kernel may lack free space in a buffer 
>>>>> because the other buffer's half is not yet written to disk due to 
>>>>> some other buffer's data writing by the tool at the moment.
>>>>>
>>>>> Thus serial trace writing implementation may cause the kernel 
>>>>> to loose profiling data and that is what observed when profiling 
>>>>> highly parallel CPU bound workloads on machines with big number 
>>>>> of cores.
>>>>
>>>> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
>>>>
>>>>> Data loss metrics is the ratio lost_time/elapsed_time where 
>>>>> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
>>>>> records and elapsed_time is the elapsed application run time 
>>>>> under profiling.
>>>>>
>>>>> Applying asynchronous trace streaming thru Posix AIO API
>>>>> (http://man7.org/linux/man-pages/man7/aio.7.html) 
>>>>> lowers data loss metrics value providing 2x improvement -
>>>>> lowering 98% loss to almost 0%.
>>>>
>>>> Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back 
>>>> to threads anyway when there's no kernel AIO support (which there probably isn't for perf 
>>>> events).
>>>
>>> Explicit threading is surely an option but having more threads 
>>> in the tool that stream performance data is a considerable 
>>> design complication.
>>>
>>> Luckily, glibc AIO implementation is already based on pthreads, 
>>> but having a writing thread for every distinct fd only.
>>
>> My argument is, we don't want to rely on glibc's choices here. They might
>> use a different threading design in the future, or it might differ between
>> libc versions.
>>
>> The basic flow of tracing/profiling data is something we should control explicitly,
>> via explicit threading.
>>
>> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not 
>> inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU 
>> separation, i.e. per CPU event fds, per CPU threads that read and then write into separate 
>> per-CPU files.
> 
> My main request here is that we think about the 'perf top' and 'perf
> trace' workflows as well when working on this, i.e. that we don't take
> for granted that we'll have the perf.data files to work with.

I made manual sanity checks of the perf top and perf trace modes using the 
same matrix multiplication workload. Both modes appear to work after 
applying the patch set.

Regards,
Alexey

> 
> I.e. N threads, that periodically use that FINISHED_ROUND event to order
> events and go on consuming. All of the objects already have refcounts
> and locking to allow for things like decaying of samples to take care of
> trowing away no longer needed objects (struct map, thread, dso, symbol
> tables, etc) to trim memory usage.
> 
> - Arnaldo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-10 14:48       ` Alexey Budankov
@ 2018-09-11  6:35         ` Ingo Molnar
  2018-09-11  8:16           ` Alexey Budankov
  2018-09-11 14:19           ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Ingo Molnar @ 2018-09-11  6:35 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


* Alexey Budankov <alexey.budankov@linux.intel.com> wrote:

> It may sound too optimistic but glibc API is expected to be backward compatible 
> and for POSIX AIO API part too. Internal implementation also tends to evolve to 
> better option overtime, more probably basing on modern kernel capabilities 
> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html

I'm not talking about compatibility, and I'm not just talking about glibc, perf works under 
other libcs as well - and let me phrase it in another way: basic event handling, threading, 
scheduling internals should be a *core competency* of a tracing/profiling tool.

I.e. we might end up using the exact same per event fd thread pool design that glibc uses 
currently. Or not. Having that internal and open coded to perf, like Jiri has started 
implementing it, allows people to experiment with it.

This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large 
systems right now, and I think the design should be open-coded threading, not relying on an 
(perf-)external AIO library to get it right.

The glibc thread pool implementation of POSIX AIO is basically a fall-back 
implementation, for the case where there's no native KAIO interface to rely on.

> Well, explicit threading in the tool for AIO, in the simplest case, means 
> incorporating some POSIX API implementation into the tool, avoiding 
> code reuse in the first place. That tends to be error prone and costly.

It's a core competency, we better do it right and not outsource it.

Please take a look at Jiri's patches (once he re-posts them), I think it's a very good 
starting point.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11  6:35         ` Ingo Molnar
@ 2018-09-11  8:16           ` Alexey Budankov
  2018-09-11  8:34             ` Jiri Olsa
  2018-09-11 14:19           ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Alexey Budankov @ 2018-09-11  8:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel


Hi Ingo,

On 11.09.2018 9:35, Ingo Molnar wrote:
> 
> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> 
>> It may sound too optimistic but glibc API is expected to be backward compatible 
>> and for POSIX AIO API part too. Internal implementation also tends to evolve to 
>> better option overtime, more probably basing on modern kernel capabilities 
>> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> 
> I'm not talking about compatibility, and I'm not just talking about glibc, perf works under 
> other libcs as well - and let me phrase it in another way: basic event handling, threading, 
> scheduling internals should be a *core competency* of a tracing/profiling tool.

Well, the requirement of independence from some specific libc implementation 
as well as *core competency* design approach clarify a lot. Thanks!

> 
> I.e. we might end up using the exact same per event fd thread pool design that glibc uses 
> currently. Or not. Having that internal and open coded to perf, like Jiri has started 
> implementing it, allows people to experiment with it.

My point here is that following some standardized programming models and APIs 
(like POSIX) in the tool code, even if the tool itself provides internal open 
coded implementation for the APIs, would simplify experimenting with the tool 
as well as lower barriers for new comers. Perf project could benefit from that.

> 
> This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large 
> systems right now, and I think the design should be open-coded threading, not relying on an 
> (perf-)external AIO library to get it right.
> 
> The glibc thread pool implementation of POSIX AIO is basically a fall-back 
> implementation, for the case where there's no native KAIO interface to rely on.
> 
>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>> incorporating some POSIX API implementation into the tool, avoiding 
>> code reuse in the first place. That tends to be error prone and costly.
> 
> It's a core competency, we better do it right and not outsource it.

Yep, makes sense.

Thanks!
Alexey

> 
> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good 
> starting point.
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11  8:16           ` Alexey Budankov
@ 2018-09-11  8:34             ` Jiri Olsa
  2018-09-11 13:42               ` Alexey Budankov
  0 siblings, 1 reply; 24+ messages in thread
From: Jiri Olsa @ 2018-09-11  8:34 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
> 
> Hi Ingo,
> 
> On 11.09.2018 9:35, Ingo Molnar wrote:
> > 
> > * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> > 
> >> It may sound too optimistic but glibc API is expected to be backward compatible 
> >> and for POSIX AIO API part too. Internal implementation also tends to evolve to 
> >> better option overtime, more probably basing on modern kernel capabilities 
> >> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> > 
> > I'm not talking about compatibility, and I'm not just talking about glibc, perf works under 
> > other libcs as well - and let me phrase it in another way: basic event handling, threading, 
> > scheduling internals should be a *core competency* of a tracing/profiling tool.
> 
> Well, the requirement of independence from some specific libc implementation 
> as well as *core competency* design approach clarify a lot. Thanks!
> 
> > 
> > I.e. we might end up using the exact same per event fd thread pool design that glibc uses 
> > currently. Or not. Having that internal and open coded to perf, like Jiri has started 
> > implementing it, allows people to experiment with it.
> 
> My point here is that following some standardized programming models and APIs 
> (like POSIX) in the tool code, even if the tool itself provides internal open 
> coded implementation for the APIs, would simplify experimenting with the tool 
> as well as lower barriers for new comers. Perf project could benefit from that.
> 
> > 
> > This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large 
> > systems right now, and I think the design should be open-coded threading, not relying on an 
> > (perf-)external AIO library to get it right.
> > 
> > The glibc thread pool implementation of POSIX AIO is basically a fall-back 
> > implementation, for the case where there's no native KAIO interface to rely on.
> > 
> >> Well, explicit threading in the tool for AIO, in the simplest case, means 
> >> incorporating some POSIX API implementation into the tool, avoiding 
> >> code reuse in the first place. That tends to be error prone and costly.
> > 
> > It's a core competency, we better do it right and not outsource it.
> 
> Yep, makes sense.

on the other hand, we are already trying to tie this up under perf_mmap
object, which is what the threaded patchset operates on.. so I'm quite
confident that with little effort we could make those 2 things live next
to each other and let the user to decide which one to take and compare

possibilities would be like: (not sure yet the last one makes sense, but still..)

  # perf record --threads=...  ...
  # perf record --aio ...
  # perf record --threads=... --aio ...

how about that?

I just rebased the thread patchset, will run some tests (it's been a few months,
so it needs some kicking/checking) and post it out hopefully this week

jirka

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11  8:34             ` Jiri Olsa
@ 2018-09-11 13:42               ` Alexey Budankov
  2018-09-13  8:00                 ` Jiri Olsa
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Budankov @ 2018-09-11 13:42 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

Hi,

On 11.09.2018 11:34, Jiri Olsa wrote:
> On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
>>
>> Hi Ingo,
>>
>> On 11.09.2018 9:35, Ingo Molnar wrote:
>>>
>>> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
>>>
>>>> It may sound too optimistic but glibc API is expected to be backward compatible 
>>>> and for POSIX AIO API part too. Internal implementation also tends to evolve to 
>>>> better option overtime, more probably basing on modern kernel capabilities 
>>>> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
>>>
>>> I'm not talking about compatibility, and I'm not just talking about glibc, perf works under 
>>> other libcs as well - and let me phrase it in another way: basic event handling, threading, 
>>> scheduling internals should be a *core competency* of a tracing/profiling tool.
>>
>> Well, the requirement of independence from some specific libc implementation 
>> as well as *core competency* design approach clarify a lot. Thanks!
>>
>>>
>>> I.e. we might end up using the exact same per event fd thread pool design that glibc uses 
>>> currently. Or not. Having that internal and open coded to perf, like Jiri has started 
>>> implementing it, allows people to experiment with it.
>>
>> My point here is that following some standardized programming models and APIs 
>> (like POSIX) in the tool code, even if the tool itself provides internal open 
>> coded implementation for the APIs, would simplify experimenting with the tool 
>> as well as lower barriers for new comers. Perf project could benefit from that.
>>
>>>
>>> This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large 
>>> systems right now, and I think the design should be open-coded threading, not relying on an 
>>> (perf-)external AIO library to get it right.
>>>
>>> The glibc thread pool implementation of POSIX AIO is basically a fall-back 
>>> implementation, for the case where there's no native KAIO interface to rely on.
>>>
>>>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>>>> incorporating some POSIX API implementation into the tool, avoiding 
>>>> code reuse in the first place. That tends to be error prone and costly.
>>>
>>> It's a core competency, we better do it right and not outsource it.
>>
>> Yep, makes sense.
> 
> on the other hand, we are already trying to tie this up under perf_mmap
> object, which is what the threaded patchset operates on.. so I'm quite
> confident that with little effort we could make those 2 things live next
> to each other and let the user to decide which one to take and compare
> 
> possibilities would be like: (not sure yet the last one makes sense, but still..)
> 
>   # perf record --threads=...  ...
>   # perf record --aio ...
>   # perf record --threads=... --aio ...
> 
> how about that?

That might be an option. What is the semantics of --threads?

Be aware that when experimenting with serial trace writing on 8-core 
client machines running an HPC benchmark heavily utilizing all 8 cores, 
we noticed that the single perf tool thread contended with the benchmark 
threads.

That manifested as libiomp.so (Intel OpenMP implementation) functions 
appearing among the top hotspot functions, which was an indication of 
imbalance induced by the tool during profiling.

That's why we decided to first go with the AIO approach, as posted,
and benefit from it the most through multi AIO, before turning to the more 
resource-consuming multi-threading alternative.

> 
> I just rebased the thread patchset, will run some tests (it's been a few months,
> so it needs some kicking/checking) and post it out hopefully this week
> 
> jirka
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11  6:35         ` Ingo Molnar
  2018-09-11  8:16           ` Alexey Budankov
@ 2018-09-11 14:19           ` Peter Zijlstra
  2018-09-12  8:27             ` Alexey Budankov
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-09-11 14:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Andi Kleen, linux-kernel

On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
> > Well, explicit threading in the tool for AIO, in the simplest case, means 
> > incorporating some POSIX API implementation into the tool, avoiding 
> > code reuse in the first place. That tends to be error prone and costly.
> 
> It's a core competency, we better do it right and not outsource it.
> 
> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good 
> starting point.

There's another reason for doing custom per-cpu threads; it avoids
bouncing the buffer memory around the machine. If the task doing the
buffer reads is the exact same as the one doing the writes, there's less
memory traffic on the interconnects.

Also, I think we can avoid the MFENCE in that case, but I'm not sure
that one is hot enough to bother about on the perf reading side of
things.
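
For reference, the reader-side pattern being discussed pairs an acquire load
of data_head with a release store of data_tail; a simplified consume loop
(assuming records never wrap the buffer edge) might look like:

  #include <linux/perf_event.h>
  #include <stddef.h>

  /* Drain one mmap'ed ring: acquire-load the kernel's write position,
   * emit the complete records, then release-store the new read position.
   * On x86 both map to plain loads/stores plus a compiler barrier, i.e.
   * no MFENCE is needed for this pairing. */
  static void consume_ring(struct perf_event_mmap_page *pg, void *data,
                           size_t mask, void (*emit)(void *rec, size_t len))
  {
          __u64 head = __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
          __u64 tail = pg->data_tail;

          while (tail < head) {
                  struct perf_event_header *hdr =
                          (void *)((char *)data + (tail & mask));

                  emit(hdr, hdr->size);
                  tail += hdr->size;
          }

          __atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
  }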

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11 14:19           ` Peter Zijlstra
@ 2018-09-12  8:27             ` Alexey Budankov
  0 siblings, 0 replies; 24+ messages in thread
From: Alexey Budankov @ 2018-09-12  8:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
	Andi Kleen, linux-kernel


Hi,

On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>>> incorporating some POSIX API implementation into the tool, avoiding 
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a very good 
>> starting point.
> 
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffer allocation, needs to be taken into account by an
effective solution. Luckily, data loss hasn't been observed when testing 
matrix multiplication on 96-core dual-socket machines.

> 
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in hardware, especially at larger scale.

> 

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
  2018-09-11 13:42               ` Alexey Budankov
@ 2018-09-13  8:00                 ` Jiri Olsa
  0 siblings, 0 replies; 24+ messages in thread
From: Jiri Olsa @ 2018-09-13  8:00 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel

On Tue, Sep 11, 2018 at 04:42:09PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 11.09.2018 11:34, Jiri Olsa wrote:
> > On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
> >>
> >> Hi Ingo,
> >>
> >> On 11.09.2018 9:35, Ingo Molnar wrote:
> >>>
> >>> * Alexey Budankov <alexey.budankov@linux.intel.com> wrote:
> >>>
> >>>> It may sound too optimistic but glibc API is expected to be backward compatible 
> >>>> and for POSIX AIO API part too. Internal implementation also tends to evolve to 
> >>>> better option overtime, more probably basing on modern kernel capabilities 
> >>>> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> >>>
> >>> I'm not talking about compatibility, and I'm not just talking about glibc, perf works under 
> >>> other libcs as well - and let me phrase it in another way: basic event handling, threading, 
> >>> scheduling internals should be a *core competency* of a tracing/profiling tool.
> >>
> >> Well, the requirement of independence from some specific libc implementation 
> >> as well as *core competency* design approach clarify a lot. Thanks!
> >>
> >>>
> >>> I.e. we might end up using the exact same per event fd thread pool design that glibc uses 
> >>> currently. Or not. Having that internal and open coded to perf, like Jiri has started 
> >>> implementing it, allows people to experiment with it.
> >>
> >> My point here is that following some standardized programming models and APIs 
> >> (like POSIX) in the tool code, even if the tool itself provides internal open 
> >> coded implementation for the APIs, would simplify experimenting with the tool 
> >> as well as lower barriers for new comers. Perf project could benefit from that.
> >>
> >>>
> >>> This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large 
> >>> systems right now, and I think the design should be open-coded threading, not relying on an 
> >>> (perf-)external AIO library to get it right.
> >>>
> >>> The glibc thread pool implementation of POSIX AIO is basically a fall-back 
> >>> implementation, for the case where there's no native KAIO interface to rely on.
> >>>
> >>>> Well, explicit threading in the tool for AIO, in the simplest case, means 
> >>>> incorporating some POSIX API implementation into the tool, avoiding 
> >>>> code reuse in the first place. That tends to be error prone and costly.
> >>>
> >>> It's a core competency, we better do it right and not outsource it.
> >>
> >> Yep, makes sense.
> > 
> > on the other hand, we are already trying to tie this up under perf_mmap
> > object, which is what the threaded patchset operates on.. so I'm quite
> > confident that with little effort we could make those 2 things live next
> > to each other and let the user to decide which one to take and compare
> > 
> > possibilities would be like: (not sure yet the last one makes sense, but still..)
> > 
> >   # perf record --threads=...  ...
> >   # perf record --aio ...
> >   # perf record --threads=... --aio ...
> > 
> > how about that?
> 
> That might be an option. What is the semantics of --threads?

that's my latest post on this:
  https://marc.info/?l=linux-kernel&m=151551213322861&w=2

working on repost ;-)

jirka

> 
> Be aware that when experimenting with serial trace writing on 8-core 
> client machines running an HPC benchmark heavily utilizing all 8 cores, 
> we noticed that the single perf tool thread contended with the benchmark 
> threads.
> 
> That manifested as libiomp.so (Intel OpenMP implementation) functions 
> appearing among the top hotspot functions, which was an indication of 
> imbalance induced by the tool during profiling.
> 
> That's why we decided to first go with the AIO approach, as posted,
> and benefit from it the most through multi AIO, before turning to the more 
> resource-consuming multi-threading alternative.
> 
> > 
> > I just rebased the thread patchset, will run some tests (it's been a few months,
> > so it needs some kicking/checking) and post it out hopefully this week
> > 
> > jirka
> > 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2018-09-07  7:07 [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
2018-09-07  7:11 ` [PATCH v8 1/3]: perf util: map data buffer for preserving collected data Alexey Budankov
2018-09-07  7:19 ` [PATCH v8 2/3]: perf record: enable asynchronous trace writing Alexey Budankov
2018-09-07  7:39 ` [PATCH v8 3/3]: perf record: extend trace writing to multi AIO Alexey Budankov
2018-09-07  9:34 ` [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
2018-09-10  9:18 ` Ingo Molnar
2018-09-10  9:59   ` Jiri Olsa
2018-09-10 10:03     ` Ingo Molnar
2018-09-10 10:08       ` Jiri Olsa
2018-09-10 10:13         ` Ingo Molnar
2018-09-10 10:23           ` Jiri Olsa
2018-09-10 10:45             ` Alexey Budankov
2018-09-10 10:40   ` Alexey Budankov
2018-09-10 12:06     ` Ingo Molnar
2018-09-10 13:58       ` Arnaldo Carvalho de Melo
2018-09-10 15:19         ` Alexey Budankov
2018-09-10 14:48       ` Alexey Budankov
2018-09-11  6:35         ` Ingo Molnar
2018-09-11  8:16           ` Alexey Budankov
2018-09-11  8:34             ` Jiri Olsa
2018-09-11 13:42               ` Alexey Budankov
2018-09-13  8:00                 ` Jiri Olsa
2018-09-11 14:19           ` Peter Zijlstra
2018-09-12  8:27             ` Alexey Budankov
