linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads
@ 2018-08-23 16:33 Alexey Budankov
  2018-08-23 16:42 ` [PATCH v2 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
  2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-23 16:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen,
	linux-kernel, linux-perf-users

Currently in record mode the tool implements trace writing serially. 
The algorithm loops over mapped per-cpu data buffers and stores ready 
data chunks into a trace file using the write() system call.

Under some circumstances the kernel may lack free space in a buffer 
because one half of the buffer has not yet been written to disk while 
the tool is busy writing data from some other buffer.

Thus the serial trace writing implementation may cause the kernel 
to lose profiling data, and that is what is observed when profiling 
highly parallel CPU bound workloads on machines with a large number 
of cores.

An experiment profiling matrix multiplication code executing 128 
threads on Intel Xeon Phi (KNM) with 272 cores, like below, 
demonstrates a data loss metric value of 98%:

/usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
    --call-graph dwarf,1024 --user-regs=IP,SP,BP \
    --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
    matrix.gcc

The data loss metric is the ratio lost_time/elapsed_time, where 
lost_time is the sum of the time intervals containing PERF_RECORD_LOST 
records and elapsed_time is the elapsed application run time 
under profiling.
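
As an illustration, a minimal sketch of how such a metric could be 
computed; the fixed interval width and the per-interval flags are 
assumptions made for this example, not how the tool derives them:

#include <stdbool.h>
#include <stddef.h>

/* one flag per fixed-width time interval of the profiled run */
static double data_loss_metric(const bool *interval_has_lost,
			       size_t nr_intervals, double interval_width,
			       double elapsed_time)
{
	double lost_time = 0.0;
	size_t i;

	for (i = 0; i < nr_intervals; i++)
		if (interval_has_lost[i]) /* holds a PERF_RECORD_LOST record */
			lost_time += interval_width;

	return lost_time / elapsed_time; /* 0.98 corresponds to 98% loss */
}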

Applying asynchronous trace streaming through the POSIX AIO API 
(http://man7.org/linux/man-pages/man7/aio.7.html) lowers the data loss 
metric value, providing a ~25% improvement on average.
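
For reference, the basic POSIX AIO pattern the series builds on, that 
is, enqueue a write and reap its completion later, looks roughly like 
the following simplified sketch (error handling trimmed; this is not 
the patch code itself):

#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

static int queue_write(struct aiocb *cb, int fd, void *buf,
		       size_t size, off_t off)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf    = buf;
	cb->aio_nbytes = size;
	cb->aio_offset = off;
	cb->aio_sigevent.sigev_notify = SIGEV_NONE;

	return aio_write(cb); /* returns at once, write proceeds in background */
}

static ssize_t reap_write(struct aiocb *cb)
{
	while (aio_error(cb) == EINPROGRESS)
		; /* a real caller would block in aio_suspend() instead */

	return aio_return(cb); /* bytes written, or -1 on error */
}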

Sanity testing:

tools/perf/perf test
 1: vmlinux symtab matches kallsyms                       : Ok
 2: Detect openat syscall event                           : Ok
 3: Detect openat syscall event on all cpus               : Ok
 4: Read samples using the mmap interface                 : Ok
 5: Test data source output                               : Ok
 6: Parse event definition strings                        : Ok
 7: Simple expression parser                              : Ok
 8: PERF_RECORD_* events & perf_sample fields             : Ok
 9: Parse perf pmu format                                 : Ok
10: DSO data read                                         : Ok
11: DSO data cache                                        : Ok
12: DSO data reopen                                       : Ok
13: Roundtrip evsel->name                                 : Ok
14: Parse sched tracepoints fields                        : Ok
15: syscalls:sys_enter_openat event fields                : Ok
16: Setup struct perf_event_attr                          : Skip
17: Match and link multiple hists                         : Ok
18: 'import perf' in python                               : FAILED!
19: Breakpoint overflow signal handler                    : Ok
20: Breakpoint overflow sampling                          : Ok
21: Breakpoint accounting                                 : Ok
22: Number of exit events of a simple workload            : Ok
23: Software clock events period values                   : Ok
24: Object code reading                                   : Ok
25: Sample parsing                                        : Ok
26: Use a dummy software event to keep tracking           : Ok
27: Parse with no sample_id_all bit set                   : Ok
28: Filter hist entries                                   : Ok
29: Lookup mmap thread                                    : Ok
30: Share thread mg                                       : Ok
31: Sort output of hist entries                           : Ok
32: Cumulate child hist entries                           : Ok
33: Track with sched_switch                               : Ok
34: Filter fds with revents mask in a fdarray             : Ok
35: Add fd to a fdarray, making it autogrow               : Ok
36: kmod_path__parse                                      : Ok
37: Thread map                                            : Ok
38: LLVM search and compile                               :
38.1: Basic BPF llvm compile                              : Skip
38.2: kbuild searching                                    : Skip
38.3: Compile source for BPF prologue generation          : Skip
38.4: Compile source for BPF relocation                   : Skip
39: Session topology                                      : Ok
40: BPF filter                                            :
40.1: Basic BPF filtering                                 : Skip
40.2: BPF pinning                                         : Skip
40.3: BPF prologue generation                             : Skip
40.4: BPF relocation checker                              : Skip
41: Synthesize thread map                                 : Ok
42: Remove thread map                                     : Ok
43: Synthesize cpu map                                    : Ok
44: Synthesize stat config                                : Ok
45: Synthesize stat                                       : Ok
46: Synthesize stat round                                 : Ok
47: Synthesize attr update                                : Ok
48: Event times                                           : Ok
49: Read backward ring buffer                             : Ok
50: Print cpu map                                         : Ok
51: Probe SDT events                                      : Ok
52: is_printable_array                                    : Ok
53: Print bitmap                                          : Ok
54: perf hooks                                            : Ok
55: builtin clang support                                 : Skip (not compiled in)
56: unit_number__scnprintf                                : Ok
57: mem2node                                              : Ok
58: x86 rdpmc                                             : Ok
59: Convert perf time to TSC                              : Ok
60: DWARF unwind                                          : Ok
61: x86 instruction decoder - new instructions            : Ok
62: Use vfs_getname probe to get syscall args filenames   : Skip
63: Add vfs_getname probe to get syscall args filenames   : Skip
64: Check open filename arg using perf trace + vfs_getname: Skip
65: probe libc's inet_pton & backtrace it with ping       : FAILED!

---
 Alexey Budankov (2):
      perf util: map data buffer for preserving collected data
      perf record: enable asynchronous trace writing

 tools/perf/builtin-record.c | 113 +++++++++++++++++++++++++++++++++++++++++---
 tools/perf/util/evlist.c    |   7 +++
 tools/perf/util/evlist.h    |   2 +
 tools/perf/util/mmap.c      |  35 +++++++++-----
 tools/perf/util/mmap.h      |   5 +-
 5 files changed, 142 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-23 16:33 [PATCH v2 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
@ 2018-08-23 16:42 ` Alexey Budankov
  2018-08-27  8:28   ` Jiri Olsa
  2018-08-27  8:33   ` Jiri Olsa
  2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  1 sibling, 2 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-23 16:42 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen,
	linux-kernel, linux-perf-users


The data buffer and the accompanying AIO control block are allocated 
in the perf_mmap object, and the mapped data buffer size is equal to 
that of the kernel buffer.

The buffer is then used to preserve profiling data ready for dumping 
and to queue it for asynchronous writing into the perf trace file 
through the implemented record__aio_write() function.

The mmap_aio control structure, whose size equals the number of 
per-cpu kernel buffers, keeps pointers to the enqueued AIO control 
blocks for monitoring of completed AIO operations.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v2:
- converted zalloc() to calloc() for allocation of the mmap_aio array,
- fixed a typo and adjusted the fallback branch code;
---
 tools/perf/builtin-record.c | 18 ++++++++++++++++++
 tools/perf/util/evlist.c    |  7 +++++++
 tools/perf/util/evlist.h    |  2 ++
 tools/perf/util/mmap.c      | 12 ++++++++++++
 tools/perf/util/mmap.h      |  3 +++
 5 files changed, 42 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..a35675e9f3aa 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -53,6 +53,7 @@
 #include <sys/mman.h>
 #include <sys/wait.h>
 #include <linux/time64.h>
+#include <aio.h>
 
 struct switch_output {
 	bool		 enabled;
@@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int record__aio_write(int trace_fd, struct aiocb *cblock,
+		void *buf, size_t size, off_t off)
+{
+	cblock->aio_fildes = trace_fd;
+	cblock->aio_buf    = buf;
+	cblock->aio_nbytes = size;
+	cblock->aio_offset = off;
+	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+	if (aio_write(cblock) == -1) {
+		pr_err("failed to write perf data, error: %m\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 				     union perf_event *event,
 				     struct perf_sample *sample __maybe_unused,
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index e7a4b31a84fb..04596f74766c 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -718,6 +718,8 @@ static void perf_evlist__munmap_nofree(struct perf_evlist *evlist)
 void perf_evlist__munmap(struct perf_evlist *evlist)
 {
 	perf_evlist__munmap_nofree(evlist);
+	if (evlist->mmap_aio)
+		zfree(&evlist->mmap_aio);
 	zfree(&evlist->mmap);
 	zfree(&evlist->overwrite_mmap);
 }
@@ -749,6 +751,11 @@ static struct perf_mmap *perf_evlist__alloc_mmap(struct perf_evlist *evlist,
 		 */
 		refcount_set(&map[i].refcnt, 0);
 	}
+
+	evlist->mmap_aio = calloc(evlist->nr_mmaps, sizeof(struct aiocb *));
+	if (!evlist->mmap_aio)
+		zfree(&map);
+
 	return map;
 }
 
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index dc66436add98..f98b949561fd 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -15,6 +15,7 @@
 #include "util.h"
 #include <signal.h>
 #include <unistd.h>
+#include <aio.h>
 
 struct pollfd;
 struct thread_map;
@@ -43,6 +44,7 @@ struct perf_evlist {
 	} workload;
 	struct fdarray	 pollfd;
 	struct perf_mmap *mmap;
+	struct aiocb	 **mmap_aio;
 	struct perf_mmap *overwrite_mmap;
 	struct thread_map *threads;
 	struct cpu_map	  *cpus;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index fc832676a798..e71d46cb01cc 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+	if (map->data != NULL) {
+		munmap(map->data, perf_mmap__mmap_len(map));
+		map->data = NULL;
+	}
 	if (map->base != NULL) {
 		munmap(map->base, perf_mmap__mmap_len(map));
 		map->base = NULL;
@@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 		map->base = NULL;
 		return -1;
 	}
+	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (map->data == MAP_FAILED) {
+		pr_debug2("failed to mmap perf event data buffer, error %d\n",
+				errno);
+		map->data = NULL;
+		return -1;
+	}
 	map->fd = fd;
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index d82294db1295..1974e621e36b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <asm/barrier.h>
 #include <stdbool.h>
+#include <aio.h>
 #include "auxtrace.h"
 #include "event.h"
 
@@ -25,6 +26,8 @@ struct perf_mmap {
 	bool		 overwrite;
 	struct auxtrace_mmap auxtrace_mmap;
 	char		 event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
+	void 		 *data;
+	struct aiocb	 cblock;
 };
 
 /*

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-23 16:33 [PATCH v2 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
  2018-08-23 16:42 ` [PATCH v2 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
@ 2018-08-23 16:47 ` Alexey Budankov
  2018-08-27  8:34   ` Jiri Olsa
                     ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-23 16:47 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen,
	linux-kernel, linux-perf-users


Trace file offsets are calculated dynamically in the trace streaming 
loop and are different for every enqueued write operation, and the 
written regions don't overlap, so the write requests can proceed in 
parallel.

record__mmap_read_sync() implements a barrier between the rounds of 
spilling ready profiling data to disk.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 97 +++++++++++++++++++++++++++++++++++++++++----
 tools/perf/util/mmap.c      | 23 +++++------
 tools/perf/util/mmap.h      |  2 +-
 3 files changed, 101 insertions(+), 21 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a35675e9f3aa..dee63229ed37 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -132,7 +132,7 @@ static int record__aio_write(int trace_fd, struct aiocb *cblock,
 	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
 
 	if (aio_write(cblock) == -1) {
-		pr_err("failed to write perf data, error: %m\n");
+		pr_err("failed to queue perf data, error: %m\n");
 		return -1;
 	}
 
@@ -148,12 +148,14 @@ static int process_synthesized_event(struct perf_tool *tool,
 	return record__write(rec, event, event->header.size);
 }
 
-static int record__pushfn(void *to, void *bf, size_t size)
+static int record__pushfn(void *to, void *bf, size_t size, off_t off)
 {
 	struct record *rec = to;
+	struct perf_mmap *map = bf;
 
 	rec->samples++;
-	return record__write(rec, bf, size);
+	return record__aio_write(rec->session->data->file.fd, &map->cblock,
+			map->data, size, off);
 }
 
 static volatile int done;
@@ -528,13 +530,85 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
+		int cblocks_size, struct record *rec)
+{
+	size_t rem;
+	ssize_t size;
+	off_t rem_off;
+	int i, aio_ret, aio_errno, do_suspend;
+	struct perf_mmap *md;
+	struct timespec timeout0 = { 0, 0 };
+	struct timespec timeoutS = { 0, 1000 * 1000  * 1 };
+
+	if (!cblocks_size)
+		return 0;
+
+	do {
+		do_suspend = 0;
+		nanosleep(&timeoutS, NULL);
+		if (aio_suspend((const struct aiocb**)cblocks, cblocks_size, &timeout0)) {
+			if (errno == EAGAIN || errno == EINTR) {
+				do_suspend = 1;
+				continue;
+			} else {
+				pr_err("failed to sync perf data, error: %m\n");
+				break;
+			}
+		}
+		for (i = 0; i < cblocks_size; i++) {
+			if (cblocks[i] == NULL) {
+				continue;
+			}
+			aio_errno = aio_error(cblocks[i]);
+			if (aio_errno == EINPROGRESS) {
+				do_suspend = 1;
+				continue;
+			}
+			size = aio_ret = aio_return((struct aiocb*)cblocks[i]);
+			if (aio_ret < 0) {
+				if (aio_errno == EINTR) {
+					size = 0;
+				} else {
+					pr_err("failed to write perf data, error: %m\n");
+					cblocks[i] = NULL;
+					continue;
+				}
+			}
+			rec->bytes_written += size;
+			md = (struct perf_mmap*)((char*)cblocks[i] -
+				offsetof(struct perf_mmap, cblock));
+			rem = cblocks[i]->aio_nbytes - (size_t)size;
+			if (rem == 0) {
+				perf_mmap__put(md);
+				cblocks[i] = NULL;
+				if (switch_output_size(rec))
+					trigger_hit(&switch_output_trigger);
+			} else {
+				rem_off = cblocks[i]->aio_offset +
+					cblocks[i]->aio_nbytes - rem;
+				if (!record__aio_write(trace_fd,
+					(struct aiocb *)cblocks[i],
+					md->data + cblocks[i]->aio_nbytes - rem,
+					rem, rem_off))
+					do_suspend = 1;
+			}
+		}
+	} while (do_suspend);
+
+	return 0;
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
 	u64 bytes_written = rec->bytes_written;
-	int i;
-	int rc = 0;
+	int i, rc = 0;
 	struct perf_mmap *maps;
+	int trace_fd = rec->session->data->file.fd;
+	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
+	int mmap_aio_size = 0;
+	off_t off;
 
 	if (!evlist)
 		return 0;
@@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
 		return 0;
 
+	off = lseek(trace_fd, 0, SEEK_CUR);
+
 	for (i = 0; i < evlist->nr_mmaps; i++) {
 		struct auxtrace_mmap *mm = &maps[i].auxtrace_mmap;
 
 		if (maps[i].base) {
-			if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
-				rc = -1;
+			rc = perf_mmap__push(&maps[i], rec, record__pushfn, &off);
+			if (rc < 0)
 				goto out;
-			}
+			else if (rc > 0)
+				mmap_aio[mmap_aio_size++] = &maps[i].cblock;
 		}
 
 		if (mm->base && !rec->opts.auxtrace_snapshot_mode &&
@@ -563,6 +640,10 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		}
 	}
 
+	record__mmap_read_sync(trace_fd, mmap_aio, mmap_aio_size, rec);
+
+	lseek(trace_fd, off, SEEK_SET);
+
 	/*
 	 * Mark the round finished in case we wrote
 	 * at least one event.
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index e71d46cb01cc..c8b921c88a5d 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -292,11 +292,11 @@ int perf_mmap__read_init(struct perf_mmap *map)
 }
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size))
+		    int push(void *to, void *buf, size_t size, off_t), off_t *off)
 {
 	u64 head = perf_mmap__read_head(md);
 	unsigned char *data = md->base + page_size;
-	unsigned long size;
+	unsigned long size, size0 = 0;
 	void *buf;
 	int rc = 0;
 
@@ -308,23 +308,22 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
 
 	if ((md->start & md->mask) + size != (md->end & md->mask)) {
 		buf = &data[md->start & md->mask];
-		size = md->mask + 1 - (md->start & md->mask);
-		md->start += size;
-
-		if (push(to, buf, size) < 0) {
-			rc = -1;
-			goto out;
-		}
+		size0 = md->mask + 1 - (md->start & md->mask);
+		md->start += size0;
+		memcpy(md->data, buf, size0);
 	}
 
 	buf = &data[md->start & md->mask];
 	size = md->end - md->start;
 	md->start += size;
+	memcpy(md->data + size0, buf, size);
 
-	if (push(to, buf, size) < 0) {
-		rc = -1;
+	rc = push(to, md, size0 + size, *off) < 0 ? -1 : 1;
+	if (rc == -1)
 		goto out;
-	}
+
+	perf_mmap__get(md);
+	*off += size0 + size;
 
 	md->prev = head;
 	perf_mmap__consume(md);
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 1974e621e36b..6211a3a0c4c3 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -95,7 +95,7 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
 union perf_event *perf_mmap__read_event(struct perf_mmap *map);
 
 int perf_mmap__push(struct perf_mmap *md, void *to,
-		    int push(void *to, void *buf, size_t size));
+		    int push(void *to, void *buf, size_t size, off_t off), off_t *off);
 
 size_t perf_mmap__mmap_len(struct perf_mmap *map);


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-23 16:42 ` [PATCH v2 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
@ 2018-08-27  8:28   ` Jiri Olsa
  2018-08-27  8:58     ` Alexey Budankov
  2018-08-27  8:33   ` Jiri Olsa
  1 sibling, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27  8:28 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:
> 
> The data buffer and accompanying AIO control block are allocated at 
> perf_mmap object and the mapped data buffer size is equal to 
> the kernel one.
> 
> The buffer is then used to preserve profiling data ready for dumping 
> and queue it for asynchronous writing into perf trace thru implemented 
> record__aio_write() function.
> 
> mmap_aio control structure of the size equal to the number of per-cpu 
> kernel buffers is used to keep pointers to enqueued AIO control 
> blocks for monitoring of completed AIO operations.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
> Changes in v2:
> - converted zalloc() to calloc() for allocation of mmap_aio array,
> - cleared typo and adjusted fallback branch code;
> ---
> tools/perf/builtin-record.c | 18 ++++++++++++++++++
>  tools/perf/util/evlist.c    |  7 +++++++
>  tools/perf/util/evlist.h    |  2 ++
>  tools/perf/util/mmap.c      | 12 ++++++++++++
>  tools/perf/util/mmap.h      |  3 +++
>  5 files changed, 42 insertions(+)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 22ebeb92ac51..a35675e9f3aa 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -53,6 +53,7 @@
>  #include <sys/mman.h>
>  #include <sys/wait.h>
>  #include <linux/time64.h>
> +#include <aio.h>
>  
>  struct switch_output {
>  	bool		 enabled;
> @@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
>  	return 0;
>  }
>  
> +static int record__aio_write(int trace_fd, struct aiocb *cblock,
> +		void *buf, size_t size, off_t off)
> +{

this breaks bisection:

builtin-record.c:125:12: error: ‘record__aio_write’ defined but not used [-Werror=unused-function]
 static int record__aio_write(int trace_fd, struct aiocb *cblock,

jirka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-23 16:42 ` [PATCH v2 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
  2018-08-27  8:28   ` Jiri Olsa
@ 2018-08-27  8:33   ` Jiri Olsa
  2018-08-27  9:02     ` Alexey Budankov
  1 sibling, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27  8:33 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:

SNIP

> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index fc832676a798..e71d46cb01cc 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
>  
>  void perf_mmap__munmap(struct perf_mmap *map)
>  {
> +	if (map->data != NULL) {
> +		munmap(map->data, perf_mmap__mmap_len(map));
> +		map->data = NULL;
> +	}
>  	if (map->base != NULL) {
>  		munmap(map->base, perf_mmap__mmap_len(map));
>  		map->base = NULL;
> @@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
>  		map->base = NULL;
>  		return -1;
>  	}
> +	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
> +			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

hum, why does map->data need to be mmap-ed?

jirka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
@ 2018-08-27  8:34   ` Jiri Olsa
  2018-08-27  9:12     ` Alexey Budankov
  2018-08-27  8:38   ` Jiri Olsa
  2018-08-27  8:43   ` Jiri Olsa
  2 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27  8:34 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:

SNIP

> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index e71d46cb01cc..c8b921c88a5d 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -292,11 +292,11 @@ int perf_mmap__read_init(struct perf_mmap *map)
>  }
>  
>  int perf_mmap__push(struct perf_mmap *md, void *to,
> -		    int push(void *to, void *buf, size_t size))
> +		    int push(void *to, void *buf, size_t size, off_t), off_t *off)
>  {
>  	u64 head = perf_mmap__read_head(md);
>  	unsigned char *data = md->base + page_size;
> -	unsigned long size;
> +	unsigned long size, size0 = 0;
>  	void *buf;
>  	int rc = 0;
>  
> @@ -308,23 +308,22 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
>  
>  	if ((md->start & md->mask) + size != (md->end & md->mask)) {
>  		buf = &data[md->start & md->mask];
> -		size = md->mask + 1 - (md->start & md->mask);
> -		md->start += size;
> -
> -		if (push(to, buf, size) < 0) {
> -			rc = -1;
> -			goto out;
> -		}
> +		size0 = md->mask + 1 - (md->start & md->mask);
> +		md->start += size0;
> +		memcpy(md->data, buf, size0);
>  	}
>  
>  	buf = &data[md->start & md->mask];
>  	size = md->end - md->start;
>  	md->start += size;
> +	memcpy(md->data + size0, buf, size);

this will need more comments.. and explanation why we copy the data
over to another buffer.. it's interesting, it's still faster 

jirka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  2018-08-27  8:34   ` Jiri Olsa
@ 2018-08-27  8:38   ` Jiri Olsa
  2018-08-27  9:33     ` Alexey Budankov
  2018-08-27  8:43   ` Jiri Olsa
  2 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27  8:38 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:

SNIP

>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
>  				    bool overwrite)
>  {
>  	u64 bytes_written = rec->bytes_written;
> -	int i;
> -	int rc = 0;
> +	int i, rc = 0;
>  	struct perf_mmap *maps;
> +	int trace_fd = rec->session->data->file.fd;
> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
> +	int mmap_aio_size = 0;
> +	off_t off;
>  
>  	if (!evlist)
>  		return 0;
> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>  		return 0;
>  
> +	off = lseek(trace_fd, 0, SEEK_CUR);
> +

with async write, do we need to query/set the offset like this
all the time?

could we just keep/update the offset value in the 'struct perf_data_file'
and skip both lseek calls?

jirka

SNIP

>  
> +	record__mmap_read_sync(trace_fd, mmap_aio, mmap_aio_size, rec);
> +
> +	lseek(trace_fd, off, SEEK_SET);
> +

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
  2018-08-27  8:34   ` Jiri Olsa
  2018-08-27  8:38   ` Jiri Olsa
@ 2018-08-27  8:43   ` Jiri Olsa
  2018-08-27  9:45     ` Alexey Budankov
  2 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27  8:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:

SNIP

>  
>  static volatile int done;
> @@ -528,13 +530,85 @@ static struct perf_event_header finished_round_event = {
>  	.type = PERF_RECORD_FINISHED_ROUND,
>  };
>  
> +static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
> +		int cblocks_size, struct record *rec)
> +{
> +	size_t rem;
> +	ssize_t size;
> +	off_t rem_off;
> +	int i, aio_ret, aio_errno, do_suspend;
> +	struct perf_mmap *md;
> +	struct timespec timeout0 = { 0, 0 };
> +	struct timespec timeoutS = { 0, 1000 * 1000  * 1 };
> +
> +	if (!cblocks_size)
> +		return 0;
> +
> +	do {
> +		do_suspend = 0;
> +		nanosleep(&timeoutS, NULL);
> +		if (aio_suspend((const struct aiocb**)cblocks, cblocks_size, &timeout0)) {
> +			if (errno == EAGAIN || errno == EINTR) {
> +				do_suspend = 1;
> +				continue;
> +			} else {
> +				pr_err("failed to sync perf data, error: %m\n");
> +				break;
> +			}
> +		}
> +		for (i = 0; i < cblocks_size; i++) {

it looks like we could set up the async write to receive the signal
with the user pointer (sigev_value.sival_ptr), which would allow us
to get the finished descriptor right away and we wouldn't need
to iterate over all of them and check on them
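
something like this, just a sketch, the callback bookkeeping is made up:

/* hypothetical completion callback, runs in a glibc AIO helper thread */
static void record__aio_complete(union sigval sv)
{
	struct perf_mmap *md = sv.sival_ptr; /* the finished descriptor */

	/* mark md's cblock as completed for the sync code to pick up */
}

	/* at enqueue time, instead of SIGEV_NONE: */
	cblock->aio_sigevent.sigev_notify          = SIGEV_THREAD;
	cblock->aio_sigevent.sigev_notify_function = record__aio_complete;
	cblock->aio_sigevent.sigev_value.sival_ptr = md;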

jirka
> +			if (cblocks[i] == NULL) {
> +				continue;
> +			}
> +			aio_errno = aio_error(cblocks[i]);
> +			if (aio_errno == EINPROGRESS) {
> +				do_suspend = 1;
> +				continue;
> +			}

SNIP

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-27  8:28   ` Jiri Olsa
@ 2018-08-27  8:58     ` Alexey Budankov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27  8:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi Jiri,

On 27.08.2018 11:28, Jiri Olsa wrote:
> On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:
>>
>> The data buffer and accompanying AIO control block are allocated at 
>> perf_mmap object and the mapped data buffer size is equal to 
>> the kernel one.
>>
>> The buffer is then used to preserve profiling data ready for dumping 
>> and queue it for asynchronous writing into perf trace thru implemented 
>> record__aio_write() function.
>>
>> mmap_aio control structure of the size equal to the number of per-cpu 
>> kernel buffers is used to keep pointers to enqueued AIO control 
>> blocks for monitoring of completed AIO operations.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>> Changes in v2:
>> - converted zalloc() to calloc() for allocation of mmap_aio array,
>> - cleared typo and adjusted fallback branch code;
>> ---
>> tools/perf/builtin-record.c | 18 ++++++++++++++++++
>>  tools/perf/util/evlist.c    |  7 +++++++
>>  tools/perf/util/evlist.h    |  2 ++
>>  tools/perf/util/mmap.c      | 12 ++++++++++++
>>  tools/perf/util/mmap.h      |  3 +++
>>  5 files changed, 42 insertions(+)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 22ebeb92ac51..a35675e9f3aa 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -53,6 +53,7 @@
>>  #include <sys/mman.h>
>>  #include <sys/wait.h>
>>  #include <linux/time64.h>
>> +#include <aio.h>
>>  
>>  struct switch_output {
>>  	bool		 enabled;
>> @@ -121,6 +122,23 @@ static int record__write(struct record *rec, void *bf, size_t size)
>>  	return 0;
>>  }
>>  
>> +static int record__aio_write(int trace_fd, struct aiocb *cblock,
>> +		void *buf, size_t size, off_t off)
>> +{
> 
> this breaks bisection:
> 
> builtin-record.c:125:12: error: ‘record__aio_write’ defined but not used [-Werror=unused-function]
>  static int record__aio_write(int trace_fd, struct aiocb *cblock,

Moving it to [PATCH v3 2/2]. Thanks!

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-27  8:33   ` Jiri Olsa
@ 2018-08-27  9:02     ` Alexey Budankov
  2018-08-28  8:45       ` Jiri Olsa
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27  9:02 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi,

On 27.08.2018 11:33, Jiri Olsa wrote:
> On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
>> index fc832676a798..e71d46cb01cc 100644
>> --- a/tools/perf/util/mmap.c
>> +++ b/tools/perf/util/mmap.c
>> @@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
>>  
>>  void perf_mmap__munmap(struct perf_mmap *map)
>>  {
>> +	if (map->data != NULL) {
>> +		munmap(map->data, perf_mmap__mmap_len(map));
>> +		map->data = NULL;
>> +	}
>>  	if (map->base != NULL) {
>>  		munmap(map->base, perf_mmap__mmap_len(map));
>>  		map->base = NULL;
>> @@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
>>  		map->base = NULL;
>>  		return -1;
>>  	}
>> +	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
>> +			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> hum, why does map->data need to be mmap-ed?

The same way as for the kernel buffers. If you see a better alternative, it could be applied.

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27  8:34   ` Jiri Olsa
@ 2018-08-27  9:12     ` Alexey Budankov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27  9:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi,

On 27.08.2018 11:34, Jiri Olsa wrote:
> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
>> index e71d46cb01cc..c8b921c88a5d 100644
>> --- a/tools/perf/util/mmap.c
>> +++ b/tools/perf/util/mmap.c
>> @@ -292,11 +292,11 @@ int perf_mmap__read_init(struct perf_mmap *map)
>>  }
>>  
>>  int perf_mmap__push(struct perf_mmap *md, void *to,
>> -		    int push(void *to, void *buf, size_t size))
>> +		    int push(void *to, void *buf, size_t size, off_t), off_t *off)
>>  {
>>  	u64 head = perf_mmap__read_head(md);
>>  	unsigned char *data = md->base + page_size;
>> -	unsigned long size;
>> +	unsigned long size, size0 = 0;
>>  	void *buf;
>>  	int rc = 0;
>>  
>> @@ -308,23 +308,22 @@ int perf_mmap__push(struct perf_mmap *md, void *to,
>>  
>>  	if ((md->start & md->mask) + size != (md->end & md->mask)) {
>>  		buf = &data[md->start & md->mask];
>> -		size = md->mask + 1 - (md->start & md->mask);
>> -		md->start += size;
>> -
>> -		if (push(to, buf, size) < 0) {
>> -			rc = -1;
>> -			goto out;
>> -		}
>> +		size0 = md->mask + 1 - (md->start & md->mask);
>> +		md->start += size0;
>> +		memcpy(md->data, buf, size0);
>>  	}
>>  
>>  	buf = &data[md->start & md->mask];
>>  	size = md->end - md->start;
>>  	md->start += size;
>> +	memcpy(md->data + size0, buf, size);
> 
> this will need more comments.. and explanation why we copy the data
> over to another buffer.. it's interesting, it's still faster 

Sure. Comments will follow in v3. 
We copy the data into a buffer to release space in the kernel buffer 
as fast as possible. That lets the kernel store more data while the 
tool is still handling the other per-cpu buffers.
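
Roughly, the flow in perf_mmap__push()/record__pushfn() is (a 
simplified sketch; buf_wrapped/buf_linear stand for the two segments 
of the kernel ring buffer):

/* drain the kernel ring buffer into the private md->data copy */
memcpy(md->data, buf_wrapped, size0);       /* wrapped tail part, if any */
memcpy(md->data + size0, buf_linear, size); /* linear part */
md->start += size0 + size;
perf_mmap__consume(md);                     /* kernel may reuse the space */
/* queue the copy for writing; disk I/O overlaps with further profiling */
record__aio_write(trace_fd, &md->cblock, md->data, size0 + size, off);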

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27  8:38   ` Jiri Olsa
@ 2018-08-27  9:33     ` Alexey Budankov
  2018-08-27 10:05       ` Namhyung Kim
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27  9:33 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi,

On 27.08.2018 11:38, Jiri Olsa wrote:
> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
>>  				    bool overwrite)
>>  {
>>  	u64 bytes_written = rec->bytes_written;
>> -	int i;
>> -	int rc = 0;
>> +	int i, rc = 0;
>>  	struct perf_mmap *maps;
>> +	int trace_fd = rec->session->data->file.fd;
>> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
>> +	int mmap_aio_size = 0;
>> +	off_t off;
>>  
>>  	if (!evlist)
>>  		return 0;
>> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>>  		return 0;
>>  
>> +	off = lseek(trace_fd, 0, SEEK_CUR);
>> +
> 
> with async write, do we need to query/set the offset like this
> all the time?

It looks like we need it this way. Internally, glibc AIO implements the 
writes using the pwrite64 syscall in our case. The syscall requires the 
offset as a parameter and doesn't update the file position on completion.
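
A standalone demo of that behavior (not patch code, just an 
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/pwrite-demo", O_CREAT | O_WRONLY | O_TRUNC, 0600);

	pwrite(fd, "data", 4, 128); /* writes at offset 128 ... */
	/* ... but the file position is still 0: */
	printf("pos: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
	close(fd);

	return 0;
}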

> 
> could we just keep/update the offset value in the 'struct perf_data_file'
> and skip both lseek calls?

I don't see how that is possible. The offset is different for every 
enqueued write operation and the write areas don't intersect for the 
whole writing loop. To know the final file position it is required 
to iterate through the loop.

> 
> jirka
> 
> SNIP
> 
>>  
>> +	record__mmap_read_sync(trace_fd, mmap_aio, mmap_aio_size, rec);
>> +
>> +	lseek(trace_fd, off, SEEK_SET);
>> +
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27  8:43   ` Jiri Olsa
@ 2018-08-27  9:45     ` Alexey Budankov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27  9:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi,

On 27.08.2018 11:43, Jiri Olsa wrote:
> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>>  
>>  static volatile int done;
>> @@ -528,13 +530,85 @@ static struct perf_event_header finished_round_event = {
>>  	.type = PERF_RECORD_FINISHED_ROUND,
>>  };
>>  
>> +static int record__mmap_read_sync(int trace_fd, struct aiocb **cblocks,
>> +		int cblocks_size, struct record *rec)
>> +{
>> +	size_t rem;
>> +	ssize_t size;
>> +	off_t rem_off;
>> +	int i, aio_ret, aio_errno, do_suspend;
>> +	struct perf_mmap *md;
>> +	struct timespec timeout0 = { 0, 0 };
>> +	struct timespec timeoutS = { 0, 1000 * 1000  * 1 };
>> +
>> +	if (!cblocks_size)
>> +		return 0;
>> +
>> +	do {
>> +		do_suspend = 0;
>> +		nanosleep(&timeoutS, NULL);
>> +		if (aio_suspend((const struct aiocb**)cblocks, cblocks_size, &timeout0)) {
>> +			if (errno == EAGAIN || errno == EINTR) {
>> +				do_suspend = 1;
>> +				continue;
>> +			} else {
>> +				pr_err("failed to sync perf data, error: %m\n");
>> +				break;
>> +			}
>> +		}
>> +		for (i = 0; i < cblocks_size; i++) {
> 
> it looks like we could set up the async write to receive the signal
> with the user pointer (sigev_value.sival_ptr), which would allow us
> to get the finished descriptor right away and we wouldn't need
> to iterate all of them and checking on them

Yep. This mechanism is provided by the AIO API, but we still need this 
kind of synchronizing barrier to avoid memory races on the mmap->data 
buffer between iterations of the loop in record__mmap_read_evlist().

> 
> jirka
>> +			if (cblocks[i] == NULL) {
>> +				continue;
>> +			}
>> +			aio_errno = aio_error(cblocks[i]);
>> +			if (aio_errno == EINPROGRESS) {
>> +				do_suspend = 1;
>> +				continue;
>> +			}
> 
> SNIP
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27  9:33     ` Alexey Budankov
@ 2018-08-27 10:05       ` Namhyung Kim
  2018-08-27 10:25         ` Alexey Budankov
  0 siblings, 1 reply; 19+ messages in thread
From: Namhyung Kim @ 2018-08-27 10:05 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, linux-kernel, linux-perf-users,
	kernel-team

Hello,

On Mon, Aug 27, 2018 at 12:33:07PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 27.08.2018 11:38, Jiri Olsa wrote:
> > On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
> > 
> > SNIP
> > 
> >>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
> >>  				    bool overwrite)
> >>  {
> >>  	u64 bytes_written = rec->bytes_written;
> >> -	int i;
> >> -	int rc = 0;
> >> +	int i, rc = 0;
> >>  	struct perf_mmap *maps;
> >> +	int trace_fd = rec->session->data->file.fd;
> >> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
> >> +	int mmap_aio_size = 0;
> >> +	off_t off;
> >>  
> >>  	if (!evlist)
> >>  		return 0;
> >> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> >>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
> >>  		return 0;
> >>  
> >> +	off = lseek(trace_fd, 0, SEEK_CUR);
> >> +
> > 
> > with async write, do we need to query/set the offset like this
> > all the time?
> 
> It looks like we need it this way. Internally glibc AIO implements writes 
> using pwrite64 syscall in our case. The sycall requires offset as a parameter 
> and doesn't update file position on the completion.
> 
> > 
> > could we just keep/update the offset value in the 'struct perf_data_file'
> > and skip both lseek calls?
> 
> Don't see how it is possible. offset is different for every enqeued write 
> operation and write areas don't intersect for the whole writing loop. 
> To know the final file position it is required to iterate thru 
> the loop.

But as far as I can see the offset is linearly updated in
perf_mmap__push() and I guess those two lseek() calls will return
a same value as the last updated offset, no?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 10:05       ` Namhyung Kim
@ 2018-08-27 10:25         ` Alexey Budankov
  2018-08-27 10:38           ` Jiri Olsa
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27 10:25 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Andi Kleen, linux-kernel, linux-perf-users,
	kernel-team

Hi Namhyung,

On 27.08.2018 13:05, Namhyung Kim wrote:
> Hello,
> 
> On Mon, Aug 27, 2018 at 12:33:07PM +0300, Alexey Budankov wrote:
>> Hi,
>>
>> On 27.08.2018 11:38, Jiri Olsa wrote:
>>> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
>>>
>>> SNIP
>>>
>>>>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
>>>>  				    bool overwrite)
>>>>  {
>>>>  	u64 bytes_written = rec->bytes_written;
>>>> -	int i;
>>>> -	int rc = 0;
>>>> +	int i, rc = 0;
>>>>  	struct perf_mmap *maps;
>>>> +	int trace_fd = rec->session->data->file.fd;
>>>> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
>>>> +	int mmap_aio_size = 0;
>>>> +	off_t off;
>>>>  
>>>>  	if (!evlist)
>>>>  		return 0;
>>>> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>>>>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>>>>  		return 0;
>>>>  
>>>> +	off = lseek(trace_fd, 0, SEEK_CUR);
>>>> +
>>>
>>> with async write, do we need to query/set the offset like this
>>> all the time?
>>
>> It looks like we need it this way. Internally glibc AIO implements writes 
>> using pwrite64 syscall in our case. The sycall requires offset as a parameter 
>> and doesn't update file position on the completion.
>>
>>>
>>> could we just keep/update the offset value in the 'struct perf_data_file'
>>> and skip both lseek calls?
>>
>> Don't see how it is possible. offset is different for every enqeued write 
>> operation and write areas don't intersect for the whole writing loop. 
>> To know the final file position it is required to iterate thru 
>> the loop.
> 
> But as far as I can see the offset is linearly updated in
> perf_mmap__push() and I guess those two lseek() calls will return
> a same value as the last updated offset, no?

Yes, the offset is linearly calculated by the perf_mmap__push() code 
for the next possible write operation, but the file position is updated 
by the kernel only in the second lseek() syscall after the loop. 
The first lseek() syscall reads that file position for 
the next loop iterations.

Regards,
Alexey

> 
> Thanks,
> Namhyung
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 10:25         ` Alexey Budankov
@ 2018-08-27 10:38           ` Jiri Olsa
  2018-08-27 10:48             ` Alexey Budankov
  0 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-27 10:38 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
	linux-kernel, linux-perf-users, kernel-team

On Mon, Aug 27, 2018 at 01:25:35PM +0300, Alexey Budankov wrote:
> Hi Namhyung,
> 
> On 27.08.2018 13:05, Namhyung Kim wrote:
> > Hello,
> > 
> > On Mon, Aug 27, 2018 at 12:33:07PM +0300, Alexey Budankov wrote:
> >> Hi,
> >>
> >> On 27.08.2018 11:38, Jiri Olsa wrote:
> >>> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
> >>>
> >>> SNIP
> >>>
> >>>>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
> >>>>  				    bool overwrite)
> >>>>  {
> >>>>  	u64 bytes_written = rec->bytes_written;
> >>>> -	int i;
> >>>> -	int rc = 0;
> >>>> +	int i, rc = 0;
> >>>>  	struct perf_mmap *maps;
> >>>> +	int trace_fd = rec->session->data->file.fd;
> >>>> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
> >>>> +	int mmap_aio_size = 0;
> >>>> +	off_t off;
> >>>>  
> >>>>  	if (!evlist)
> >>>>  		return 0;
> >>>> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
> >>>>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
> >>>>  		return 0;
> >>>>  
> >>>> +	off = lseek(trace_fd, 0, SEEK_CUR);
> >>>> +
> >>>
> >>> with async write, do we need to query/set the offset like this
> >>> all the time?
> >>
> >> It looks like we need it this way. Internally glibc AIO implements writes 
> >> using pwrite64 syscall in our case. The sycall requires offset as a parameter 
> >> and doesn't update file position on the completion.
> >>
> >>>
> >>> could we just keep/update the offset value in the 'struct perf_data_file'
> >>> and skip both lseek calls?
> >>
> >> Don't see how it is possible. offset is different for every enqeued write 
> >> operation and write areas don't intersect for the whole writing loop. 
> >> To know the final file position it is required to iterate thru 
> >> the loop.
> > 
> > But as far as I can see the offset is linearly updated in
> > perf_mmap__push() and I guess those two lseek() calls will return
> > a same value as the last updated offset, no?
> 
> Yes, offset is linearly calculated by perf_mmap__push() code for 
> the next possible write operation, but file position is update by 
> the kernel only in the second lseek() syscall after the loop. 
> The first lseek() syscall reads that file position for 
> the next loop iterations.

does the file's offset need to get updated with lseek at all?
the async write gets offset.. so I'd think as long as we keep
offset value, we don't need to call lseek at all

jirka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2]: perf record: enable asynchronous trace writing
  2018-08-27 10:38           ` Jiri Olsa
@ 2018-08-27 10:48             ` Alexey Budankov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-27 10:48 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Namhyung Kim, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
	linux-kernel, linux-perf-users, kernel-team


On 27.08.2018 13:38, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 01:25:35PM +0300, Alexey Budankov wrote:
>> Hi Namhyung,
>>
>> On 27.08.2018 13:05, Namhyung Kim wrote:
>>> Hello,
>>>
>>> On Mon, Aug 27, 2018 at 12:33:07PM +0300, Alexey Budankov wrote:
>>>> Hi,
>>>>
>>>> On 27.08.2018 11:38, Jiri Olsa wrote:
>>>>> On Thu, Aug 23, 2018 at 07:47:01PM +0300, Alexey Budankov wrote:
>>>>>
>>>>> SNIP
>>>>>
>>>>>>  static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
>>>>>>  				    bool overwrite)
>>>>>>  {
>>>>>>  	u64 bytes_written = rec->bytes_written;
>>>>>> -	int i;
>>>>>> -	int rc = 0;
>>>>>> +	int i, rc = 0;
>>>>>>  	struct perf_mmap *maps;
>>>>>> +	int trace_fd = rec->session->data->file.fd;
>>>>>> +	struct aiocb **mmap_aio = rec->evlist->mmap_aio;
>>>>>> +	int mmap_aio_size = 0;
>>>>>> +	off_t off;
>>>>>>  
>>>>>>  	if (!evlist)
>>>>>>  		return 0;
>>>>>> @@ -546,14 +620,17 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>>>>>>  	if (overwrite && evlist->bkw_mmap_state != BKW_MMAP_DATA_PENDING)
>>>>>>  		return 0;
>>>>>>  
>>>>>> +	off = lseek(trace_fd, 0, SEEK_CUR);
>>>>>> +
>>>>>
>>>>> with async write, do we need to query/set the offset like this
>>>>> all the time?
>>>>
>>>> It looks like we need it this way. Internally glibc AIO implements writes 
>>>> using pwrite64 syscall in our case. The sycall requires offset as a parameter 
>>>> and doesn't update file position on the completion.
>>>>
>>>>>
>>>>> could we just keep/update the offset value in the 'struct perf_data_file'
>>>>> and skip both lseek calls?
>>>>
>>>> Don't see how it is possible. offset is different for every enqeued write 
>>>> operation and write areas don't intersect for the whole writing loop. 
>>>> To know the final file position it is required to iterate thru 
>>>> the loop.
>>>
>>> But as far as I can see the offset is linearly updated in
>>> perf_mmap__push() and I guess those two lseek() calls will return
>>> a same value as the last updated offset, no?
>>
>> Yes, offset is linearly calculated by perf_mmap__push() code for 
>> the next possible write operation, but file position is update by 
>> the kernel only in the second lseek() syscall after the loop. 
>> The first lseek() syscall reads that file position for 
>> the next loop iterations.
> 
> does the file's offset need to get updated with lseek at all?
> the async write gets offset.. so I'd think as long as we keep
> offset value, we don't need to call lseek at all

Yep, I expected the same from the pwrite64() syscall, which eventually 
gets this offset, but that is not the case.

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-27  9:02     ` Alexey Budankov
@ 2018-08-28  8:45       ` Jiri Olsa
  2018-08-28  9:19         ` Alexey Budankov
  0 siblings, 1 reply; 19+ messages in thread
From: Jiri Olsa @ 2018-08-28  8:45 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

On Mon, Aug 27, 2018 at 12:02:35PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 27.08.2018 11:33, Jiri Olsa wrote:
> > On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:
> > 
> > SNIP
> > 
> >> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> >> index fc832676a798..e71d46cb01cc 100644
> >> --- a/tools/perf/util/mmap.c
> >> +++ b/tools/perf/util/mmap.c
> >> @@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
> >>  
> >>  void perf_mmap__munmap(struct perf_mmap *map)
> >>  {
> >> +	if (map->data != NULL) {
> >> +		munmap(map->data, perf_mmap__mmap_len(map));
> >> +		map->data = NULL;
> >> +	}
> >>  	if (map->base != NULL) {
> >>  		munmap(map->base, perf_mmap__mmap_len(map));
> >>  		map->base = NULL;
> >> @@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
> >>  		map->base = NULL;
> >>  		return -1;
> >>  	}
> >> +	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
> >> +			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 
> > hum, why does map->data need to be mmap-ed?
> 
> The same way as for kernel buffers. If you see better alternatives it could be applied.

I meant why not just allocate them with malloc?

jirka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2]: perf util: map data buffer for preserving collected data
  2018-08-28  8:45       ` Jiri Olsa
@ 2018-08-28  9:19         ` Alexey Budankov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Budankov @ 2018-08-28  9:19 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-kernel,
	linux-perf-users

Hi,

On 28.08.2018 11:45, Jiri Olsa wrote:
> On Mon, Aug 27, 2018 at 12:02:35PM +0300, Alexey Budankov wrote:
>> Hi,
>>
>> On 27.08.2018 11:33, Jiri Olsa wrote:
>>> On Thu, Aug 23, 2018 at 07:42:09PM +0300, Alexey Budankov wrote:
>>>
>>> SNIP
>>>
>>>> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
>>>> index fc832676a798..e71d46cb01cc 100644
>>>> --- a/tools/perf/util/mmap.c
>>>> +++ b/tools/perf/util/mmap.c
>>>> @@ -155,6 +155,10 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
>>>>  
>>>>  void perf_mmap__munmap(struct perf_mmap *map)
>>>>  {
>>>> +	if (map->data != NULL) {
>>>> +		munmap(map->data, perf_mmap__mmap_len(map));
>>>> +		map->data = NULL;
>>>> +	}
>>>>  	if (map->base != NULL) {
>>>>  		munmap(map->base, perf_mmap__mmap_len(map));
>>>>  		map->base = NULL;
>>>> @@ -190,6 +194,14 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
>>>>  		map->base = NULL;
>>>>  		return -1;
>>>>  	}
>>>> +	map->data = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ | PROT_WRITE,
>>>> +			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>
>>> hum, why does map->data need to be mmap-ed?
>>
>> The same way as for kernel buffers. If you see better alternatives it could be applied.
> 
> I meant why not just allocate them with mmaloc?

Yep. Using malloc()/free() makes the implementation shorter. Included in [PATCH v4 1/2].

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-08-28  9:19 UTC | newest]

Thread overview: 19+ messages
-- links below jump to the message on this page --
2018-08-23 16:33 [PATCH v2 0/2]: perf: reduce data loss when profiling highly parallel CPU bound workloads Alexey Budankov
2018-08-23 16:42 ` [PATCH v2 1/2]: perf util: map data buffer for preserving collected data Alexey Budankov
2018-08-27  8:28   ` Jiri Olsa
2018-08-27  8:58     ` Alexey Budankov
2018-08-27  8:33   ` Jiri Olsa
2018-08-27  9:02     ` Alexey Budankov
2018-08-28  8:45       ` Jiri Olsa
2018-08-28  9:19         ` Alexey Budankov
2018-08-23 16:47 ` [PATCH v2 2/2]: perf record: enable asynchronous trace writing Alexey Budankov
2018-08-27  8:34   ` Jiri Olsa
2018-08-27  9:12     ` Alexey Budankov
2018-08-27  8:38   ` Jiri Olsa
2018-08-27  9:33     ` Alexey Budankov
2018-08-27 10:05       ` Namhyung Kim
2018-08-27 10:25         ` Alexey Budankov
2018-08-27 10:38           ` Jiri Olsa
2018-08-27 10:48             ` Alexey Budankov
2018-08-27  8:43   ` Jiri Olsa
2018-08-27  9:45     ` Alexey Budankov
