* [PATCH v20 0/5] Introducing trace buffer mapping by user-space
@ 2024-04-06 17:36 Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP Vincent Donnefort
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort

The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v19 -> v20:
  * Fix typos in documentation.
  * Remove useless mmap open and fault callbacks.
  * Add mm.h include for vm_insert_pages.

v18 -> v19:
  * Use VM_PFNMAP and vm_insert_pages
  * Allocate ring-buffer subbufs with __GFP_COMP
  * Pad the meta-page with the zero-page to align on the subbuf_order
  * Extend the ring-buffer test with mmap() dedicated suite

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
    mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
    for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * On unmap, sync the meta-page teardown with the reader_lock instead of
    synchronize_rcu().
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
    (intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
    rb_update_meta_page().
  * rb_update_meta_page() is now called from
    ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587604@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
    in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
       - pages_lost / pages_read (allows computing how full the
         ring-buffer is)
       - read (allows computing how many entries can be read)
       - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
    rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
    ** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (5):
  ring-buffer: allocate sub-buffers with __GFP_COMP
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst                 |   1 +
 Documentation/trace/ring-buffer-map.rst       | 106 +++++
 include/linux/ring_buffer.h                   |   6 +
 include/uapi/linux/trace_mmap.h               |  48 +++
 kernel/trace/ring_buffer.c                    | 403 +++++++++++++++++-
 kernel/trace/trace.c                          | 113 ++++-
 kernel/trace/trace.h                          |   1 +
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config    |   2 +
 .../testing/selftests/ring-buffer/map_test.c  | 302 +++++++++++++
 10 files changed, 979 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 create mode 100644 tools/testing/selftests/ring-buffer/config
 create mode 100644 tools/testing/selftests/ring-buffer/map_test.c


base-commit: 7604256cecef34a82333d9f78262d3180f4eb525
-- 
2.44.0.478.gd926399ef9-goog


* [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
@ 2024-04-06 17:36 ` Vincent Donnefort
  2024-04-10 17:36   ` Steven Rostedt
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort

In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..cc9ebe593571 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, pages);
 
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					mflags | __GFP_ZERO,
+					mflags | __GFP_COMP | __GFP_ZERO,
 					cpu_buffer->buffer->subbuf_order);
 		if (!page)
 			goto free_pages;
@@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 
 	cpu_buffer->reader_page = bpage;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
 				cpu_buffer->buffer->subbuf_order);
 	if (!page)
 		goto fail_free_reader;
@@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 		goto out;
 
 	page = alloc_pages_node(cpu_to_node(cpu),
-				GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
+				GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO,
 				cpu_buffer->buffer->subbuf_order);
 	if (!page) {
 		kfree(bpage);
-- 
2.44.0.478.gd926399ef9-goog


* [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP Vincent Donnefort
@ 2024-04-06 17:36 ` Vincent Donnefort
  2024-04-09 20:12   ` kernel test robot
                     ` (3 more replies)
  2024-04-06 17:36 ` [PATCH v20 3/5] tracing: Allow user-space mapping of the ring-buffer Vincent Donnefort
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort, linux-mm

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, as they are currently
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page, followed by each subbuf of the
ring-buffer, ordered by their unique ID, which is assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.
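
As a rough illustration of the intended usage (patch 3/5 wires this up
for trace_pipe_raw), a file's mmap() handler can hand its VMA to
ring_buffer_map() and release it again from the VMA close callback. The
names below are hypothetical and the VM-flag checks done by the real
caller are omitted:

  static void my_buf_vm_close(struct vm_area_struct *vma)
  {
          struct my_buf_ref *ref = vma->vm_file->private_data;

          ring_buffer_unmap(ref->buffer, ref->cpu);
  }

  static const struct vm_operations_struct my_buf_vmops = {
          .close  = my_buf_vm_close,
  };

  static int my_buf_mmap(struct file *filp, struct vm_area_struct *vma)
  {
          struct my_buf_ref *ref = filp->private_data;

          vma->vm_ops = &my_buf_vmops;

          return ring_buffer_map(ref->buffer, ref->cpu, vma);
  }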

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include <linux/seq_file.h>
 #include <linux/poll.h>
 
+#include <uapi/linux/trace_mmap.h>
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
 #define trace_rb_cpu_prepare	NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+		    struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index 000000000000..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include <linux/types.h>
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:	Size of this meta-page.
+ * @meta_struct_len:	Size of this structure.
+ * @subbuf_size:	Size of each sub-buffer.
+ * @nr_subbufs:		Number of subbufs in the ring-buffer, including the reader.
+ * @reader.lost_events:	Number of events lost at the time of the reader swap.
+ * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
+ * @reader.read:	Number of bytes read on the reader subbuf.
+ * @flags:		Placeholder for now, 0 until new features are supported.
+ * @entries:		Number of entries in the ring-buffer.
+ * @overrun:		Number of entries lost in the ring-buffer.
+ * @read:		Number of entries that have been read.
+ * @Reserved1:		Reserved for future use.
+ * @Reserved2:		Reserved for future use.
+ */
+struct trace_buffer_meta {
+	__u32		meta_page_size;
+	__u32		meta_struct_len;
+
+	__u32		subbuf_size;
+	__u32		nr_subbufs;
+
+	struct {
+		__u64	lost_events;
+		__u32	id;
+		__u32	read;
+	} reader;
+
+	__u64	flags;
+
+	__u64	entries;
+	__u64	overrun;
+	__u64	read;
+
+	__u64	Reserved1;
+	__u64	Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include <linux/ring_buffer.h>
 #include <linux/trace_clock.h>
 #include <linux/sched/clock.h>
+#include <linux/cacheflush.h>
 #include <linux/trace_seq.h>
 #include <linux/spinlock.h>
 #include <linux/irq_work.h>
@@ -26,6 +27,7 @@
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/oom.h>
+#include <linux/mm.h>
 
 #include <asm/local64.h>
 #include <asm/local.h>
@@ -338,6 +340,7 @@ struct buffer_page {
 	local_t		 entries;	/* entries on this page */
 	unsigned long	 real_end;	/* real end of data */
 	unsigned	 order;		/* order of the page */
+	u32		 id;		/* ID for external mapping */
 	struct buffer_data_page *page;	/* Actual data page */
 };
 
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
 	u64				read_stamp;
 	/* pages removed since last reset */
 	unsigned long			pages_removed;
+
+	unsigned int			mapped;
+	struct mutex			mapping_lock;
+	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
+	struct trace_buffer_meta	*meta_page;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
 	struct list_head		new_pages; /* new pages to add */
@@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+	mutex_init(&cpu_buffer->mapping_lock);
 
 	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
 			    GFP_KERNEL, cpu_to_node(cpu));
@@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
 	return buffer->time_stamp_abs;
 }
 
-static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
-
 static inline unsigned long rb_page_entries(struct buffer_page *bpage)
 {
 	return local_read(&bpage->entries) & RB_WRITE_MASK;
@@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
 	page->read = 0;
 }
 
+static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
+
+	meta->reader.read = cpu_buffer->reader_page->read;
+	meta->reader.id = cpu_buffer->reader_page->id;
+	meta->reader.lost_events = cpu_buffer->lost_events;
+
+	meta->entries = local_read(&cpu_buffer->entries);
+	meta->overrun = local_read(&cpu_buffer->overrun);
+	meta->read = cpu_buffer->read;
+
+	/* Some archs do not have data cache coherency between kernel and user-space */
+	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
+}
+
 static void
 rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 {
@@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 	cpu_buffer->lost_events = 0;
 	cpu_buffer->last_overrun = 0;
 
+	if (cpu_buffer->mapped)
+		rb_update_meta_page(cpu_buffer);
+
 	rb_head_page_activate(cpu_buffer);
 	cpu_buffer->pages_removed = 0;
 }
@@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
 	cpu_buffer_a = buffer_a->buffers[cpu];
 	cpu_buffer_b = buffer_b->buffers[cpu];
 
+	/* It's up to the callers to not try to swap mapped buffers */
+	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	/* At least make sure the two buffers are somewhat the same */
 	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
 		goto out;
@@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 	 * Otherwise, we can simply swap the page with the one passed in.
 	 */
 	if (read || (len < (commit - read)) ||
-	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
+	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
+	    cpu_buffer->mapped) {
 		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
 		unsigned int rpos = read;
 		unsigned int pos = 0;
@@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 
 		cpu_buffer = buffer->buffers[cpu];
 
+		if (cpu_buffer->mapped) {
+			err = -EBUSY;
+			goto error;
+		}
+
 		/* Update the number of pages to match the new size */
 		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
 		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
@@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 }
 EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
 
+static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct page *page;
+
+	if (cpu_buffer->meta_page)
+		return 0;
+
+	page = alloc_page(GFP_USER | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+
+	cpu_buffer->meta_page = page_to_virt(page);
+
+	return 0;
+}
+
+static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
+
+	free_page(addr);
+	cpu_buffer->meta_page = NULL;
+}
+
+static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
+				   unsigned long *subbuf_ids)
+{
+	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
+	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
+	struct buffer_page *first_subbuf, *subbuf;
+	int id = 0;
+
+	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
+	cpu_buffer->reader_page->id = id++;
+
+	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
+	do {
+		if (WARN_ON(id >= nr_subbufs))
+			break;
+
+		subbuf_ids[id] = (unsigned long)subbuf->page;
+		subbuf->id = id;
+
+		rb_inc_page(&subbuf);
+		id++;
+	} while (subbuf != first_subbuf);
+
+	/* install subbuf ID to kern VA translation */
+	cpu_buffer->subbuf_ids = subbuf_ids;
+
+	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
+	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
+	meta->meta_struct_len = sizeof(*meta);
+	meta->nr_subbufs = nr_subbufs;
+	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+
+	rb_update_meta_page(cpu_buffer);
+}
+
+static struct ring_buffer_per_cpu *
+rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return ERR_PTR(-EINVAL);
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	mutex_lock(&cpu_buffer->mapping_lock);
+
+	if (!cpu_buffer->mapped) {
+		mutex_unlock(&cpu_buffer->mapping_lock);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return cpu_buffer;
+}
+
+static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	mutex_unlock(&cpu_buffer->mapping_lock);
+}
+
+/*
+ * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
+ * to be set-up or torn-down.
+ */
+static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
+			       bool inc)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	if (inc && cpu_buffer->mapped == UINT_MAX)
+		return -EBUSY;
+
+	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+		return -EINVAL;
+
+	mutex_lock(&cpu_buffer->buffer->mutex);
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+	if (inc)
+		cpu_buffer->mapped++;
+	else
+		cpu_buffer->mapped--;
+
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	mutex_unlock(&cpu_buffer->buffer->mutex);
+
+	return 0;
+}
+
+#define subbuf_page(off, start) \
+	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
+
+#define foreach_subbuf_page(sub_order, start, page)		\
+	page = subbuf_page(0, (start));				\
+	for (int __off = 0; __off < (1 << (sub_order));		\
+	     __off++, page = subbuf_page(__off, (start)))
+
+/*
+ *   +--------------+  pgoff == 0
+ *   |   meta page  |
+ *   +--------------+  pgoff == 1
+ *   |   000000000  |
+ *   +--------------+  pgoff == (1 << subbuf_order)
+ *   | subbuffer 0  |
+ *   |              |
+ *   +--------------+  pgoff == (2 * (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |              |
+ *         ...
+ */
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+			struct vm_area_struct *vma)
+{
+	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+	unsigned int subbuf_pages, subbuf_order;
+	struct page **pages;
+	int p = 0, s = 0;
+	int err;
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	subbuf_order = cpu_buffer->buffer->subbuf_order;
+	subbuf_pages = 1 << subbuf_order;
+
+	if (subbuf_order && pgoff % subbuf_pages)
+		return -EINVAL;
+
+	nr_subbufs = cpu_buffer->nr_pages + 1;
+	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
+
+	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (!vma_pages || vma_pages > nr_pages)
+		return -EINVAL;
+
+	nr_pages = vma_pages;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (!pgoff) {
+		unsigned long meta_page_padding;
+
+		pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+		/*
+		 * Pad with the zero-page to align the meta-page with the
+		 * sub-buffers.
+		 */
+		meta_page_padding = subbuf_pages - 1;
+		while (meta_page_padding-- && p < nr_pages)
+			pages[p++] = ZERO_PAGE(0);
+	} else {
+		/* Skip the meta-page */
+		pgoff -= subbuf_pages;
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page;
+
+		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}
+
+	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
+
+	kfree(pages);
+
+	return err;
+}
+
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+		    struct vm_area_struct *vma)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long flags, *subbuf_ids;
+	int err = 0;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return -EINVAL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	mutex_lock(&cpu_buffer->mapping_lock);
+
+	if (cpu_buffer->mapped) {
+		err = __rb_map_vma(cpu_buffer, vma);
+		if (!err)
+			err = __rb_inc_dec_mapped(cpu_buffer, true);
+		mutex_unlock(&cpu_buffer->mapping_lock);
+		return err;
+	}
+
+	/* prevent another thread from changing buffer/sub-buffer sizes */
+	mutex_lock(&buffer->mutex);
+
+	err = rb_alloc_meta_page(cpu_buffer);
+	if (err)
+		goto unlock;
+
+	/* subbuf_ids include the reader while nr_pages does not */
+	subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), GFP_KERNEL);
+	if (!subbuf_ids) {
+		rb_free_meta_page(cpu_buffer);
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	atomic_inc(&cpu_buffer->resize_disabled);
+
+	/*
+	 * Lock all readers to block any subbuf swap until the subbuf IDs are
+	 * assigned.
+	 */
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+	rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+
+	err = __rb_map_vma(cpu_buffer, vma);
+	if (!err) {
+		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+		cpu_buffer->mapped = 1;
+		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	} else {
+		kfree(cpu_buffer->subbuf_ids);
+		cpu_buffer->subbuf_ids = NULL;
+		rb_free_meta_page(cpu_buffer);
+	}
+unlock:
+	mutex_unlock(&buffer->mutex);
+	mutex_unlock(&cpu_buffer->mapping_lock);
+
+	return err;
+}
+
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long flags;
+	int err = 0;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return -EINVAL;
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	mutex_lock(&cpu_buffer->mapping_lock);
+
+	if (!cpu_buffer->mapped) {
+		err = -ENODEV;
+		goto out;
+	} else if (cpu_buffer->mapped > 1) {
+		__rb_inc_dec_mapped(cpu_buffer, false);
+		goto out;
+	}
+
+	mutex_lock(&buffer->mutex);
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+	cpu_buffer->mapped = 0;
+
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+
+	kfree(cpu_buffer->subbuf_ids);
+	cpu_buffer->subbuf_ids = NULL;
+	rb_free_meta_page(cpu_buffer);
+	atomic_dec(&cpu_buffer->resize_disabled);
+
+	mutex_unlock(&buffer->mutex);
+out:
+	mutex_unlock(&cpu_buffer->mapping_lock);
+
+	return err;
+}
+
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long reader_size;
+	unsigned long flags;
+
+	cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
+	if (IS_ERR(cpu_buffer))
+		return (int)PTR_ERR(cpu_buffer);
+
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+consume:
+	if (rb_per_cpu_empty(cpu_buffer))
+		goto out;
+
+	reader_size = rb_page_size(cpu_buffer->reader_page);
+
+	/*
+	 * There are data to be read on the current reader page, we can
+	 * return to the caller. But before that, we assume the latter will read
+	 * everything. Let's update the kernel reader accordingly.
+	 */
+	if (cpu_buffer->reader_page->read < reader_size) {
+		while (cpu_buffer->reader_page->read < reader_size)
+			rb_advance_reader(cpu_buffer);
+		goto out;
+	}
+
+	if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+		goto out;
+
+	goto consume;
+out:
+	/* Some archs do not have data cache coherency between kernel and user-space */
+	flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
+
+	rb_update_meta_page(cpu_buffer);
+
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	rb_put_mapped_buffer(cpu_buffer);
+
+	return 0;
+}
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was in
-- 
2.44.0.478.gd926399ef9-goog


* [PATCH v20 3/5] tracing: Allow user-space mapping of the ring-buffer
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
@ 2024-04-06 17:36 ` Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 4/5] Documentation: tracing: Add ring-buffer mapping Vincent Donnefort
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort, linux-mm

Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
     ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader.id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the meta-page reader ID to
point to the next reader containing unread data.
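
For reference, a minimal consumer loop over this interface could look as
follows (fd being an open trace_pipe_raw descriptor; declarations and
error handling are omitted, and a complete, buildable example is part of
the Documentation patch of this series):

  meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
  data = mmap(NULL, meta->subbuf_size * meta->nr_subbufs, PROT_READ,
              MAP_SHARED, fd, meta->meta_page_size);

  while (!done) {
          /* Waits for data (unless O_NONBLOCK), then swaps in a reader
             subbuf containing unread data */
          if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
                  break;

          reader = data + meta->reader.id * meta->subbuf_size;
          /* parse the events found in the reader subbuf */
  }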

Mapping will prevent snapshot and buffer size modifications.

CC: <linux-mm@kvack.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER		_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..83194bf7b1df 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}
 
+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);
@@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr)
 	lockdep_assert_held(&trace_types_lock);
 
 	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
 		spin_unlock(&tr->snapshot_trigger_lock);
 		return -EBUSY;
 	}
@@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == &nop_trace)
 		return;
-	
+
 	tr->current_trace->enabled--;
 
 	if (tr->current_trace->reset)
@@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
+	int err;
 
-	if (cmd)
-		return -ENOIOCTLCMD;
+	if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+		if (!(file->f_flags & O_NONBLOCK)) {
+			err = ring_buffer_wait(iter->array_buffer->buffer,
+					       iter->cpu_file,
+					       iter->tr->buffer_percent,
+					       NULL, NULL);
+			if (err)
+				return err;
+		}
 
+		return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+						  iter->cpu_file);
+	} else if (cmd) {
+		return -ENOTTY;
+	}
+
+	/*
+	 * An ioctl call with cmd 0 to the ring buffer file will wake up all
+	 * waiters
+	 */
 	mutex_lock(&trace_types_lock);
 
 	/* Make sure the waiters see the new wait_index */
@@ -8214,6 +8237,85 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }
 
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+	int err = 0;
+
+	/*
+	 * Called with mmap_lock held. lockdep would be unhappy if we would now
+	 * take trace_types_lock. Instead use the specific
+	 * snapshot_trigger_lock.
+	 */
+	spin_lock(&tr->snapshot_trigger_lock);
+
+	if (tr->snapshot || tr->mapped == UINT_MAX)
+		err = -EBUSY;
+	else
+		tr->mapped++;
+
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	/* Wait for update_max_tr() to observe iter->tr->mapped */
+	if (tr->mapped == 1)
+		synchronize_rcu();
+
+	return err;
+
+}
+static void put_snapshot_map(struct trace_array *tr)
+{
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->mapped))
+		tr->mapped--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+}
+#else
+static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
+static inline void put_snapshot_map(struct trace_array *tr) { }
+#endif
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+
+	WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file));
+	put_snapshot_map(iter->tr);
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+	.close		= tracing_buffers_mmap_close,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	struct ftrace_buffer_info *info = filp->private_data;
+	struct trace_iterator *iter = &info->iter;
+	int ret = 0;
+
+	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+	    !(vma->vm_flags & VM_MAYSHARE))
+		return -EPERM;
+
+	vm_flags_mod(vma,
+		     VM_MIXEDMAP | VM_PFNMAP |
+		     VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
+		     VM_MAYWRITE);
+
+	vma->vm_ops = &tracing_buffers_vmops;
+
+	ret = get_snapshot_map(iter->tr);
+	if (ret)
+		return ret;
+
+	ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file, vma);
+	if (ret)
+		put_snapshot_map(iter->tr);
+
+	return ret;
+}
+
 static const struct file_operations tracing_buffers_fops = {
 	.open		= tracing_buffers_open,
 	.read		= tracing_buffers_read,
@@ -8223,6 +8325,7 @@ static const struct file_operations tracing_buffers_fops = {
 	.splice_read	= tracing_buffers_splice_read,
 	.unlocked_ioctl = tracing_buffers_ioctl,
 	.llseek		= no_llseek,
+	.mmap		= tracing_buffers_mmap,
 };
 
 static ssize_t
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 64450615ca0c..749a182dab48 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -336,6 +336,7 @@ struct trace_array {
 	bool			allocated_snapshot;
 	spinlock_t		snapshot_trigger_lock;
 	unsigned int		snapshot;
+	unsigned int		mapped;
 	unsigned long		max_latency;
 #ifdef CONFIG_FSNOTIFY
 	struct dentry		*d_max_latency;
-- 
2.44.0.478.gd926399ef9-goog


* [PATCH v20 4/5] Documentation: tracing: Add ring-buffer mapping
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
                   ` (2 preceding siblings ...)
  2024-04-06 17:36 ` [PATCH v20 3/5] tracing: Allow user-space mapping of the ring-buffer Vincent Donnefort
@ 2024-04-06 17:36 ` Vincent Donnefort
  2024-04-06 17:36 ` [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test Vincent Donnefort
  2024-04-10 17:56 ` [PATCH v20 0/5] Introducing trace buffer mapping by user-space Steven Rostedt
  5 siblings, 0 replies; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort

It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
    timerlat-tracer
    intel_th
    ring-buffer-design
+   ring-buffer-map
    stm
    sys-t
    coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index 000000000000..8e296bcc0d7f
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort <vdonnefort@google.com>
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data,
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer of that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and a
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+        reader_id = meta->reader.id;
+        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or the
+subbuf size). It is also not possible to use snapshot, and splice will copy the
+ring-buffer data instead of using the copyless swap from the ring-buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers on
+trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+        #include <fcntl.h>
+        #include <stdio.h>
+        #include <stdlib.h>
+        #include <unistd.h>
+
+        #include <linux/trace_mmap.h>
+
+        #include <sys/mman.h>
+        #include <sys/ioctl.h>
+
+        #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+        int main(void)
+        {
+                int page_size = getpagesize(), fd, reader_id;
+                unsigned long meta_len, data_len;
+                struct trace_buffer_meta *meta;
+                void *map, *reader, *data;
+
+                fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+                if (fd < 0)
+                        exit(EXIT_FAILURE);
+
+                map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+                if (map == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                meta = (struct trace_buffer_meta *)map;
+                meta_len = meta->meta_page_size;
+
+                printf("entries:        %llu\n", meta->entries);
+                printf("overrun:        %llu\n", meta->overrun);
+                printf("read:           %llu\n", meta->read);
+                printf("nr_subbufs:     %u\n", meta->nr_subbufs);
+
+                data_len = meta->subbuf_size * meta->nr_subbufs;
+                data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+                if (data == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+                        exit(EXIT_FAILURE);
+
+                reader_id = meta->reader.id;
+                reader = data + meta->subbuf_size * reader_id;
+
+                printf("Current reader address: %p\n", reader);
+
+                munmap(data, data_len);
+                munmap(meta, meta_len);
+                close(fd);
+
+                return 0;
+        }
-- 
2.44.0.478.gd926399ef9-goog


* [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
                   ` (3 preceding siblings ...)
  2024-04-06 17:36 ` [PATCH v20 4/5] Documentation: tracing: Add ring-buffer mapping Vincent Donnefort
@ 2024-04-06 17:36 ` Vincent Donnefort
  2024-04-06 20:46   ` Muhammad Usama Anjum
  2024-04-10 17:56 ` [PATCH v20 0/5] Introducing trace buffer mapping by user-space Steven Rostedt
  5 siblings, 1 reply; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-06 17:36 UTC (permalink / raw)
  To: rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, Vincent Donnefort,
	Shuah Khan, Shuah Khan, linux-kselftest

This test maps a ring-buffer and validates the meta-page after a reset
and after emitting a few events.

Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-kselftest@vger.kernel.org
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

diff --git a/tools/testing/selftests/ring-buffer/Makefile b/tools/testing/selftests/ring-buffer/Makefile
new file mode 100644
index 000000000000..627c5fa6d1ab
--- /dev/null
+++ b/tools/testing/selftests/ring-buffer/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wl,-no-as-needed -Wall
+CFLAGS += $(KHDR_INCLUDES)
+CFLAGS += -D_GNU_SOURCE
+
+TEST_GEN_PROGS = map_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/ring-buffer/config b/tools/testing/selftests/ring-buffer/config
new file mode 100644
index 000000000000..d936f8f00e78
--- /dev/null
+++ b/tools/testing/selftests/ring-buffer/config
@@ -0,0 +1,2 @@
+CONFIG_FTRACE=y
+CONFIG_TRACER_SNAPSHOT=y
diff --git a/tools/testing/selftests/ring-buffer/map_test.c b/tools/testing/selftests/ring-buffer/map_test.c
new file mode 100644
index 000000000000..50a09e1371d4
--- /dev/null
+++ b/tools/testing/selftests/ring-buffer/map_test.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ring-buffer memory mapping tests
+ *
+ * Copyright (c) 2024 Vincent Donnefort <vdonnefort@google.com>
+ */
+#include <fcntl.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/trace_mmap.h>
+
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+
+#include "../user_events/user_events_selftests.h" /* share tracefs setup */
+#include "../kselftest_harness.h"
+
+#define TRACEFS_ROOT "/sys/kernel/tracing"
+
+static int __tracefs_write(const char *path, const char *value)
+{
+	int fd, ret;
+
+	fd = open(path, O_WRONLY | O_TRUNC);
+	if (fd < 0)
+		return fd;
+
+	ret = write(fd, value, strlen(value));
+
+	close(fd);
+
+	return ret == -1 ? -errno : 0;
+}
+
+static int __tracefs_write_int(const char *path, int value)
+{
+	char *str;
+	int ret;
+
+	if (asprintf(&str, "%d", value) < 0)
+		return -1;
+
+	ret = __tracefs_write(path, str);
+
+	free(str);
+
+	return ret;
+}
+
+#define tracefs_write_int(path, value) \
+	ASSERT_EQ(__tracefs_write_int((path), (value)), 0)
+
+#define tracefs_write(path, value) \
+	ASSERT_EQ(__tracefs_write((path), (value)), 0)
+
+static int tracefs_reset(void)
+{
+	if (__tracefs_write_int(TRACEFS_ROOT"/tracing_on", 0))
+		return -1;
+	if (__tracefs_write(TRACEFS_ROOT"/trace", ""))
+		return -1;
+	if (__tracefs_write(TRACEFS_ROOT"/set_event", ""))
+		return -1;
+	if (__tracefs_write(TRACEFS_ROOT"/current_tracer", "nop"))
+		return -1;
+
+	return 0;
+}
+
+struct tracefs_cpu_map_desc {
+	struct trace_buffer_meta	*meta;
+	int				cpu_fd;
+};
+
+int tracefs_cpu_map(struct tracefs_cpu_map_desc *desc, int cpu)
+{
+	int page_size = getpagesize();
+	char *cpu_path;
+	void *map;
+
+	if (asprintf(&cpu_path,
+		     TRACEFS_ROOT"/per_cpu/cpu%d/trace_pipe_raw",
+		     cpu) < 0)
+		return -ENOMEM;
+
+	desc->cpu_fd = open(cpu_path, O_RDONLY | O_NONBLOCK);
+	free(cpu_path);
+	if (desc->cpu_fd < 0)
+		return -ENODEV;
+
+	map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, desc->cpu_fd, 0);
+	if (map == MAP_FAILED)
+		return -errno;
+
+	desc->meta = (struct trace_buffer_meta *)map;
+
+	return 0;
+}
+
+void tracefs_cpu_unmap(struct tracefs_cpu_map_desc *desc)
+{
+	munmap(desc->meta, desc->meta->meta_page_size);
+	close(desc->cpu_fd);
+}
+
+FIXTURE(map) {
+	struct tracefs_cpu_map_desc	map_desc;
+	bool				umount;
+};
+
+FIXTURE_VARIANT(map) {
+	int	subbuf_size;
+};
+
+FIXTURE_VARIANT_ADD(map, subbuf_size_4k) {
+	.subbuf_size = 4,
+};
+
+FIXTURE_VARIANT_ADD(map, subbuf_size_8k) {
+	.subbuf_size = 8,
+};
+
+FIXTURE_SETUP(map)
+{
+	int cpu = sched_getcpu();
+	cpu_set_t cpu_mask;
+	bool fail, umount;
+	char *message;
+
+	if (!tracefs_enabled(&message, &fail, &umount)) {
+		if (fail) {
+			TH_LOG("Tracefs setup failed: %s", message);
+			ASSERT_FALSE(fail);
+		}
+		SKIP(return, "Skipping: %s", message);
+	}
+
+	self->umount = umount;
+
+	ASSERT_GE(cpu, 0);
+
+	ASSERT_EQ(tracefs_reset(), 0);
+
+	tracefs_write_int(TRACEFS_ROOT"/buffer_subbuf_size_kb", variant->subbuf_size);
+
+	ASSERT_EQ(tracefs_cpu_map(&self->map_desc, cpu), 0);
+
+	/*
+	 * Ensure generated events will be found on this very same ring-buffer.
+	 */
+	CPU_ZERO(&cpu_mask);
+	CPU_SET(cpu, &cpu_mask);
+	ASSERT_EQ(sched_setaffinity(0, sizeof(cpu_mask), &cpu_mask), 0);
+}
+
+FIXTURE_TEARDOWN(map)
+{
+	tracefs_reset();
+
+	if (self->umount)
+		tracefs_unmount();
+
+	tracefs_cpu_unmap(&self->map_desc);
+}
+
+TEST_F(map, meta_page_check)
+{
+	struct tracefs_cpu_map_desc *desc = &self->map_desc;
+	int cnt = 0;
+
+	ASSERT_EQ(desc->meta->entries, 0);
+	ASSERT_EQ(desc->meta->overrun, 0);
+	ASSERT_EQ(desc->meta->read, 0);
+
+	ASSERT_EQ(desc->meta->reader.id, 0);
+	ASSERT_EQ(desc->meta->reader.read, 0);
+
+	ASSERT_EQ(ioctl(desc->cpu_fd, TRACE_MMAP_IOCTL_GET_READER), 0);
+	ASSERT_EQ(desc->meta->reader.id, 0);
+
+	tracefs_write_int(TRACEFS_ROOT"/tracing_on", 1);
+	for (int i = 0; i < 16; i++)
+		tracefs_write_int(TRACEFS_ROOT"/trace_marker", i);
+again:
+	ASSERT_EQ(ioctl(desc->cpu_fd, TRACE_MMAP_IOCTL_GET_READER), 0);
+
+	ASSERT_EQ(desc->meta->entries, 16);
+	ASSERT_EQ(desc->meta->overrun, 0);
+	ASSERT_EQ(desc->meta->read, 16);
+
+	ASSERT_EQ(desc->meta->reader.id, 1);
+
+	if (!(cnt++))
+		goto again;
+}
+
+TEST_F(map, data_mmap)
+{
+	struct tracefs_cpu_map_desc *desc = &self->map_desc;
+	unsigned long meta_len, data_len;
+	void *data;
+
+	meta_len = desc->meta->meta_page_size;
+	data_len = desc->meta->subbuf_size * desc->meta->nr_subbufs;
+
+	/* Map all the available subbufs */
+	data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
+		    desc->cpu_fd, meta_len);
+	ASSERT_NE(data, MAP_FAILED);
+	munmap(data, data_len);
+
+	/* Map all the available subbufs - 1 */
+	data_len -= desc->meta->subbuf_size;
+	data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
+		    desc->cpu_fd, meta_len);
+	ASSERT_NE(data, MAP_FAILED);
+	munmap(data, data_len);
+
+	/* Overflow the available subbufs by 1 */
+	meta_len += desc->meta->subbuf_size * 2;
+	data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
+		    desc->cpu_fd, meta_len);
+	ASSERT_EQ(data, MAP_FAILED);
+
+	/* Verify meta-page padding */
+	if (desc->meta->meta_page_size > getpagesize()) {
+		void *addr;
+
+		data_len = desc->meta->meta_page_size;
+		data = mmap(NULL, data_len,
+			    PROT_READ, MAP_SHARED, desc->cpu_fd, 0);
+		ASSERT_NE(data, MAP_FAILED);
+
+		addr = (void *)((unsigned long)data + getpagesize());
+		ASSERT_EQ(*((int *)addr), 0);
+		munmap(data, data_len);
+	}
+}
+
+FIXTURE(snapshot) {
+	bool	umount;
+};
+
+FIXTURE_SETUP(snapshot)
+{
+	bool fail, umount;
+	struct stat sb;
+	char *message;
+
+	if (stat(TRACEFS_ROOT"/snapshot", &sb))
+		SKIP(return, "Skipping: %s", "snapshot not available");
+
+	if (!tracefs_enabled(&message, &fail, &umount)) {
+		if (fail) {
+			TH_LOG("Tracefs setup failed: %s", message);
+			ASSERT_FALSE(fail);
+		}
+		SKIP(return, "Skipping: %s", message);
+	}
+
+	self->umount = umount;
+}
+
+FIXTURE_TEARDOWN(snapshot)
+{
+	__tracefs_write(TRACEFS_ROOT"/events/sched/sched_switch/trigger",
+			"!snapshot");
+	tracefs_reset();
+
+	if (self->umount)
+		tracefs_unmount();
+}
+
+TEST_F(snapshot, excludes_map)
+{
+	struct tracefs_cpu_map_desc map_desc;
+	int cpu = sched_getcpu();
+
+	ASSERT_GE(cpu, 0);
+	tracefs_write(TRACEFS_ROOT"/events/sched/sched_switch/trigger",
+		      "snapshot");
+	ASSERT_EQ(tracefs_cpu_map(&map_desc, cpu), -EBUSY);
+}
+
+TEST_F(snapshot, excluded_by_map)
+{
+	struct tracefs_cpu_map_desc map_desc;
+	int cpu = sched_getcpu();
+
+	ASSERT_EQ(tracefs_cpu_map(&map_desc, cpu), 0);
+
+	ASSERT_EQ(__tracefs_write(TRACEFS_ROOT"/events/sched/sched_switch/trigger",
+				  "snapshot"), -EBUSY);
+	ASSERT_EQ(__tracefs_write(TRACEFS_ROOT"/snapshot",
+				  "1"), -EBUSY);
+}
+
+TEST_HARNESS_MAIN
-- 
2.44.0.478.gd926399ef9-goog


* Re: [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test
  2024-04-06 17:36 ` [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test Vincent Donnefort
@ 2024-04-06 20:46   ` Muhammad Usama Anjum
  0 siblings, 0 replies; 23+ messages in thread
From: Muhammad Usama Anjum @ 2024-04-06 20:46 UTC (permalink / raw)
  To: Vincent Donnefort, rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: Muhammad Usama Anjum, mathieu.desnoyers, kernel-team, rdunlap,
	Shuah Khan, Shuah Khan, linux-kselftest

On 4/6/24 10:36 PM, Vincent Donnefort wrote:
> This test maps a ring-buffer and validates the meta-page after a reset
> and after emitting a few events.
> 
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Shuah Khan <skhan@linuxfoundation.org>
> Cc: linux-kselftest@vger.kernel.org
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

> 
> diff --git a/tools/testing/selftests/ring-buffer/Makefile b/tools/testing/selftests/ring-buffer/Makefile
> new file mode 100644
> index 000000000000..627c5fa6d1ab
> --- /dev/null
> +++ b/tools/testing/selftests/ring-buffer/Makefile
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +CFLAGS += -Wl,-no-as-needed -Wall
> +CFLAGS += $(KHDR_INCLUDES)
> +CFLAGS += -D_GNU_SOURCE
> +
> +TEST_GEN_PROGS = map_test
> +
> +include ../lib.mk
Please add a .gitignore file with map_test in it.


-- 
BR,
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
@ 2024-04-09 20:12   ` kernel test robot
  2024-04-10 17:43   ` Steven Rostedt
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2024-04-09 20:12 UTC (permalink / raw)
  To: Vincent Donnefort; +Cc: oe-kbuild-all

Hi Vincent,

kernel test robot noticed the following build errors:

[auto build test ERROR on 7604256cecef34a82333d9f78262d3180f4eb525]

url:    https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-allocate-sub-buffers-with-__GFP_COMP/20240407-013951
base:   7604256cecef34a82333d9f78262d3180f4eb525
patch link:    https://lore.kernel.org/r/20240406173649.3210836-3-vdonnefort%40google.com
patch subject: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
config: riscv-randconfig-r122-20240409 (https://download.01.org/0day-ci/archive/20240410/202404100309.am1Aq6wU-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20240410/202404100309.am1Aq6wU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404100309.am1Aq6wU-lkp@intel.com/

All errors (new ones prefixed by >>):

   riscv64-linux-ld: kernel/trace/ring_buffer.o: in function `rb_update_meta_page':
>> kernel/trace/ring_buffer.c:5227:(.text+0x1cce): undefined reference to `vm_insert_pages'


vim +5227 kernel/trace/ring_buffer.c

  5221	
  5222	static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
  5223	{
  5224		struct trace_buffer_meta *meta = cpu_buffer->meta_page;
  5225	
  5226		meta->reader.read = cpu_buffer->reader_page->read;
> 5227		meta->reader.id = cpu_buffer->reader_page->id;
  5228		meta->reader.lost_events = cpu_buffer->lost_events;
  5229	
  5230		meta->entries = local_read(&cpu_buffer->entries);
  5231		meta->overrun = local_read(&cpu_buffer->overrun);
  5232		meta->read = cpu_buffer->read;
  5233	
  5234		/* Some archs do not have data cache coherency between kernel and user-space */
  5235		flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
  5236	}
  5237	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP
  2024-04-06 17:36 ` [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP Vincent Donnefort
@ 2024-04-10 17:36   ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2024-04-10 17:36 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: mhiramat, linux-kernel, linux-trace-kernel, mathieu.desnoyers,
	kernel-team, rdunlap


Hi Vincent,

Thanks for sending this. Nit: Subject should start with a capital:

  ring-buffer: Allocate sub-buffers with __GFP_COMP

-- Steve


On Sat,  6 Apr 2024 18:36:45 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> In preparation for the ring-buffer memory mapping, allocate compound
> pages for the ring-buffer sub-buffers to enable us to map them to
> user-space with vm_insert_pages().
> 
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 25476ead681b..cc9ebe593571 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
>  		list_add(&bpage->list, pages);
>  
>  		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
> -					mflags | __GFP_ZERO,
> +					mflags | __GFP_COMP | __GFP_ZERO,
>  					cpu_buffer->buffer->subbuf_order);
>  		if (!page)
>  			goto free_pages;
> @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>  
>  	cpu_buffer->reader_page = bpage;
>  
> -	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
> +	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
>  				cpu_buffer->buffer->subbuf_order);
>  	if (!page)
>  		goto fail_free_reader;
> @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
>  		goto out;
>  
>  	page = alloc_pages_node(cpu_to_node(cpu),
> -				GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
> +				GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO,
>  				cpu_buffer->buffer->subbuf_order);
>  	if (!page) {
>  		kfree(bpage);


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
  2024-04-09 20:12   ` kernel test robot
@ 2024-04-10 17:43   ` Steven Rostedt
  2024-04-23 16:04     ` Liam R. Howlett
  2024-04-18  6:55   ` Mike Rapoport
  2024-04-19 18:25   ` David Hildenbrand
  3 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2024-04-10 17:43 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: mhiramat, linux-kernel, linux-trace-kernel, mathieu.desnoyers,
	kernel-team, rdunlap, linux-mm

On Sat,  6 Apr 2024 18:36:46 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> +		    struct vm_area_struct *vma)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	unsigned long flags, *subbuf_ids;
> +	int err = 0;
> +
> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> +		return -EINVAL;
> +
> +	cpu_buffer = buffer->buffers[cpu];
> +
> +	mutex_lock(&cpu_buffer->mapping_lock);
> +
> +	if (cpu_buffer->mapped) {
> +		err = __rb_map_vma(cpu_buffer, vma);
> +		if (!err)
> +			err = __rb_inc_dec_mapped(cpu_buffer, true);
> +		mutex_unlock(&cpu_buffer->mapping_lock);
> +		return err;
> +	}
> +
> +	/* prevent another thread from changing buffer/sub-buffer sizes */
> +	mutex_lock(&buffer->mutex);
> +
> +	err = rb_alloc_meta_page(cpu_buffer);
> +	if (err)
> +		goto unlock;
> +
> +	/* subbuf_ids include the reader while nr_pages does not */
> +	subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), GFP_KERNEL);
> +	if (!subbuf_ids) {
> +		rb_free_meta_page(cpu_buffer);
> +		err = -ENOMEM;
> +		goto unlock;
> +	}
> +
> +	atomic_inc(&cpu_buffer->resize_disabled);
> +
> +	/*
> +	 * Lock all readers to block any subbuf swap until the subbuf IDs are
> +	 * assigned.
> +	 */
> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> +	rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +
> +	err = __rb_map_vma(cpu_buffer, vma);
> +	if (!err) {
> +		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> +		cpu_buffer->mapped = 1;
> +		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +	} else {
> +		kfree(cpu_buffer->subbuf_ids);
> +		cpu_buffer->subbuf_ids = NULL;
> +		rb_free_meta_page(cpu_buffer);
> +	}
> +unlock:

Nit: For all labels, please add a space before them. Otherwise, diffs will
show "unlock" as the function and not "ring_buffer_map", making it harder
to find where the change is.

Same for the labels below.
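
As an illustration (not from the patch itself): with a leading space,

	 unlock:
		mutex_unlock(&buffer->mutex);
		mutex_unlock(&cpu_buffer->mapping_lock);

the label no longer starts in column 0, so the function-context
heuristic used by diff/git skips it, and later hunks in this area are
reported against "int ring_buffer_map(...)" rather than "unlock:".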

-- Steve


> +	mutex_unlock(&buffer->mutex);
> +	mutex_unlock(&cpu_buffer->mapping_lock);
> +
> +	return err;
> +}
> +
> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	unsigned long flags;
> +	int err = 0;
> +
> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> +		return -EINVAL;
> +
> +	cpu_buffer = buffer->buffers[cpu];
> +
> +	mutex_lock(&cpu_buffer->mapping_lock);
> +
> +	if (!cpu_buffer->mapped) {
> +		err = -ENODEV;
> +		goto out;
> +	} else if (cpu_buffer->mapped > 1) {
> +		__rb_inc_dec_mapped(cpu_buffer, false);
> +		goto out;
> +	}
> +
> +	mutex_lock(&buffer->mutex);
> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> +
> +	cpu_buffer->mapped = 0;
> +
> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +
> +	kfree(cpu_buffer->subbuf_ids);
> +	cpu_buffer->subbuf_ids = NULL;
> +	rb_free_meta_page(cpu_buffer);
> +	atomic_dec(&cpu_buffer->resize_disabled);
> +
> +	mutex_unlock(&buffer->mutex);
> +out:
> +	mutex_unlock(&cpu_buffer->mapping_lock);
> +
> +	return err;
> +}
> +
> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	unsigned long reader_size;
> +	unsigned long flags;
> +
> +	cpu_buffer = rb_get_mapped_buffer(buffer, cpu);
> +	if (IS_ERR(cpu_buffer))
> +		return (int)PTR_ERR(cpu_buffer);
> +
> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> +consume:
> +	if (rb_per_cpu_empty(cpu_buffer))
> +		goto out;
> +
> +	reader_size = rb_page_size(cpu_buffer->reader_page);
> +
> +	/*
> +	 * There are data to be read on the current reader page, we can
> +	 * return to the caller. But before that, we assume the latter will read
> +	 * everything. Let's update the kernel reader accordingly.
> +	 */
> +	if (cpu_buffer->reader_page->read < reader_size) {
> +		while (cpu_buffer->reader_page->read < reader_size)
> +			rb_advance_reader(cpu_buffer);
> +		goto out;
> +	}
> +
> +	if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
> +		goto out;
> +
> +	goto consume;
> +out:
> +	/* Some archs do not have data cache coherency between kernel and user-space */
> +	flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page));
> +
> +	rb_update_meta_page(cpu_buffer);
> +
> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +	rb_put_mapped_buffer(cpu_buffer);
> +
> +	return 0;
> +}
> +
>  /*
>   * We only allocate new buffers, never free them if the CPU goes down.
>   * If we were to free the buffer, then the user would lose any trace that was in


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space
  2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
                   ` (4 preceding siblings ...)
  2024-04-06 17:36 ` [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test Vincent Donnefort
@ 2024-04-10 17:56 ` Steven Rostedt
  2024-04-17  4:55   ` Mike Rapoport
  5 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2024-04-10 17:56 UTC (permalink / raw)
  To: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Lorenzo Stoakes
  Cc: Vincent Donnefort, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap


Hi Andrew, et al.

Linus said it's a hard requirement that this code gets an Acked-by (or
Reviewed-by) from the memory sub-maintainers before he will accept it.
He was upset that we faulted in pages one at a time instead of mapping it
in one go:

  https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w@mail.gmail.com/

Could you take a look at patches 1-3 to make sure they look sane from a
memory management point of view?

I really want this applied in the next merge window.

Thanks!

-- Steve


On Sat,  6 Apr 2024 18:36:44 +0100
Vincent Donnefort <vdonnefort@google.com> wrote:

> The tracing ring-buffers can be stored on disk or sent to network
> without any copy via splice. However the later doesn't allow real time
> processing of the traces. A solution is to give userspace direct access
> to the ring-buffer pages via a mapping. An application can now become a
> consumer of the ring-buffer, in a similar fashion to what trace_pipe
> offers.
> 
> Support for this new feature can already be found in libtracefs from
> version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.
> 
> Vincent
> 
> [ v1 -> v20 changelog snipped; see the cover letter at the top of the thread ]
> 
> Vincent Donnefort (5):
>   ring-buffer: allocate sub-buffers with __GFP_COMP
>   ring-buffer: Introducing ring-buffer mapping functions
>   tracing: Allow user-space mapping of the ring-buffer
>   Documentation: tracing: Add ring-buffer mapping
>   ring-buffer/selftest: Add ring-buffer mapping test
> 
>  Documentation/trace/index.rst                 |   1 +
>  Documentation/trace/ring-buffer-map.rst       | 106 +++++
>  include/linux/ring_buffer.h                   |   6 +
>  include/uapi/linux/trace_mmap.h               |  48 +++
>  kernel/trace/ring_buffer.c                    | 403 +++++++++++++++++-
>  kernel/trace/trace.c                          | 113 ++++-
>  kernel/trace/trace.h                          |   1 +
>  tools/testing/selftests/ring-buffer/Makefile  |   8 +
>  tools/testing/selftests/ring-buffer/config    |   2 +
>  .../testing/selftests/ring-buffer/map_test.c  | 302 +++++++++++++
>  10 files changed, 979 insertions(+), 11 deletions(-)
>  create mode 100644 Documentation/trace/ring-buffer-map.rst
>  create mode 100644 include/uapi/linux/trace_mmap.h
>  create mode 100644 tools/testing/selftests/ring-buffer/Makefile
>  create mode 100644 tools/testing/selftests/ring-buffer/config
>  create mode 100644 tools/testing/selftests/ring-buffer/map_test.c
> 
> 
> base-commit: 7604256cecef34a82333d9f78262d3180f4eb525


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space
  2024-04-10 17:56 ` [PATCH v20 0/5] Introducing trace buffer mapping by user-space Steven Rostedt
@ 2024-04-17  4:55   ` Mike Rapoport
  0 siblings, 0 replies; 23+ messages in thread
From: Mike Rapoport @ 2024-04-17  4:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Liam R. Howlett, Vlastimil Babka, Lorenzo Stoakes,
	Vincent Donnefort, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

(added linux-mm)

On Wed, Apr 10, 2024 at 01:56:12PM -0400, Steven Rostedt wrote:
> 
> Hi Andrew, et.al.
> 
> Linus said it's a hard requirement that this code gets an Acked-by (or
> Reviewed-by) from the memory sub-maintainers before he will accept it.
> He was upset that we faulted in pages one at a time instead of mapping it
> in one go:
> 
>   https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w@mail.gmail.com/
> 
> Could you take a look at patches 1-3 to make sure they look sane from a
> memory management point of view?
> 
> I really want this applied in the next merge window.
> 
> Thanks!
> 
> -- Steve
> 
> 
> On Sat,  6 Apr 2024 18:36:44 +0100
> Vincent Donnefort <vdonnefort@google.com> wrote:
> 
> > The tracing ring-buffers can be stored on disk or sent to network
> > without any copy via splice. However the later doesn't allow real time
> > processing of the traces. A solution is to give userspace direct access
> > to the ring-buffer pages via a mapping. An application can now become a
> > consumer of the ring-buffer, in a similar fashion to what trace_pipe
> > offers.
> > 
> > Support for this new feature can already be found in libtracefs from
> > version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.
> > 
> > Vincent
> > 
> > [ cover-letter changelog, patch list and diffstat snipped ]
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
  2024-04-09 20:12   ` kernel test robot
  2024-04-10 17:43   ` Steven Rostedt
@ 2024-04-18  6:55   ` Mike Rapoport
  2024-04-19  3:43     ` Steven Rostedt
  2024-04-19 18:25   ` David Hildenbrand
  3 siblings, 1 reply; 23+ messages in thread
From: Mike Rapoport @ 2024-04-18  6:55 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On Sat, Apr 06, 2024 at 06:36:46PM +0100, Vincent Donnefort wrote:
> In preparation for allowing the user-space to map a ring-buffer, add
> a set of mapping functions:
> 
>   ring_buffer_{map,unmap}()
> 
> And controls on the ring-buffer:
> 
>   ring_buffer_map_get_reader()  /* swap reader and head */
> 
> Mapping the ring-buffer also involves:
> 
>   A unique ID for each subbuf of the ring-buffer, currently they are
>   only identified through their in-kernel VA.
> 
>   A meta-page, where are stored ring-buffer statistics and a
>   description for the current reader
> 
> The linear mapping exposes the meta-page, and each subbuf of the
> ring-buffer, ordered following their unique ID, assigned during the
> first mapping.
> 
> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> size will remain unmodified and the splice enabling functions will in
> reality simply memcpy the data instead of swapping subbufs.
> 
> CC: <linux-mm@kvack.org>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index dc5ae4e96aee..96d2140b471e 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -6,6 +6,8 @@
>  #include <linux/seq_file.h>
>  #include <linux/poll.h>
>  
> +#include <uapi/linux/trace_mmap.h>
> +
>  struct trace_buffer;
>  struct ring_buffer_iter;
>  
> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
>  #define trace_rb_cpu_prepare	NULL
>  #endif
>  
> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> +		    struct vm_area_struct *vma);
> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>  #endif /* _LINUX_RING_BUFFER_H */
> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> new file mode 100644
> index 000000000000..ffcd8dfcaa4f
> --- /dev/null
> +++ b/include/uapi/linux/trace_mmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _TRACE_MMAP_H_
> +#define _TRACE_MMAP_H_
> +
> +#include <linux/types.h>
> +
> +/**
> + * struct trace_buffer_meta - Ring-buffer Meta-page description
> + * @meta_page_size:	Size of this meta-page.
> + * @meta_struct_len:	Size of this structure.
> + * @subbuf_size:	Size of each sub-buffer.
> + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
> + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> + * @reader.read:	Number of bytes read on the reader subbuf.
> + * @flags:		Placeholder for now, 0 until new features are supported.
> + * @entries:		Number of entries in the ring-buffer.
> + * @overrun:		Number of entries lost in the ring-buffer.
> + * @read:		Number of entries that have been read.
> + * @Reserved1:		Reserved for future use.
> + * @Reserved2:		Reserved for future use.
> + */
> +struct trace_buffer_meta {
> +	__u32		meta_page_size;
> +	__u32		meta_struct_len;
> +
> +	__u32		subbuf_size;
> +	__u32		nr_subbufs;
> +
> +	struct {
> +		__u64	lost_events;
> +		__u32	id;
> +		__u32	read;
> +	} reader;
> +
> +	__u64	flags;
> +
> +	__u64	entries;
> +	__u64	overrun;
> +	__u64	read;
> +
> +	__u64	Reserved1;
> +	__u64	Reserved2;

Why do you need reserved fields? This structure always resides in the
beginning of a page and the rest of the page is essentially "reserved".

> +};
> +
> +#endif /* _TRACE_MMAP_H_ */
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index cc9ebe593571..793ecc454039 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c

... 

> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> +				   unsigned long *subbuf_ids)
> +{
> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> +	struct buffer_page *first_subbuf, *subbuf;
> +	int id = 0;
> +
> +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> +	cpu_buffer->reader_page->id = id++;
> +
> +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> +	do {
> +		if (WARN_ON(id >= nr_subbufs))
> +			break;
> +
> +		subbuf_ids[id] = (unsigned long)subbuf->page;
> +		subbuf->id = id;
> +
> +		rb_inc_page(&subbuf);
> +		id++;
> +	} while (subbuf != first_subbuf);
> +
> +	/* install subbuf ID to kern VA translation */
> +	cpu_buffer->subbuf_ids = subbuf_ids;
> +
> +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;

Isn't this a single page?

> +	meta->meta_struct_len = sizeof(*meta);
> +	meta->nr_subbufs = nr_subbufs;
> +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> +
> +	rb_update_meta_page(cpu_buffer);
> +}

...

> +#define subbuf_page(off, start) \
> +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> +
> +#define foreach_subbuf_page(sub_order, start, page)		\

Nit: usually iterators in kernel use for_each

> +	page = subbuf_page(0, (start));				\
> +	for (int __off = 0; __off < (1 << (sub_order));		\
> +	     __off++, page = subbuf_page(__off, (start)))

The pages allocated with alloc_pages_node(.., subbuf_order) are
physically contiguous and their struct pages are also contiguous, so
inside a subbuf_order allocation you can just do page++.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-18  6:55   ` Mike Rapoport
@ 2024-04-19  3:43     ` Steven Rostedt
  2024-04-22 18:01       ` Vincent Donnefort
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Rostedt @ 2024-04-19  3:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Vincent Donnefort, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On Thu, 18 Apr 2024 09:55:55 +0300
Mike Rapoport <rppt@kernel.org> wrote:

Hi Mike,

Thanks for doing this review!

> > +/**
> > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > + * @meta_page_size:	Size of this meta-page.
> > + * @meta_struct_len:	Size of this structure.
> > + * @subbuf_size:	Size of each sub-buffer.
> > + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
> > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > + * @reader.read:	Number of bytes read on the reader subbuf.
> > + * @flags:		Placeholder for now, 0 until new features are supported.
> > + * @entries:		Number of entries in the ring-buffer.
> > + * @overrun:		Number of entries lost in the ring-buffer.
> > + * @read:		Number of entries that have been read.
> > + * @Reserved1:		Reserved for future use.
> > + * @Reserved2:		Reserved for future use.
> > + */
> > +struct trace_buffer_meta {
> > +	__u32		meta_page_size;
> > +	__u32		meta_struct_len;
> > +
> > +	__u32		subbuf_size;
> > +	__u32		nr_subbufs;
> > +
> > +	struct {
> > +		__u64	lost_events;
> > +		__u32	id;
> > +		__u32	read;
> > +	} reader;
> > +
> > +	__u64	flags;
> > +
> > +	__u64	entries;
> > +	__u64	overrun;
> > +	__u64	read;
> > +
> > +	__u64	Reserved1;
> > +	__u64	Reserved2;  
> 
> Why do you need reserved fields? This structure always resides in the
> beginning of a page and the rest of the page is essentially "reserved".

So this code is also going to be used in Arm's pKVM hypervisor code,
which will make use of these fields. Since we want to keep the same
interface between the two, the fields are reserved here and must not be
used by this side of the interface.

We probably should add a comment about that.
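
Something along these lines in the kerneldoc, perhaps (wording is only
a suggestion, not from the patch):

	 * @Reserved1:	Internal use only (kept for the pKVM flavour of this
	 *		interface); user-space must ignore it.
	 * @Reserved2:	Same as @Reserved1.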

> 
> > +};
> > +
> > +#endif /* _TRACE_MMAP_H_ */
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index cc9ebe593571..793ecc454039 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c  
> 
> ... 
> 
> > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > +				   unsigned long *subbuf_ids)
> > +{
> > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > +	struct buffer_page *first_subbuf, *subbuf;
> > +	int id = 0;
> > +
> > +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > +	cpu_buffer->reader_page->id = id++;
> > +
> > +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > +	do {
> > +		if (WARN_ON(id >= nr_subbufs))
> > +			break;
> > +
> > +		subbuf_ids[id] = (unsigned long)subbuf->page;
> > +		subbuf->id = id;
> > +
> > +		rb_inc_page(&subbuf);
> > +		id++;
> > +	} while (subbuf != first_subbuf);
> > +
> > +	/* install subbuf ID to kern VA translation */
> > +	cpu_buffer->subbuf_ids = subbuf_ids;
> > +
> > +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;  
> 
> Isn't this a single page?

One thing we are doing is to make sure that the subbuffers are aligned
by their size. If a subbuffer is 3 pages, it should be aligned on a
3-page boundary. This was something that Linus suggested.
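
For example, with subbuf_order == 1 (2-page sub-buffers), the layout
set up by __rb_map_vma() is:

	pgoff 0: meta-page
	pgoff 1: zero-page padding
	pgoff 2: sub-buffer 0
	pgoff 4: sub-buffer 1
	...

so meta_page_size covers the meta-page plus its padding and every
sub-buffer starts on a multiple of its own size.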

> 
> > +	meta->meta_struct_len = sizeof(*meta);
> > +	meta->nr_subbufs = nr_subbufs;
> > +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > +
> > +	rb_update_meta_page(cpu_buffer);
> > +}  
> 
> ...
> 
> > +#define subbuf_page(off, start) \
> > +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(sub_order, start, page)		\  
> 
> Nit: usually iterators in kernel use for_each

Ah, good catch. Yeah, that should be changed. But then ...

> 
> > +	page = subbuf_page(0, (start));				\
> > +	for (int __off = 0; __off < (1 << (sub_order));		\
> > +	     __off++, page = subbuf_page(__off, (start)))  
> 
> The pages are allocated with alloc_pages_node(.. subbuf_order) are
> physically contiguous and struct pages for them are also contiguous, so
> inside a subbuf_order allocation you can just do page++.
> 

I'm wondering if we should just nuke the macro. It was there because
the previous implementation did things twice. But now it's just done
once here:

+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page;
+
+		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}

Perhaps that should just be:

	while (s < nr_subbufs && p < nr_pages) {
		struct page *page;
		int off;

		page = subbuf_page(0, cpu_buffer->subbuf_ids[s]);
		for (off = 0; off < (1 << subbuf_order); off++, page++, p++) {
			if (p >= nr_pages)
				break;

			pages[p] = page;
		}
		s++;
	}

 ?

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
                     ` (2 preceding siblings ...)
  2024-04-18  6:55   ` Mike Rapoport
@ 2024-04-19 18:25   ` David Hildenbrand
  2024-04-22  9:27     ` David Hildenbrand
  3 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-04-19 18:25 UTC (permalink / raw)
  To: Vincent Donnefort, rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On 06.04.24 19:36, Vincent Donnefort wrote:
> In preparation for allowing the user-space to map a ring-buffer, add
> a set of mapping functions:
> 
>    ring_buffer_{map,unmap}()
> 
> And controls on the ring-buffer:
> 
>    ring_buffer_map_get_reader()  /* swap reader and head */
> 
> Mapping the ring-buffer also involves:
> 
>    A unique ID for each subbuf of the ring-buffer, currently they are
>    only identified through their in-kernel VA.
> 
>    A meta-page, where are stored ring-buffer statistics and a
>    description for the current reader
> 
> The linear mapping exposes the meta-page, and each subbuf of the
> ring-buffer, ordered following their unique ID, assigned during the
> first mapping.
> 
> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> size will remain unmodified and the splice enabling functions will in
> reality simply memcpy the data instead of swapping subbufs.
> 
> CC: <linux-mm@kvack.org>
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index dc5ae4e96aee..96d2140b471e 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -6,6 +6,8 @@
>   #include <linux/seq_file.h>
>   #include <linux/poll.h>
>   
> +#include <uapi/linux/trace_mmap.h>
> +
>   struct trace_buffer;
>   struct ring_buffer_iter;
>   
> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
>   #define trace_rb_cpu_prepare	NULL
>   #endif
>   
> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> +		    struct vm_area_struct *vma);
> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>   #endif /* _LINUX_RING_BUFFER_H */
> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> new file mode 100644
> index 000000000000..ffcd8dfcaa4f
> --- /dev/null
> +++ b/include/uapi/linux/trace_mmap.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _TRACE_MMAP_H_
> +#define _TRACE_MMAP_H_
> +
> +#include <linux/types.h>
> +
> +/**
> + * struct trace_buffer_meta - Ring-buffer Meta-page description
> + * @meta_page_size:	Size of this meta-page.
> + * @meta_struct_len:	Size of this structure.
> + * @subbuf_size:	Size of each sub-buffer.
> + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
> + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> + * @reader.read:	Number of bytes read on the reader subbuf.
> + * @flags:		Placeholder for now, 0 until new features are supported.
> + * @entries:		Number of entries in the ring-buffer.
> + * @overrun:		Number of entries lost in the ring-buffer.
> + * @read:		Number of entries that have been read.
> + * @Reserved1:		Reserved for future use.
> + * @Reserved2:		Reserved for future use.
> + */
> +struct trace_buffer_meta {
> +	__u32		meta_page_size;
> +	__u32		meta_struct_len;
> +
> +	__u32		subbuf_size;
> +	__u32		nr_subbufs;
> +
> +	struct {
> +		__u64	lost_events;
> +		__u32	id;
> +		__u32	read;
> +	} reader;
> +
> +	__u64	flags;
> +
> +	__u64	entries;
> +	__u64	overrun;
> +	__u64	read;
> +
> +	__u64	Reserved1;
> +	__u64	Reserved2;
> +};
> +
> +#endif /* _TRACE_MMAP_H_ */
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index cc9ebe593571..793ecc454039 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -9,6 +9,7 @@
>   #include <linux/ring_buffer.h>
>   #include <linux/trace_clock.h>
>   #include <linux/sched/clock.h>
> +#include <linux/cacheflush.h>
>   #include <linux/trace_seq.h>
>   #include <linux/spinlock.h>
>   #include <linux/irq_work.h>
> @@ -26,6 +27,7 @@
>   #include <linux/list.h>
>   #include <linux/cpu.h>
>   #include <linux/oom.h>
> +#include <linux/mm.h>
>   
>   #include <asm/local64.h>
>   #include <asm/local.h>
> @@ -338,6 +340,7 @@ struct buffer_page {
>   	local_t		 entries;	/* entries on this page */
>   	unsigned long	 real_end;	/* real end of data */
>   	unsigned	 order;		/* order of the page */
> +	u32		 id;		/* ID for external mapping */
>   	struct buffer_data_page *page;	/* Actual data page */
>   };
>   
> @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
>   	u64				read_stamp;
>   	/* pages removed since last reset */
>   	unsigned long			pages_removed;
> +
> +	unsigned int			mapped;
> +	struct mutex			mapping_lock;
> +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
> +	struct trace_buffer_meta	*meta_page;
> +
>   	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>   	long				nr_pages_to_update;
>   	struct list_head		new_pages; /* new pages to add */
> @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>   	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
>   	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>   	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
> +	mutex_init(&cpu_buffer->mapping_lock);
>   
>   	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>   			    GFP_KERNEL, cpu_to_node(cpu));
> @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
>   	return buffer->time_stamp_abs;
>   }
>   
> -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
> -
>   static inline unsigned long rb_page_entries(struct buffer_page *bpage)
>   {
>   	return local_read(&bpage->entries) & RB_WRITE_MASK;
> @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
>   	page->read = 0;
>   }
>   
> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +
> +	meta->reader.read = cpu_buffer->reader_page->read;
> +	meta->reader.id = cpu_buffer->reader_page->id;
> +	meta->reader.lost_events = cpu_buffer->lost_events;
> +
> +	meta->entries = local_read(&cpu_buffer->entries);
> +	meta->overrun = local_read(&cpu_buffer->overrun);
> +	meta->read = cpu_buffer->read;
> +
> +	/* Some archs do not have data cache coherency between kernel and user-space */
> +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
> +}
> +
>   static void
>   rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>   {
> @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>   	cpu_buffer->lost_events = 0;
>   	cpu_buffer->last_overrun = 0;
>   
> +	if (cpu_buffer->mapped)
> +		rb_update_meta_page(cpu_buffer);
> +
>   	rb_head_page_activate(cpu_buffer);
>   	cpu_buffer->pages_removed = 0;
>   }
> @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
>   	cpu_buffer_a = buffer_a->buffers[cpu];
>   	cpu_buffer_b = buffer_b->buffers[cpu];
>   
> +	/* It's up to the callers to not try to swap mapped buffers */
> +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
>   	/* At least make sure the two buffers are somewhat the same */
>   	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
>   		goto out;
> @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>   	 * Otherwise, we can simply swap the page with the one passed in.
>   	 */
>   	if (read || (len < (commit - read)) ||
> -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
> +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
> +	    cpu_buffer->mapped) {
>   		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
>   		unsigned int rpos = read;
>   		unsigned int pos = 0;
> @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>   
>   		cpu_buffer = buffer->buffers[cpu];
>   
> +		if (cpu_buffer->mapped) {
> +			err = -EBUSY;
> +			goto error;
> +		}
> +
>   		/* Update the number of pages to match the new size */
>   		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
>   		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
> @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>   }
>   EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>   
> +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	struct page *page;
> +
> +	if (cpu_buffer->meta_page)
> +		return 0;
> +
> +	page = alloc_page(GFP_USER | __GFP_ZERO);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	cpu_buffer->meta_page = page_to_virt(page);
> +
> +	return 0;
> +}
> +
> +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
> +
> +	free_page(addr);
> +	cpu_buffer->meta_page = NULL;
> +}
> +
> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> +				   unsigned long *subbuf_ids)
> +{
> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> +	struct buffer_page *first_subbuf, *subbuf;
> +	int id = 0;
> +
> +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> +	cpu_buffer->reader_page->id = id++;
> +
> +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> +	do {
> +		if (WARN_ON(id >= nr_subbufs))
> +			break;
> +
> +		subbuf_ids[id] = (unsigned long)subbuf->page;
> +		subbuf->id = id;
> +
> +		rb_inc_page(&subbuf);
> +		id++;
> +	} while (subbuf != first_subbuf);
> +
> +	/* install subbuf ID to kern VA translation */
> +	cpu_buffer->subbuf_ids = subbuf_ids;
> +
> +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
> +	meta->meta_struct_len = sizeof(*meta);
> +	meta->nr_subbufs = nr_subbufs;
> +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> +
> +	rb_update_meta_page(cpu_buffer);
> +}
> +
> +static struct ring_buffer_per_cpu *
> +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +
> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> +		return ERR_PTR(-EINVAL);
> +
> +	cpu_buffer = buffer->buffers[cpu];
> +
> +	mutex_lock(&cpu_buffer->mapping_lock);
> +
> +	if (!cpu_buffer->mapped) {
> +		mutex_unlock(&cpu_buffer->mapping_lock);
> +		return ERR_PTR(-ENODEV);
> +	}
> +
> +	return cpu_buffer;
> +}
> +
> +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	mutex_unlock(&cpu_buffer->mapping_lock);
> +}
> +
> +/*
> + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
> + * to be set-up or torn-down.
> + */
> +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
> +			       bool inc)
> +{
> +	unsigned long flags;
> +
> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> +
> +	if (inc && cpu_buffer->mapped == UINT_MAX)
> +		return -EBUSY;
> +
> +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
> +		return -EINVAL;
> +
> +	mutex_lock(&cpu_buffer->buffer->mutex);
> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> +
> +	if (inc)
> +		cpu_buffer->mapped++;
> +	else
> +		cpu_buffer->mapped--;
> +
> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> +	mutex_unlock(&cpu_buffer->buffer->mutex);
> +
> +	return 0;
> +}
> +
> +#define subbuf_page(off, start) \
> +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> +
> +#define foreach_subbuf_page(sub_order, start, page)		\
> +	page = subbuf_page(0, (start));				\
> +	for (int __off = 0; __off < (1 << (sub_order));		\
> +	     __off++, page = subbuf_page(__off, (start)))
> +
> +/*
> + *   +--------------+  pgoff == 0
> + *   |   meta page  |
> + *   +--------------+  pgoff == 1
> + *   |   000000000  |
> + *   +--------------+  pgoff == (1 << subbuf_order)
> + *   | subbuffer 0  |
> + *   |              |
> + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
> + *   | subbuffer 1  |
> + *   |              |
> + *         ...
> + */
> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> +			struct vm_area_struct *vma)
> +{
> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> +	unsigned int subbuf_pages, subbuf_order;
> +	struct page **pages;
> +	int p = 0, s = 0;
> +	int err;
> +
> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> +
> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
> +	subbuf_pages = 1 << subbuf_order;
> +
> +	if (subbuf_order && pgoff % subbuf_pages)
> +		return -EINVAL;
> +
> +	nr_subbufs = cpu_buffer->nr_pages + 1;
> +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
> +
> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +	if (!vma_pages || vma_pages > nr_pages)
> +		return -EINVAL;
> +
> +	nr_pages = vma_pages;
> +
> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	if (!pgoff) {
> +		unsigned long meta_page_padding;
> +
> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
> +
> +		/*
> +		 * Pad with the zero-page to align the meta-page with the
> +		 * sub-buffers.
> +		 */
> +		meta_page_padding = subbuf_pages - 1;
> +		while (meta_page_padding-- && p < nr_pages)
> +			pages[p++] = ZERO_PAGE(0);

Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO 
nor VM_PFNMAP can be problematic. So we really need patch #3 logic to 
use VM_PFNMAP.

It would be cleaner/more obvious if this VMA flag setting resided
here, if possible.
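
Concretely, that would mean having something like this in
__rb_map_vma() itself (only a sketch; the exact flag set should be
whatever patch 3 applies today, VM_PFNMAP being the one that matters
for the zeropage concern above):

	/* sketch, not from the series */
	vm_flags_mod(vma, VM_PFNMAP | VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);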

> +	} else {
> +		/* Skip the meta-page */
> +		pgoff -= subbuf_pages;
> +
> +		s += pgoff / subbuf_pages;
> +	}
> +
> +	while (s < nr_subbufs && p < nr_pages) {
> +		struct page *page;
> +
> +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
> +			if (p >= nr_pages)
> +				break;
> +
> +			pages[p++] = page;
> +		}
> +		s++;
> +	}
> +
> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

I think Linus suggested it ("avoid all the sub-page ref-counts entirely 
by using VM_PFNMAP, and use vm_insert_pages()"), but ... 
vm_insert_pages() will:
* Mess with mapcounts
* Mess with refcounts

See 
insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().

So we'll mess with the mapcount and refcount of the shared zeropage ... 
hmmmm

If I am not wrong, vm_normal_page() would not return the shared zeropage 
in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so 
unmap()->...->zap_present_ptes() would not decrement the refcount and we 
could overflow it. ... we also shouldn't ever mess with the mapcount of 
the shared zeropage in the first place.

vm_insert_page() is clearer on that: "This allows drivers to insert 
individual pages they've allocated into a user vma". You didn't allocate 
the shared zeropage.

I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at 
the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is 
already set and it would set VM_MIXEDMAP ... similarly 
vmf_insert_pfn_prot() would not be happy about that at all ...

	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP));

... remap_pfn_range() is used by io_uring_mmap for a similar purpose. 
But it only supports a single PFN range at a time and requires the 
caller to handle refcounting of pages.

It's getting late in Germany, so I might be missing something; but using 
the shared zeropage here might be a problem.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-19 18:25   ` David Hildenbrand
@ 2024-04-22  9:27     ` David Hildenbrand
  2024-04-22 18:20       ` Vincent Donnefort
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-04-22  9:27 UTC (permalink / raw)
  To: Vincent Donnefort, rostedt, mhiramat, linux-kernel, linux-trace-kernel
  Cc: mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On 19.04.24 20:25, David Hildenbrand wrote:
> On 06.04.24 19:36, Vincent Donnefort wrote:
>> In preparation for allowing the user-space to map a ring-buffer, add
>> a set of mapping functions:
>>
>>     ring_buffer_{map,unmap}()
>>
>> And controls on the ring-buffer:
>>
>>     ring_buffer_map_get_reader()  /* swap reader and head */
>>
>> Mapping the ring-buffer also involves:
>>
>>     A unique ID for each subbuf of the ring-buffer, currently they are
>>     only identified through their in-kernel VA.
>>
>>     A meta-page, where are stored ring-buffer statistics and a
>>     description for the current reader
>>
>> The linear mapping exposes the meta-page, and each subbuf of the
>> ring-buffer, ordered following their unique ID, assigned during the
>> first mapping.
>>
>> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
>> size will remain unmodified and the splice enabling functions will in
>> reality simply memcpy the data instead of swapping subbufs.
>>
>> CC: <linux-mm@kvack.org>
>> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>>
>> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>> index dc5ae4e96aee..96d2140b471e 100644
>> --- a/include/linux/ring_buffer.h
>> +++ b/include/linux/ring_buffer.h
>> @@ -6,6 +6,8 @@
>>    #include <linux/seq_file.h>
>>    #include <linux/poll.h>
>>    
>> +#include <uapi/linux/trace_mmap.h>
>> +
>>    struct trace_buffer;
>>    struct ring_buffer_iter;
>>    
>> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
>>    #define trace_rb_cpu_prepare	NULL
>>    #endif
>>    
>> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
>> +		    struct vm_area_struct *vma);
>> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
>> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>>    #endif /* _LINUX_RING_BUFFER_H */
>> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
>> new file mode 100644
>> index 000000000000..ffcd8dfcaa4f
>> --- /dev/null
>> +++ b/include/uapi/linux/trace_mmap.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _TRACE_MMAP_H_
>> +#define _TRACE_MMAP_H_
>> +
>> +#include <linux/types.h>
>> +
>> +/**
>> + * struct trace_buffer_meta - Ring-buffer Meta-page description
>> + * @meta_page_size:	Size of this meta-page.
>> + * @meta_struct_len:	Size of this structure.
>> + * @subbuf_size:	Size of each sub-buffer.
>> + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
>> + * @reader.lost_events:	Number of events lost at the time of the reader swap.
>> + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
>> + * @reader.read:	Number of bytes read on the reader subbuf.
>> + * @flags:		Placeholder for now, 0 until new features are supported.
>> + * @entries:		Number of entries in the ring-buffer.
>> + * @overrun:		Number of entries lost in the ring-buffer.
>> + * @read:		Number of entries that have been read.
>> + * @Reserved1:		Reserved for future use.
>> + * @Reserved2:		Reserved for future use.
>> + */
>> +struct trace_buffer_meta {
>> +	__u32		meta_page_size;
>> +	__u32		meta_struct_len;
>> +
>> +	__u32		subbuf_size;
>> +	__u32		nr_subbufs;
>> +
>> +	struct {
>> +		__u64	lost_events;
>> +		__u32	id;
>> +		__u32	read;
>> +	} reader;
>> +
>> +	__u64	flags;
>> +
>> +	__u64	entries;
>> +	__u64	overrun;
>> +	__u64	read;
>> +
>> +	__u64	Reserved1;
>> +	__u64	Reserved2;
>> +};
>> +
>> +#endif /* _TRACE_MMAP_H_ */
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index cc9ebe593571..793ecc454039 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -9,6 +9,7 @@
>>    #include <linux/ring_buffer.h>
>>    #include <linux/trace_clock.h>
>>    #include <linux/sched/clock.h>
>> +#include <linux/cacheflush.h>
>>    #include <linux/trace_seq.h>
>>    #include <linux/spinlock.h>
>>    #include <linux/irq_work.h>
>> @@ -26,6 +27,7 @@
>>    #include <linux/list.h>
>>    #include <linux/cpu.h>
>>    #include <linux/oom.h>
>> +#include <linux/mm.h>
>>    
>>    #include <asm/local64.h>
>>    #include <asm/local.h>
>> @@ -338,6 +340,7 @@ struct buffer_page {
>>    	local_t		 entries;	/* entries on this page */
>>    	unsigned long	 real_end;	/* real end of data */
>>    	unsigned	 order;		/* order of the page */
>> +	u32		 id;		/* ID for external mapping */
>>    	struct buffer_data_page *page;	/* Actual data page */
>>    };
>>    
>> @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
>>    	u64				read_stamp;
>>    	/* pages removed since last reset */
>>    	unsigned long			pages_removed;
>> +
>> +	unsigned int			mapped;
>> +	struct mutex			mapping_lock;
>> +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
>> +	struct trace_buffer_meta	*meta_page;
>> +
>>    	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>>    	long				nr_pages_to_update;
>>    	struct list_head		new_pages; /* new pages to add */
>> @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>>    	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
>>    	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>>    	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
>> +	mutex_init(&cpu_buffer->mapping_lock);
>>    
>>    	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>>    			    GFP_KERNEL, cpu_to_node(cpu));
>> @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
>>    	return buffer->time_stamp_abs;
>>    }
>>    
>> -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
>> -
>>    static inline unsigned long rb_page_entries(struct buffer_page *bpage)
>>    {
>>    	return local_read(&bpage->entries) & RB_WRITE_MASK;
>> @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
>>    	page->read = 0;
>>    }
>>    
>> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>> +{
>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>> +
>> +	meta->reader.read = cpu_buffer->reader_page->read;
>> +	meta->reader.id = cpu_buffer->reader_page->id;
>> +	meta->reader.lost_events = cpu_buffer->lost_events;
>> +
>> +	meta->entries = local_read(&cpu_buffer->entries);
>> +	meta->overrun = local_read(&cpu_buffer->overrun);
>> +	meta->read = cpu_buffer->read;
>> +
>> +	/* Some archs do not have data cache coherency between kernel and user-space */
>> +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
>> +}
>> +
>>    static void
>>    rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>    {
>> @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>    	cpu_buffer->lost_events = 0;
>>    	cpu_buffer->last_overrun = 0;
>>    
>> +	if (cpu_buffer->mapped)
>> +		rb_update_meta_page(cpu_buffer);
>> +
>>    	rb_head_page_activate(cpu_buffer);
>>    	cpu_buffer->pages_removed = 0;
>>    }
>> @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
>>    	cpu_buffer_a = buffer_a->buffers[cpu];
>>    	cpu_buffer_b = buffer_b->buffers[cpu];
>>    
>> +	/* It's up to the callers to not try to swap mapped buffers */
>> +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
>> +		ret = -EBUSY;
>> +		goto out;
>> +	}
>> +
>>    	/* At least make sure the two buffers are somewhat the same */
>>    	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
>>    		goto out;
>> @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>>    	 * Otherwise, we can simply swap the page with the one passed in.
>>    	 */
>>    	if (read || (len < (commit - read)) ||
>> -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
>> +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
>> +	    cpu_buffer->mapped) {
>>    		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
>>    		unsigned int rpos = read;
>>    		unsigned int pos = 0;
>> @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>    
>>    		cpu_buffer = buffer->buffers[cpu];
>>    
>> +		if (cpu_buffer->mapped) {
>> +			err = -EBUSY;
>> +			goto error;
>> +		}
>> +
>>    		/* Update the number of pages to match the new size */
>>    		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
>>    		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
>> @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>    }
>>    EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>>    
>> +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>> +{
>> +	struct page *page;
>> +
>> +	if (cpu_buffer->meta_page)
>> +		return 0;
>> +
>> +	page = alloc_page(GFP_USER | __GFP_ZERO);
>> +	if (!page)
>> +		return -ENOMEM;
>> +
>> +	cpu_buffer->meta_page = page_to_virt(page);
>> +
>> +	return 0;
>> +}
>> +
>> +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>> +{
>> +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
>> +
>> +	free_page(addr);
>> +	cpu_buffer->meta_page = NULL;
>> +}
>> +
>> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
>> +				   unsigned long *subbuf_ids)
>> +{
>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>> +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
>> +	struct buffer_page *first_subbuf, *subbuf;
>> +	int id = 0;
>> +
>> +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
>> +	cpu_buffer->reader_page->id = id++;
>> +
>> +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
>> +	do {
>> +		if (WARN_ON(id >= nr_subbufs))
>> +			break;
>> +
>> +		subbuf_ids[id] = (unsigned long)subbuf->page;
>> +		subbuf->id = id;
>> +
>> +		rb_inc_page(&subbuf);
>> +		id++;
>> +	} while (subbuf != first_subbuf);
>> +
>> +	/* install subbuf ID to kern VA translation */
>> +	cpu_buffer->subbuf_ids = subbuf_ids;
>> +
>> +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
>> +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
>> +	meta->meta_struct_len = sizeof(*meta);
>> +	meta->nr_subbufs = nr_subbufs;
>> +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
>> +
>> +	rb_update_meta_page(cpu_buffer);
>> +}
>> +
>> +static struct ring_buffer_per_cpu *
>> +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
>> +{
>> +	struct ring_buffer_per_cpu *cpu_buffer;
>> +
>> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	cpu_buffer = buffer->buffers[cpu];
>> +
>> +	mutex_lock(&cpu_buffer->mapping_lock);
>> +
>> +	if (!cpu_buffer->mapped) {
>> +		mutex_unlock(&cpu_buffer->mapping_lock);
>> +		return ERR_PTR(-ENODEV);
>> +	}
>> +
>> +	return cpu_buffer;
>> +}
>> +
>> +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
>> +{
>> +	mutex_unlock(&cpu_buffer->mapping_lock);
>> +}
>> +
>> +/*
>> + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
>> + * to be set-up or torn-down.
>> + */
>> +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
>> +			       bool inc)
>> +{
>> +	unsigned long flags;
>> +
>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>> +
>> +	if (inc && cpu_buffer->mapped == UINT_MAX)
>> +		return -EBUSY;
>> +
>> +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&cpu_buffer->buffer->mutex);
>> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>> +
>> +	if (inc)
>> +		cpu_buffer->mapped++;
>> +	else
>> +		cpu_buffer->mapped--;
>> +
>> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>> +	mutex_unlock(&cpu_buffer->buffer->mutex);
>> +
>> +	return 0;
>> +}
>> +
>> +#define subbuf_page(off, start) \
>> +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
>> +
>> +#define foreach_subbuf_page(sub_order, start, page)		\
>> +	page = subbuf_page(0, (start));				\
>> +	for (int __off = 0; __off < (1 << (sub_order));		\
>> +	     __off++, page = subbuf_page(__off, (start)))
>> +
>> +/*
>> + *   +--------------+  pgoff == 0
>> + *   |   meta page  |
>> + *   +--------------+  pgoff == 1
>> + *   |   000000000  |
>> + *   +--------------+  pgoff == (1 << subbuf_order)
>> + *   | subbuffer 0  |
>> + *   |              |
>> + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
>> + *   | subbuffer 1  |
>> + *   |              |
>> + *         ...
>> + */
>> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
>> +			struct vm_area_struct *vma)
>> +{
>> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
>> +	unsigned int subbuf_pages, subbuf_order;
>> +	struct page **pages;
>> +	int p = 0, s = 0;
>> +	int err;
>> +
>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>> +
>> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
>> +	subbuf_pages = 1 << subbuf_order;
>> +
>> +	if (subbuf_order && pgoff % subbuf_pages)
>> +		return -EINVAL;
>> +
>> +	nr_subbufs = cpu_buffer->nr_pages + 1;
>> +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
>> +
>> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> +	if (!vma_pages || vma_pages > nr_pages)
>> +		return -EINVAL;
>> +
>> +	nr_pages = vma_pages;
>> +
>> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
>> +	if (!pages)
>> +		return -ENOMEM;
>> +
>> +	if (!pgoff) {
>> +		unsigned long meta_page_padding;
>> +
>> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
>> +
>> +		/*
>> +		 * Pad with the zero-page to align the meta-page with the
>> +		 * sub-buffers.
>> +		 */
>> +		meta_page_padding = subbuf_pages - 1;
>> +		while (meta_page_padding-- && p < nr_pages)
>> +			pages[p++] = ZERO_PAGE(0);
> 
> Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO
> nor VM_PFNMAP can be problematic. So we really need patch #3 logic to
> use VM_PFNMAP.
> 
> It would be cleaner/more obvious if these VMA flag setting would reside
> here, if possible.
> 
>> +	} else {
>> +		/* Skip the meta-page */
>> +		pgoff -= subbuf_pages;
>> +
>> +		s += pgoff / subbuf_pages;
>> +	}
>> +
>> +	while (s < nr_subbufs && p < nr_pages) {
>> +		struct page *page;
>> +
>> +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
>> +			if (p >= nr_pages)
>> +				break;
>> +
>> +			pages[p++] = page;
>> +		}
>> +		s++;
>> +	}
>> +
>> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> 
> I think Linus suggested it ("avoid all the sub-page ref-counts entirely
> by using VM_PFNMAP, and use vm_insert_pages()"), but ...
> vm_insert_pages() will:
> * Mess with mapcounts
> * Mess with refcounts
> 
> See
> insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().
> 
> So we'll mess with the mapcount and refcount of the shared zeropage ...
> hmmmm
> 
> If I am not wrong, vm_normal_page() would not return the shared zeropage
> in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so
> unmap()->...->zap_present_ptes() would not decrement the refcount and we
> could overflow it. ... we also shouldn't ever mess with the mapcount of
> the shared zeropage in the first place.
> 
> vm_insert_page() is clearer on that: "This allows drivers to insert
> individual pages they've allocated into a user vma". You didn't allocate
> the shared zeropage.
> 
> I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at
> the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is
> already set and it would set VM_MIXEDMAP ... similarly
> vmf_insert_pfn_prot() would not be happy about that at all ...
> 
> BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
> (VM_PFNMAP|VM_MIXEDMAP));
> 
> ... remap_pfn_range() is used by io_uring_mmap for a similar purpose.
> But it only supports a single PFN range at a time and requires the
> caller to handle refcounting of pages.
> 
> It's getting late in Germany, so I might be missing something; but using
> the shared zeropage here might be a problem.
> 

I was just about to write code to cleanly support vm_insert_pages() to 
consume zeropages ... when I started to wonder about something else:


+	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+	    !(vma->vm_flags & VM_MAYSHARE))
+		return -EPERM;
+

You disallow writable mappings. But what about using mprotect() 
afterwards to allow for write permissions?

Likely you'd want to remove VM_MAYWRITE from the flags, otherwise 
mprotect() might later succeed.
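
Something like this is what I have in mind -- only a sketch, with a 
hypothetical ->mmap() callback; the key bit is clearing VM_MAYWRITE so that 
a later mprotect(PROT_WRITE) fails with -EACCES:

static int example_mmap(struct file *filp, struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
	    !(vma->vm_flags & VM_MAYSHARE))
		return -EPERM;

	/* Clearing VM_MAYWRITE is what makes mprotect(PROT_WRITE) fail */
	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);

	return do_the_actual_mapping(vma); /* hypothetical helper */
}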

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-19  3:43     ` Steven Rostedt
@ 2024-04-22 18:01       ` Vincent Donnefort
  0 siblings, 0 replies; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-22 18:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mike Rapoport, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On Thu, Apr 18, 2024 at 11:43:46PM -0400, Steven Rostedt wrote:
> On Thu, 18 Apr 2024 09:55:55 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> Hi Mike,
> 
> Thanks for doing this review!
> 
> > > +/**
> > > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > > + * @meta_page_size:	Size of this meta-page.
> > > + * @meta_struct_len:	Size of this structure.
> > > + * @subbuf_size:	Size of each sub-buffer.
> > > + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
> > > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > > + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > > + * @reader.read:	Number of bytes read on the reader subbuf.
> > > + * @flags:		Placeholder for now, 0 until new features are supported.
> > > + * @entries:		Number of entries in the ring-buffer.
> > > + * @overrun:		Number of entries lost in the ring-buffer.
> > > + * @read:		Number of entries that have been read.
> > > + * @Reserved1:		Reserved for future use.
> > > + * @Reserved2:		Reserved for future use.
> > > + */
> > > +struct trace_buffer_meta {
> > > +	__u32		meta_page_size;
> > > +	__u32		meta_struct_len;
> > > +
> > > +	__u32		subbuf_size;
> > > +	__u32		nr_subbufs;
> > > +
> > > +	struct {
> > > +		__u64	lost_events;
> > > +		__u32	id;
> > > +		__u32	read;
> > > +	} reader;
> > > +
> > > +	__u64	flags;
> > > +
> > > +	__u64	entries;
> > > +	__u64	overrun;
> > > +	__u64	read;
> > > +
> > > +	__u64	Reserved1;
> > > +	__u64	Reserved2;  
> > 
> > Why do you need reserved fields? This structure always resides in the
> > beginning of a page and the rest of the page is essentially "reserved".
> 
> So this code is also going to be used in arm's pkvm hypervisor code,
> where it will be using these fields, but since we are looking at
> keeping the same interface between the two, we don't want these used by
> this interface.
> 
> We probably should add a comment about that.
> 
> > 
> > > +};
> > > +
> > > +#endif /* _TRACE_MMAP_H_ */
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index cc9ebe593571..793ecc454039 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c  
> > 
> > ... 
> > 
> > > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > > +				   unsigned long *subbuf_ids)
> > > +{
> > > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > > +	struct buffer_page *first_subbuf, *subbuf;
> > > +	int id = 0;
> > > +
> > > +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > > +	cpu_buffer->reader_page->id = id++;
> > > +
> > > +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > > +	do {
> > > +		if (WARN_ON(id >= nr_subbufs))
> > > +			break;
> > > +
> > > +		subbuf_ids[id] = (unsigned long)subbuf->page;
> > > +		subbuf->id = id;
> > > +
> > > +		rb_inc_page(&subbuf);
> > > +		id++;
> > > +	} while (subbuf != first_subbuf);
> > > +
> > > +	/* install subbuf ID to kern VA translation */
> > > +	cpu_buffer->subbuf_ids = subbuf_ids;
> > > +
> > > +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > > +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;  
> > 
> > Isn't this a single page?
> 
> One thing we are doing is to make sure that the subbuffers are aligned
> by their size. If a subbuffer is 3 pages, it should be aligned on 3
> page boundaries. This was something that Linus suggested.
> 
> > 
> > > +	meta->meta_struct_len = sizeof(*meta);
> > > +	meta->nr_subbufs = nr_subbufs;
> > > +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > > +
> > > +	rb_update_meta_page(cpu_buffer);
> > > +}  
> > 
> > ...
> > 
> > > +#define subbuf_page(off, start) \
> > > +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > > +
> > > +#define foreach_subbuf_page(sub_order, start, page)		\  
> > 
> > Nit: usually iterators in kernel use for_each
> 
> Ah, good catch. Yeah, that should be changed. But then ...
> 
> > 
> > > +	page = subbuf_page(0, (start));				\
> > > +	for (int __off = 0; __off < (1 << (sub_order));		\
> > > +	     __off++, page = subbuf_page(__off, (start)))  
> > 
> > The pages allocated with alloc_pages_node(.. subbuf_order) are
> > physically contiguous and their struct pages are also contiguous, so
> > inside a subbuf_order allocation you can just do page++.
> > 
> 
> I'm wondering if we should just nuke the macro. It was there because
> the previous implementation did things twice. But now it's just done
> once here:
> 
> +	while (s < nr_subbufs && p < nr_pages) {
> +		struct page *page;
> +
> +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
> +			if (p >= nr_pages)
> +				break;
> +
> +			pages[p++] = page;
> +		}
> +		s++;
> +	}
> 
> Perhaps that should just be:
> 
> 	while (s < nr_subbufs && p < nr_pages) {
> 		struct page *page;
> 		int off;
> 
> 		page = subbuf_page(0, cpu_buffer->subbuf_ids[s]);
> 		for (off = 0; off < (1 << subbuf_order); off++, page++, p++) {
> 			if (p >= nr_pages)
> 				break;
> 
> 			pages[p] = page;
> 		}
> 		s++;
> 	}
> 
>  ?

Yeah, I was hesitating to kill it with the last version. Happy to do it for the
next one.

> 
> -- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-22  9:27     ` David Hildenbrand
@ 2024-04-22 18:20       ` Vincent Donnefort
  2024-04-22 18:27         ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-22 18:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

Hi David,

Thanks for having a look, very much appreciated!

On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:
> On 19.04.24 20:25, David Hildenbrand wrote:
> > On 06.04.24 19:36, Vincent Donnefort wrote:
> > > In preparation for allowing the user-space to map a ring-buffer, add
> > > a set of mapping functions:
> > > 
> > >     ring_buffer_{map,unmap}()
> > > 
> > > And controls on the ring-buffer:
> > > 
> > >     ring_buffer_map_get_reader()  /* swap reader and head */
> > > 
> > > Mapping the ring-buffer also involves:
> > > 
> > >     A unique ID for each subbuf of the ring-buffer, currently they are
> > >     only identified through their in-kernel VA.
> > > 
> > >     A meta-page, where are stored ring-buffer statistics and a
> > >     description for the current reader
> > > 
> > > The linear mapping exposes the meta-page, and each subbuf of the
> > > ring-buffer, ordered following their unique ID, assigned during the
> > > first mapping.
> > > 
> > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> > > size will remain unmodified and the splice enabling functions will in
> > > reality simply memcpy the data instead of swapping subbufs.
> > > 
> > > CC: <linux-mm@kvack.org>
> > > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> > > 
> > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> > > index dc5ae4e96aee..96d2140b471e 100644
> > > --- a/include/linux/ring_buffer.h
> > > +++ b/include/linux/ring_buffer.h
> > > @@ -6,6 +6,8 @@
> > >    #include <linux/seq_file.h>
> > >    #include <linux/poll.h>
> > > +#include <uapi/linux/trace_mmap.h>
> > > +
> > >    struct trace_buffer;
> > >    struct ring_buffer_iter;
> > > @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
> > >    #define trace_rb_cpu_prepare	NULL
> > >    #endif
> > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> > > +		    struct vm_area_struct *vma);
> > > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
> > > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
> > >    #endif /* _LINUX_RING_BUFFER_H */
> > > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> > > new file mode 100644
> > > index 000000000000..ffcd8dfcaa4f
> > > --- /dev/null
> > > +++ b/include/uapi/linux/trace_mmap.h
> > > @@ -0,0 +1,46 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > +#ifndef _TRACE_MMAP_H_
> > > +#define _TRACE_MMAP_H_
> > > +
> > > +#include <linux/types.h>
> > > +
> > > +/**
> > > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > > + * @meta_page_size:	Size of this meta-page.
> > > + * @meta_struct_len:	Size of this structure.
> > > + * @subbuf_size:	Size of each sub-buffer.
> > > + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
> > > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > > + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > > + * @reader.read:	Number of bytes read on the reader subbuf.
> > > + * @flags:		Placeholder for now, 0 until new features are supported.
> > > + * @entries:		Number of entries in the ring-buffer.
> > > + * @overrun:		Number of entries lost in the ring-buffer.
> > > + * @read:		Number of entries that have been read.
> > > + * @Reserved1:		Reserved for future use.
> > > + * @Reserved2:		Reserved for future use.
> > > + */
> > > +struct trace_buffer_meta {
> > > +	__u32		meta_page_size;
> > > +	__u32		meta_struct_len;
> > > +
> > > +	__u32		subbuf_size;
> > > +	__u32		nr_subbufs;
> > > +
> > > +	struct {
> > > +		__u64	lost_events;
> > > +		__u32	id;
> > > +		__u32	read;
> > > +	} reader;
> > > +
> > > +	__u64	flags;
> > > +
> > > +	__u64	entries;
> > > +	__u64	overrun;
> > > +	__u64	read;
> > > +
> > > +	__u64	Reserved1;
> > > +	__u64	Reserved2;
> > > +};
> > > +
> > > +#endif /* _TRACE_MMAP_H_ */
> > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > index cc9ebe593571..793ecc454039 100644
> > > --- a/kernel/trace/ring_buffer.c
> > > +++ b/kernel/trace/ring_buffer.c
> > > @@ -9,6 +9,7 @@
> > >    #include <linux/ring_buffer.h>
> > >    #include <linux/trace_clock.h>
> > >    #include <linux/sched/clock.h>
> > > +#include <linux/cacheflush.h>
> > >    #include <linux/trace_seq.h>
> > >    #include <linux/spinlock.h>
> > >    #include <linux/irq_work.h>
> > > @@ -26,6 +27,7 @@
> > >    #include <linux/list.h>
> > >    #include <linux/cpu.h>
> > >    #include <linux/oom.h>
> > > +#include <linux/mm.h>
> > >    #include <asm/local64.h>
> > >    #include <asm/local.h>
> > > @@ -338,6 +340,7 @@ struct buffer_page {
> > >    	local_t		 entries;	/* entries on this page */
> > >    	unsigned long	 real_end;	/* real end of data */
> > >    	unsigned	 order;		/* order of the page */
> > > +	u32		 id;		/* ID for external mapping */
> > >    	struct buffer_data_page *page;	/* Actual data page */
> > >    };
> > > @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
> > >    	u64				read_stamp;
> > >    	/* pages removed since last reset */
> > >    	unsigned long			pages_removed;
> > > +
> > > +	unsigned int			mapped;
> > > +	struct mutex			mapping_lock;
> > > +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
> > > +	struct trace_buffer_meta	*meta_page;
> > > +
> > >    	/* ring buffer pages to update, > 0 to add, < 0 to remove */
> > >    	long				nr_pages_to_update;
> > >    	struct list_head		new_pages; /* new pages to add */
> > > @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
> > >    	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
> > >    	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
> > >    	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
> > > +	mutex_init(&cpu_buffer->mapping_lock);
> > >    	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > >    			    GFP_KERNEL, cpu_to_node(cpu));
> > > @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
> > >    	return buffer->time_stamp_abs;
> > >    }
> > > -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
> > > -
> > >    static inline unsigned long rb_page_entries(struct buffer_page *bpage)
> > >    {
> > >    	return local_read(&bpage->entries) & RB_WRITE_MASK;
> > > @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
> > >    	page->read = 0;
> > >    }
> > > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > +{
> > > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > +
> > > +	meta->reader.read = cpu_buffer->reader_page->read;
> > > +	meta->reader.id = cpu_buffer->reader_page->id;
> > > +	meta->reader.lost_events = cpu_buffer->lost_events;
> > > +
> > > +	meta->entries = local_read(&cpu_buffer->entries);
> > > +	meta->overrun = local_read(&cpu_buffer->overrun);
> > > +	meta->read = cpu_buffer->read;
> > > +
> > > +	/* Some archs do not have data cache coherency between kernel and user-space */
> > > +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
> > > +}
> > > +
> > >    static void
> > >    rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
> > >    {
> > > @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
> > >    	cpu_buffer->lost_events = 0;
> > >    	cpu_buffer->last_overrun = 0;
> > > +	if (cpu_buffer->mapped)
> > > +		rb_update_meta_page(cpu_buffer);
> > > +
> > >    	rb_head_page_activate(cpu_buffer);
> > >    	cpu_buffer->pages_removed = 0;
> > >    }
> > > @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
> > >    	cpu_buffer_a = buffer_a->buffers[cpu];
> > >    	cpu_buffer_b = buffer_b->buffers[cpu];
> > > +	/* It's up to the callers to not try to swap mapped buffers */
> > > +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
> > > +		ret = -EBUSY;
> > > +		goto out;
> > > +	}
> > > +
> > >    	/* At least make sure the two buffers are somewhat the same */
> > >    	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
> > >    		goto out;
> > > @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> > >    	 * Otherwise, we can simply swap the page with the one passed in.
> > >    	 */
> > >    	if (read || (len < (commit - read)) ||
> > > -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
> > > +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
> > > +	    cpu_buffer->mapped) {
> > >    		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
> > >    		unsigned int rpos = read;
> > >    		unsigned int pos = 0;
> > > @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> > >    		cpu_buffer = buffer->buffers[cpu];
> > > +		if (cpu_buffer->mapped) {
> > > +			err = -EBUSY;
> > > +			goto error;
> > > +		}
> > > +
> > >    		/* Update the number of pages to match the new size */
> > >    		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
> > >    		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
> > > @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> > >    }
> > >    EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
> > > +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (cpu_buffer->meta_page)
> > > +		return 0;
> > > +
> > > +	page = alloc_page(GFP_USER | __GFP_ZERO);
> > > +	if (!page)
> > > +		return -ENOMEM;
> > > +
> > > +	cpu_buffer->meta_page = page_to_virt(page);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > +{
> > > +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
> > > +
> > > +	free_page(addr);
> > > +	cpu_buffer->meta_page = NULL;
> > > +}
> > > +
> > > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > > +				   unsigned long *subbuf_ids)
> > > +{
> > > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > > +	struct buffer_page *first_subbuf, *subbuf;
> > > +	int id = 0;
> > > +
> > > +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > > +	cpu_buffer->reader_page->id = id++;
> > > +
> > > +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > > +	do {
> > > +		if (WARN_ON(id >= nr_subbufs))
> > > +			break;
> > > +
> > > +		subbuf_ids[id] = (unsigned long)subbuf->page;
> > > +		subbuf->id = id;
> > > +
> > > +		rb_inc_page(&subbuf);
> > > +		id++;
> > > +	} while (subbuf != first_subbuf);
> > > +
> > > +	/* install subbuf ID to kern VA translation */
> > > +	cpu_buffer->subbuf_ids = subbuf_ids;
> > > +
> > > +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > > +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
> > > +	meta->meta_struct_len = sizeof(*meta);
> > > +	meta->nr_subbufs = nr_subbufs;
> > > +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > > +
> > > +	rb_update_meta_page(cpu_buffer);
> > > +}
> > > +
> > > +static struct ring_buffer_per_cpu *
> > > +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
> > > +{
> > > +	struct ring_buffer_per_cpu *cpu_buffer;
> > > +
> > > +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> > > +		return ERR_PTR(-EINVAL);
> > > +
> > > +	cpu_buffer = buffer->buffers[cpu];
> > > +
> > > +	mutex_lock(&cpu_buffer->mapping_lock);
> > > +
> > > +	if (!cpu_buffer->mapped) {
> > > +		mutex_unlock(&cpu_buffer->mapping_lock);
> > > +		return ERR_PTR(-ENODEV);
> > > +	}
> > > +
> > > +	return cpu_buffer;
> > > +}
> > > +
> > > +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
> > > +{
> > > +	mutex_unlock(&cpu_buffer->mapping_lock);
> > > +}
> > > +
> > > +/*
> > > + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
> > > + * to be set-up or torn-down.
> > > + */
> > > +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
> > > +			       bool inc)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> > > +
> > > +	if (inc && cpu_buffer->mapped == UINT_MAX)
> > > +		return -EBUSY;
> > > +
> > > +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&cpu_buffer->buffer->mutex);
> > > +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > > +
> > > +	if (inc)
> > > +		cpu_buffer->mapped++;
> > > +	else
> > > +		cpu_buffer->mapped--;
> > > +
> > > +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > > +	mutex_unlock(&cpu_buffer->buffer->mutex);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +#define subbuf_page(off, start) \
> > > +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > > +
> > > +#define foreach_subbuf_page(sub_order, start, page)		\
> > > +	page = subbuf_page(0, (start));				\
> > > +	for (int __off = 0; __off < (1 << (sub_order));		\
> > > +	     __off++, page = subbuf_page(__off, (start)))
> > > +
> > > +/*
> > > + *   +--------------+  pgoff == 0
> > > + *   |   meta page  |
> > > + *   +--------------+  pgoff == 1
> > > + *   |   000000000  |
> > > + *   +--------------+  pgoff == (1 << subbuf_order)
> > > + *   | subbuffer 0  |
> > > + *   |              |
> > > + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
> > > + *   | subbuffer 1  |
> > > + *   |              |
> > > + *         ...
> > > + */
> > > +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> > > +			struct vm_area_struct *vma)
> > > +{
> > > +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> > > +	unsigned int subbuf_pages, subbuf_order;
> > > +	struct page **pages;
> > > +	int p = 0, s = 0;
> > > +	int err;
> > > +
> > > +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> > > +
> > > +	subbuf_order = cpu_buffer->buffer->subbuf_order;
> > > +	subbuf_pages = 1 << subbuf_order;
> > > +
> > > +	if (subbuf_order && pgoff % subbuf_pages)
> > > +		return -EINVAL;
> > > +
> > > +	nr_subbufs = cpu_buffer->nr_pages + 1;
> > > +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
> > > +
> > > +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > +	if (!vma_pages || vma_pages > nr_pages)
> > > +		return -EINVAL;
> > > +
> > > +	nr_pages = vma_pages;
> > > +
> > > +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> > > +	if (!pages)
> > > +		return -ENOMEM;
> > > +
> > > +	if (!pgoff) {
> > > +		unsigned long meta_page_padding;
> > > +
> > > +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
> > > +
> > > +		/*
> > > +		 * Pad with the zero-page to align the meta-page with the
> > > +		 * sub-buffers.
> > > +		 */
> > > +		meta_page_padding = subbuf_pages - 1;
> > > +		while (meta_page_padding-- && p < nr_pages)
> > > +			pages[p++] = ZERO_PAGE(0);
> > 
> > Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO
> > nor VM_PFNMAP can be problematic. So we really need patch #3 logic to
> > use VM_PFNMAP.
> > 
> > It would be cleaner/more obvious if these VMA flag setting would reside

tracing_buffers_mmap() sets both VM_IO and VM_PFNMAP. But yeah, as the vma is
already passed to ring_buffer_map(), the flags could be set here instead, since
this function is really what sets the requirements.

> > here, if possible.
> > 
> > > +	} else {
> > > +		/* Skip the meta-page */
> > > +		pgoff -= subbuf_pages;
> > > +
> > > +		s += pgoff / subbuf_pages;
> > > +	}
> > > +
> > > +	while (s < nr_subbufs && p < nr_pages) {
> > > +		struct page *page;
> > > +
> > > +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
> > > +			if (p >= nr_pages)
> > > +				break;
> > > +
> > > +			pages[p++] = page;
> > > +		}
> > > +		s++;
> > > +	}
> > > +
> > > +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> > 
> > I think Linus suggested it ("avoid all the sub-page ref-counts entirely
> > by using VM_PFNMAP, and use vm_insert_pages()"), but ...
> > vm_insert_pages() will:
> > * Mess with mapcounts
> > * Mess with refcounts
> > 
> > See
> > insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().
> > 
> > So we'll mess with the mapcount and refcount of the shared zeropage ...
> > hmmmm
> > 
> > If I am not wrong, vm_normal_page() would not return the shared zeropage
> > in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so
> > unmap()->...->zap_present_ptes() would not decrement the refcount and we
> > could overflow it. ... we also shouldn't ever mess with the mapcount of
> > the shared zeropage in the first place.
> > 
> > vm_insert_page() is clearer on that: "This allows drivers to insert
> > individual pages they've allocated into a user vma". You didn't allocate
> > the shared zeropage.
> > 
> > I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at
> > the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is
> > already set and it would set VM_MIXEDMAP ... similarly
> > vmf_insert_pfn_prot() would not be happy about that at all ...
> > 
> > BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
> > (VM_PFNMAP|VM_MIXEDMAP));
> > 
> > ... remap_pfn_range() is used by io_uring_mmap for a similar purpose.
> > But it only supports a single PFN range at a time and requires the
> > caller to handle refcounting of pages.
> > 
> > It's getting late in Germany, so I might be missing something; but using
> > the shared zeropage here might be a problem.
> > 
> 
> I was just about to write code to cleanly support vm_insert_pages() to
> consume zeropages ... when I started to wonder about something else:

Alternatively, for the moment, we could allocate a meta-page of the same size as
the subbufs? (of course the downside is wasting a bit of memory)
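
(Something along these lines, as a sketch: allocate the meta-page at 
subbuf_order -- with __GFP_COMP, like the sub-buffers in patch 1 -- so 
__rb_map_vma() can insert its pages instead of the zero-page padding. Only 
the two helpers below would change:)

static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
{
	int order = cpu_buffer->buffer->subbuf_order;
	struct page *page;

	if (cpu_buffer->meta_page)
		return 0;

	/* Same order as the sub-buffers, so the mapping layout is unchanged */
	page = alloc_pages(GFP_USER | __GFP_COMP | __GFP_ZERO, order);
	if (!page)
		return -ENOMEM;

	cpu_buffer->meta_page = page_to_virt(page);

	return 0;
}

static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
{
	unsigned long addr = (unsigned long)cpu_buffer->meta_page;

	free_pages(addr, cpu_buffer->buffer->subbuf_order);
	cpu_buffer->meta_page = NULL;
}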

> 
> 
> +	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
> +	    !(vma->vm_flags & VM_MAYSHARE))
> +		return -EPERM;
> +
> 
> You disallow writable mappings. But what about using mprotect() afterwards
> to allow for write permissions?
> 
> Likely you'd want to remove VM_MAYWRITE from the flags, otherwise mprotect()
> might later succeed.

The vm_flags_mod below is clearing VM_MAYWRITE already. Isn't it enough? 
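
On the user-space side, the expectation would then be roughly (a sketch; the 
path and length are just assumptions for the example):

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw";
	size_t len = getpagesize();
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		err(1, "open");

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		err(1, "mmap");

	/* With VM_MAYWRITE cleared by the kernel, this must fail with EACCES */
	if (mprotect(map, len, PROT_READ | PROT_WRITE) != -1 || errno != EACCES)
		errx(1, "mapping could be made writable");

	return 0;
}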

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-22 18:20       ` Vincent Donnefort
@ 2024-04-22 18:27         ` David Hildenbrand
  2024-04-22 20:31           ` Vincent Donnefort
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2024-04-22 18:27 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On 22.04.24 20:20, Vincent Donnefort wrote:
> Hi David,
> 
> Thanks for having a look, very much appreciated!
> 
> On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:
>> On 19.04.24 20:25, David Hildenbrand wrote:
>>> On 06.04.24 19:36, Vincent Donnefort wrote:
>>>> In preparation for allowing the user-space to map a ring-buffer, add
>>>> a set of mapping functions:
>>>>
>>>>      ring_buffer_{map,unmap}()
>>>>
>>>> And controls on the ring-buffer:
>>>>
>>>>      ring_buffer_map_get_reader()  /* swap reader and head */
>>>>
>>>> Mapping the ring-buffer also involves:
>>>>
>>>>      A unique ID for each subbuf of the ring-buffer, currently they are
>>>>      only identified through their in-kernel VA.
>>>>
>>>>      A meta-page, where are stored ring-buffer statistics and a
>>>>      description for the current reader
>>>>
>>>> The linear mapping exposes the meta-page, and each subbuf of the
>>>> ring-buffer, ordered following their unique ID, assigned during the
>>>> first mapping.
>>>>
>>>> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
>>>> size will remain unmodified and the splice enabling functions will in
>>>> reality simply memcpy the data instead of swapping subbufs.
>>>>
>>>> CC: <linux-mm@kvack.org>
>>>> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>>>>
>>>> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>>>> index dc5ae4e96aee..96d2140b471e 100644
>>>> --- a/include/linux/ring_buffer.h
>>>> +++ b/include/linux/ring_buffer.h
>>>> @@ -6,6 +6,8 @@
>>>>     #include <linux/seq_file.h>
>>>>     #include <linux/poll.h>
>>>> +#include <uapi/linux/trace_mmap.h>
>>>> +
>>>>     struct trace_buffer;
>>>>     struct ring_buffer_iter;
>>>> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
>>>>     #define trace_rb_cpu_prepare	NULL
>>>>     #endif
>>>> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
>>>> +		    struct vm_area_struct *vma);
>>>> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
>>>> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>>>>     #endif /* _LINUX_RING_BUFFER_H */
>>>> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
>>>> new file mode 100644
>>>> index 000000000000..ffcd8dfcaa4f
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/trace_mmap.h
>>>> @@ -0,0 +1,46 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>>> +#ifndef _TRACE_MMAP_H_
>>>> +#define _TRACE_MMAP_H_
>>>> +
>>>> +#include <linux/types.h>
>>>> +
>>>> +/**
>>>> + * struct trace_buffer_meta - Ring-buffer Meta-page description
>>>> + * @meta_page_size:	Size of this meta-page.
>>>> + * @meta_struct_len:	Size of this structure.
>>>> + * @subbuf_size:	Size of each sub-buffer.
>>>> + * @nr_subbufs:		Number of subbfs in the ring-buffer, including the reader.
>>>> + * @reader.lost_events:	Number of events lost at the time of the reader swap.
>>>> + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
>>>> + * @reader.read:	Number of bytes read on the reader subbuf.
>>>> + * @flags:		Placeholder for now, 0 until new features are supported.
>>>> + * @entries:		Number of entries in the ring-buffer.
>>>> + * @overrun:		Number of entries lost in the ring-buffer.
>>>> + * @read:		Number of entries that have been read.
>>>> + * @Reserved1:		Reserved for future use.
>>>> + * @Reserved2:		Reserved for future use.
>>>> + */
>>>> +struct trace_buffer_meta {
>>>> +	__u32		meta_page_size;
>>>> +	__u32		meta_struct_len;
>>>> +
>>>> +	__u32		subbuf_size;
>>>> +	__u32		nr_subbufs;
>>>> +
>>>> +	struct {
>>>> +		__u64	lost_events;
>>>> +		__u32	id;
>>>> +		__u32	read;
>>>> +	} reader;
>>>> +
>>>> +	__u64	flags;
>>>> +
>>>> +	__u64	entries;
>>>> +	__u64	overrun;
>>>> +	__u64	read;
>>>> +
>>>> +	__u64	Reserved1;
>>>> +	__u64	Reserved2;
>>>> +};
>>>> +
>>>> +#endif /* _TRACE_MMAP_H_ */
>>>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>>>> index cc9ebe593571..793ecc454039 100644
>>>> --- a/kernel/trace/ring_buffer.c
>>>> +++ b/kernel/trace/ring_buffer.c
>>>> @@ -9,6 +9,7 @@
>>>>     #include <linux/ring_buffer.h>
>>>>     #include <linux/trace_clock.h>
>>>>     #include <linux/sched/clock.h>
>>>> +#include <linux/cacheflush.h>
>>>>     #include <linux/trace_seq.h>
>>>>     #include <linux/spinlock.h>
>>>>     #include <linux/irq_work.h>
>>>> @@ -26,6 +27,7 @@
>>>>     #include <linux/list.h>
>>>>     #include <linux/cpu.h>
>>>>     #include <linux/oom.h>
>>>> +#include <linux/mm.h>
>>>>     #include <asm/local64.h>
>>>>     #include <asm/local.h>
>>>> @@ -338,6 +340,7 @@ struct buffer_page {
>>>>     	local_t		 entries;	/* entries on this page */
>>>>     	unsigned long	 real_end;	/* real end of data */
>>>>     	unsigned	 order;		/* order of the page */
>>>> +	u32		 id;		/* ID for external mapping */
>>>>     	struct buffer_data_page *page;	/* Actual data page */
>>>>     };
>>>> @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
>>>>     	u64				read_stamp;
>>>>     	/* pages removed since last reset */
>>>>     	unsigned long			pages_removed;
>>>> +
>>>> +	unsigned int			mapped;
>>>> +	struct mutex			mapping_lock;
>>>> +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
>>>> +	struct trace_buffer_meta	*meta_page;
>>>> +
>>>>     	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>>>>     	long				nr_pages_to_update;
>>>>     	struct list_head		new_pages; /* new pages to add */
>>>> @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>>>>     	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
>>>>     	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>>>>     	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
>>>> +	mutex_init(&cpu_buffer->mapping_lock);
>>>>     	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>>>>     			    GFP_KERNEL, cpu_to_node(cpu));
>>>> @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
>>>>     	return buffer->time_stamp_abs;
>>>>     }
>>>> -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
>>>> -
>>>>     static inline unsigned long rb_page_entries(struct buffer_page *bpage)
>>>>     {
>>>>     	return local_read(&bpage->entries) & RB_WRITE_MASK;
>>>> @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
>>>>     	page->read = 0;
>>>>     }
>>>> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>> +{
>>>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>>>> +
>>>> +	meta->reader.read = cpu_buffer->reader_page->read;
>>>> +	meta->reader.id = cpu_buffer->reader_page->id;
>>>> +	meta->reader.lost_events = cpu_buffer->lost_events;
>>>> +
>>>> +	meta->entries = local_read(&cpu_buffer->entries);
>>>> +	meta->overrun = local_read(&cpu_buffer->overrun);
>>>> +	meta->read = cpu_buffer->read;
>>>> +
>>>> +	/* Some archs do not have data cache coherency between kernel and user-space */
>>>> +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
>>>> +}
>>>> +
>>>>     static void
>>>>     rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>>>     {
>>>> @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>>>     	cpu_buffer->lost_events = 0;
>>>>     	cpu_buffer->last_overrun = 0;
>>>> +	if (cpu_buffer->mapped)
>>>> +		rb_update_meta_page(cpu_buffer);
>>>> +
>>>>     	rb_head_page_activate(cpu_buffer);
>>>>     	cpu_buffer->pages_removed = 0;
>>>>     }
>>>> @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
>>>>     	cpu_buffer_a = buffer_a->buffers[cpu];
>>>>     	cpu_buffer_b = buffer_b->buffers[cpu];
>>>> +	/* It's up to the callers to not try to swap mapped buffers */
>>>> +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
>>>> +		ret = -EBUSY;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>>     	/* At least make sure the two buffers are somewhat the same */
>>>>     	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
>>>>     		goto out;
>>>> @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>>>>     	 * Otherwise, we can simply swap the page with the one passed in.
>>>>     	 */
>>>>     	if (read || (len < (commit - read)) ||
>>>> -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
>>>> +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
>>>> +	    cpu_buffer->mapped) {
>>>>     		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
>>>>     		unsigned int rpos = read;
>>>>     		unsigned int pos = 0;
>>>> @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>>>     		cpu_buffer = buffer->buffers[cpu];
>>>> +		if (cpu_buffer->mapped) {
>>>> +			err = -EBUSY;
>>>> +			goto error;
>>>> +		}
>>>> +
>>>>     		/* Update the number of pages to match the new size */
>>>>     		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
>>>>     		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
>>>> @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>>>     }
>>>>     EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>>>> +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>> +{
>>>> +	struct page *page;
>>>> +
>>>> +	if (cpu_buffer->meta_page)
>>>> +		return 0;
>>>> +
>>>> +	page = alloc_page(GFP_USER | __GFP_ZERO);
>>>> +	if (!page)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	cpu_buffer->meta_page = page_to_virt(page);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>> +{
>>>> +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
>>>> +
>>>> +	free_page(addr);
>>>> +	cpu_buffer->meta_page = NULL;
>>>> +}
>>>> +
>>>> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
>>>> +				   unsigned long *subbuf_ids)
>>>> +{
>>>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>>>> +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
>>>> +	struct buffer_page *first_subbuf, *subbuf;
>>>> +	int id = 0;
>>>> +
>>>> +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
>>>> +	cpu_buffer->reader_page->id = id++;
>>>> +
>>>> +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
>>>> +	do {
>>>> +		if (WARN_ON(id >= nr_subbufs))
>>>> +			break;
>>>> +
>>>> +		subbuf_ids[id] = (unsigned long)subbuf->page;
>>>> +		subbuf->id = id;
>>>> +
>>>> +		rb_inc_page(&subbuf);
>>>> +		id++;
>>>> +	} while (subbuf != first_subbuf);
>>>> +
>>>> +	/* install subbuf ID to kern VA translation */
>>>> +	cpu_buffer->subbuf_ids = subbuf_ids;
>>>> +
>>>> +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
>>>> +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
>>>> +	meta->meta_struct_len = sizeof(*meta);
>>>> +	meta->nr_subbufs = nr_subbufs;
>>>> +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
>>>> +
>>>> +	rb_update_meta_page(cpu_buffer);
>>>> +}
>>>> +
>>>> +static struct ring_buffer_per_cpu *
>>>> +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
>>>> +{
>>>> +	struct ring_buffer_per_cpu *cpu_buffer;
>>>> +
>>>> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
>>>> +		return ERR_PTR(-EINVAL);
>>>> +
>>>> +	cpu_buffer = buffer->buffers[cpu];
>>>> +
>>>> +	mutex_lock(&cpu_buffer->mapping_lock);
>>>> +
>>>> +	if (!cpu_buffer->mapped) {
>>>> +		mutex_unlock(&cpu_buffer->mapping_lock);
>>>> +		return ERR_PTR(-ENODEV);
>>>> +	}
>>>> +
>>>> +	return cpu_buffer;
>>>> +}
>>>> +
>>>> +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
>>>> +{
>>>> +	mutex_unlock(&cpu_buffer->mapping_lock);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
>>>> + * to be set-up or torn-down.
>>>> + */
>>>> +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
>>>> +			       bool inc)
>>>> +{
>>>> +	unsigned long flags;
>>>> +
>>>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>>>> +
>>>> +	if (inc && cpu_buffer->mapped == UINT_MAX)
>>>> +		return -EBUSY;
>>>> +
>>>> +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
>>>> +		return -EINVAL;
>>>> +
>>>> +	mutex_lock(&cpu_buffer->buffer->mutex);
>>>> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>>>> +
>>>> +	if (inc)
>>>> +		cpu_buffer->mapped++;
>>>> +	else
>>>> +		cpu_buffer->mapped--;
>>>> +
>>>> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>>>> +	mutex_unlock(&cpu_buffer->buffer->mutex);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +#define subbuf_page(off, start) \
>>>> +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
>>>> +
>>>> +#define foreach_subbuf_page(sub_order, start, page)		\
>>>> +	page = subbuf_page(0, (start));				\
>>>> +	for (int __off = 0; __off < (1 << (sub_order));		\
>>>> +	     __off++, page = subbuf_page(__off, (start)))
>>>> +
>>>> +/*
>>>> + *   +--------------+  pgoff == 0
>>>> + *   |   meta page  |
>>>> + *   +--------------+  pgoff == 1
>>>> + *   |   000000000  |
>>>> + *   +--------------+  pgoff == (1 << subbuf_order)
>>>> + *   | subbuffer 0  |
>>>> + *   |              |
>>>> + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
>>>> + *   | subbuffer 1  |
>>>> + *   |              |
>>>> + *         ...
>>>> + */
>>>> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
>>>> +			struct vm_area_struct *vma)
>>>> +{
>>>> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
>>>> +	unsigned int subbuf_pages, subbuf_order;
>>>> +	struct page **pages;
>>>> +	int p = 0, s = 0;
>>>> +	int err;
>>>> +
>>>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>>>> +
>>>> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
>>>> +	subbuf_pages = 1 << subbuf_order;
>>>> +
>>>> +	if (subbuf_order && pgoff % subbuf_pages)
>>>> +		return -EINVAL;
>>>> +
>>>> +	nr_subbufs = cpu_buffer->nr_pages + 1;
>>>> +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
>>>> +
>>>> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>> +	if (!vma_pages || vma_pages > nr_pages)
>>>> +		return -EINVAL;
>>>> +
>>>> +	nr_pages = vma_pages;
>>>> +
>>>> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
>>>> +	if (!pages)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (!pgoff) {
>>>> +		unsigned long meta_page_padding;
>>>> +
>>>> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
>>>> +
>>>> +		/*
>>>> +		 * Pad with the zero-page to align the meta-page with the
>>>> +		 * sub-buffers.
>>>> +		 */
>>>> +		meta_page_padding = subbuf_pages - 1;
>>>> +		while (meta_page_padding-- && p < nr_pages)
>>>> +			pages[p++] = ZERO_PAGE(0);
>>>
>>> Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO
>>> nor VM_PFNMAP can be problematic. So we really need patch #3 logic to
>>> use VM_PFNMAP.
>>>
>>> It would be cleaner/more obvious if these VMA flag setting would reside
> 
> tracing_buffers_mmap() sets both VM_IO and VM_PFNMAP. But yeah, as the vma is
> already passed to ring_buffer_map(), the flags could be set here instead, since
> this function is really what sets the requirements.
> 
>>> here, if possible.
>>>
>>>> +	} else {
>>>> +		/* Skip the meta-page */
>>>> +		pgoff -= subbuf_pages;
>>>> +
>>>> +		s += pgoff / subbuf_pages;
>>>> +	}
>>>> +
>>>> +	while (s < nr_subbufs && p < nr_pages) {
>>>> +		struct page *page;
>>>> +
>>>> +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
>>>> +			if (p >= nr_pages)
>>>> +				break;
>>>> +
>>>> +			pages[p++] = page;
>>>> +		}
>>>> +		s++;
>>>> +	}
>>>> +
>>>> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
>>>
>>> I think Linus suggested it ("avoid all the sub-page ref-counts entirely
>>> by using VM_PFNMAP, and use vm_insert_pages()"), but ...
>>> vm_insert_pages() will:
>>> * Mess with mapcounts
>>> * Mess with refcounts
>>>
>>> See
>>> insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().
>>>
>>> So we'll mess with the mapcount and refcount of the shared zeropage ...
>>> hmmmm
>>>
>>> If I am not wrong, vm_normal_page() would not return the shared zeropage
>>> in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so
>>> unmap()->...->zap_present_ptes() would not decrement the refcount and we
>>> could overflow it. ... we also shouldn't ever mess with the mapcount of
>>> the shared zeropage in the first place.
>>>
>>> vm_insert_page() is clearer on that: "This allows drivers to insert
>>> individual pages they've allocated into a user vma". You didn't allocate
>>> the shared zeropage.
>>>
>>> I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at
>>> the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is
>>> already set and it would set VM_MIXEDMAP ... similarly
>>> vmf_insert_pfn_prot() would not be happy about that at all ...
>>>
>>> BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
>>> (VM_PFNMAP|VM_MIXEDMAP));
>>>
>>> ... remap_pfn_range() is used by io_uring_mmap for a similar purpose.
>>> But it only supports a single PFN range at a time and requires the
>>> caller to handle refcounting of pages.
>>>
>>> It's getting late in Germany, so I might be missing something; but using
>>> the shared zeropage here might be a problem.
>>>
>>
>> I was just about to write code to cleanly support vm_insert_pages() to
>> consume zeropages ... when I started to wonder about something else:
> 
> Alternatively, we could for now allocate a meta-page of the same size as
> the subbufs? (of course the downside is wasting a bit of memory)

Supporting the shared zeropage should be possible. We just have to be 
careful that no other code can accidentally set it writable. I'll have 
to further think about that (not just your user, but making it sane to 
use by new code as well).

So far, we primarily only have shared zeropages in the MAP_PRIVATE 
mappings. Dax is a corner case where we have them in MAP_SHARED 
mappings. PFNMAP would not be problematic, but MIXEDMAP is. (again, I am 
not 100% sure about combining both ...)

I'll further think about that one, and try crafting a reasonable patch.

Could we start with no shared zeropage and implement that as an 
optimization on top?
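
(For illustration only, a rough sketch of the alternative Vincent mentions
above: allocating the meta-page at the sub-buffer order removes the need for
any zero-page padding, at the cost of some memory. Names reuse the patch's;
this is not what the series currently does.)

static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
{
	unsigned int order = cpu_buffer->buffer->subbuf_order;
	struct page *page;

	if (cpu_buffer->meta_page)
		return 0;

	/*
	 * One subbuf_order allocation; rb_free_meta_page() would then need
	 * free_pages(addr, order) instead of free_page(addr).
	 */
	page = alloc_pages(GFP_USER | __GFP_ZERO, order);
	if (!page)
		return -ENOMEM;

	cpu_buffer->meta_page = page_to_virt(page);

	return 0;
}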

> 
>>
>>
>> +	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
>> +	    !(vma->vm_flags & VM_MAYSHARE))
>> +		return -EPERM;
>> +
>>
>> You disallow writable mappings. But what about using mprotect() afterwards
>> to allow for write permissions?
>>
>> Likely you'd want to remove VM_MAYWRITE from the flags, otherwise mprotect()
>> might later succeed.
> 
> The vm_flags_mod below is clearing VM_MAYWRITE already. Isn't it enough?

Oh, maybe I missed that!
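
(For readers following along, the vm_flags_mod() call Vincent refers to is
presumably along these lines in the mapping path -- the set mask here is a
guess; only the VM_MAYWRITE clear matters for the mprotect() point:)

	/*
	 * Clearing VM_MAYWRITE is what prevents a later mprotect(PROT_WRITE)
	 * from succeeding on the read-only mapping.
	 */
	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);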

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-22 18:27         ` David Hildenbrand
@ 2024-04-22 20:31           ` Vincent Donnefort
  2024-04-23 15:18             ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Vincent Donnefort @ 2024-04-22 20:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote:
> On 22.04.24 20:20, Vincent Donnefort wrote:
> > Hi David,
> > 
> > Thanks for having a look, very much appreciated!
> > 
> > On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:
> > > On 19.04.24 20:25, David Hildenbrand wrote:
> > > > On 06.04.24 19:36, Vincent Donnefort wrote:
> > > > > In preparation for allowing the user-space to map a ring-buffer, add
> > > > > a set of mapping functions:
> > > > > 
> > > > >      ring_buffer_{map,unmap}()
> > > > > 
> > > > > And controls on the ring-buffer:
> > > > > 
> > > > >      ring_buffer_map_get_reader()  /* swap reader and head */
> > > > > 
> > > > > Mapping the ring-buffer also involves:
> > > > > 
> > > > >      A unique ID for each subbuf of the ring-buffer, currently they are
> > > > >      only identified through their in-kernel VA.
> > > > > 
> > > > >      A meta-page, where ring-buffer statistics and a
> > > > >      description for the current reader are stored
> > > > > 
> > > > > The linear mapping exposes the meta-page, and each subbuf of the
> > > > > ring-buffer, ordered following their unique ID, assigned during the
> > > > > first mapping.
> > > > > 
> > > > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> > > > > size will remain unmodified and the splice enabling functions will in
> > > > > reality simply memcpy the data instead of swapping subbufs.
> > > > > 
> > > > > CC: <linux-mm@kvack.org>
> > > > > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> > > > > 
> > > > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> > > > > index dc5ae4e96aee..96d2140b471e 100644
> > > > > --- a/include/linux/ring_buffer.h
> > > > > +++ b/include/linux/ring_buffer.h
> > > > > @@ -6,6 +6,8 @@
> > > > >     #include <linux/seq_file.h>
> > > > >     #include <linux/poll.h>
> > > > > +#include <uapi/linux/trace_mmap.h>
> > > > > +
> > > > >     struct trace_buffer;
> > > > >     struct ring_buffer_iter;
> > > > > @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
> > > > >     #define trace_rb_cpu_prepare	NULL
> > > > >     #endif
> > > > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> > > > > +		    struct vm_area_struct *vma);
> > > > > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
> > > > > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
> > > > >     #endif /* _LINUX_RING_BUFFER_H */
> > > > > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
> > > > > new file mode 100644
> > > > > index 000000000000..ffcd8dfcaa4f
> > > > > --- /dev/null
> > > > > +++ b/include/uapi/linux/trace_mmap.h
> > > > > @@ -0,0 +1,46 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > > > > +#ifndef _TRACE_MMAP_H_
> > > > > +#define _TRACE_MMAP_H_
> > > > > +
> > > > > +#include <linux/types.h>
> > > > > +
> > > > > +/**
> > > > > + * struct trace_buffer_meta - Ring-buffer Meta-page description
> > > > > + * @meta_page_size:	Size of this meta-page.
> > > > > + * @meta_struct_len:	Size of this structure.
> > > > > + * @subbuf_size:	Size of each sub-buffer.
> > > > > + * @nr_subbufs:		Number of subbufs in the ring-buffer, including the reader.
> > > > > + * @reader.lost_events:	Number of events lost at the time of the reader swap.
> > > > > + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
> > > > > + * @reader.read:	Number of bytes read on the reader subbuf.
> > > > > + * @flags:		Placeholder for now, 0 until new features are supported.
> > > > > + * @entries:		Number of entries in the ring-buffer.
> > > > > + * @overrun:		Number of entries lost in the ring-buffer.
> > > > > + * @read:		Number of entries that have been read.
> > > > > + * @Reserved1:		Reserved for future use.
> > > > > + * @Reserved2:		Reserved for future use.
> > > > > + */
> > > > > +struct trace_buffer_meta {
> > > > > +	__u32		meta_page_size;
> > > > > +	__u32		meta_struct_len;
> > > > > +
> > > > > +	__u32		subbuf_size;
> > > > > +	__u32		nr_subbufs;
> > > > > +
> > > > > +	struct {
> > > > > +		__u64	lost_events;
> > > > > +		__u32	id;
> > > > > +		__u32	read;
> > > > > +	} reader;
> > > > > +
> > > > > +	__u64	flags;
> > > > > +
> > > > > +	__u64	entries;
> > > > > +	__u64	overrun;
> > > > > +	__u64	read;
> > > > > +
> > > > > +	__u64	Reserved1;
> > > > > +	__u64	Reserved2;
> > > > > +};
> > > > > +
> > > > > +#endif /* _TRACE_MMAP_H_ */
> > > > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > > > > index cc9ebe593571..793ecc454039 100644
> > > > > --- a/kernel/trace/ring_buffer.c
> > > > > +++ b/kernel/trace/ring_buffer.c
> > > > > @@ -9,6 +9,7 @@
> > > > >     #include <linux/ring_buffer.h>
> > > > >     #include <linux/trace_clock.h>
> > > > >     #include <linux/sched/clock.h>
> > > > > +#include <linux/cacheflush.h>
> > > > >     #include <linux/trace_seq.h>
> > > > >     #include <linux/spinlock.h>
> > > > >     #include <linux/irq_work.h>
> > > > > @@ -26,6 +27,7 @@
> > > > >     #include <linux/list.h>
> > > > >     #include <linux/cpu.h>
> > > > >     #include <linux/oom.h>
> > > > > +#include <linux/mm.h>
> > > > >     #include <asm/local64.h>
> > > > >     #include <asm/local.h>
> > > > > @@ -338,6 +340,7 @@ struct buffer_page {
> > > > >     	local_t		 entries;	/* entries on this page */
> > > > >     	unsigned long	 real_end;	/* real end of data */
> > > > >     	unsigned	 order;		/* order of the page */
> > > > > +	u32		 id;		/* ID for external mapping */
> > > > >     	struct buffer_data_page *page;	/* Actual data page */
> > > > >     };
> > > > > @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
> > > > >     	u64				read_stamp;
> > > > >     	/* pages removed since last reset */
> > > > >     	unsigned long			pages_removed;
> > > > > +
> > > > > +	unsigned int			mapped;
> > > > > +	struct mutex			mapping_lock;
> > > > > +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
> > > > > +	struct trace_buffer_meta	*meta_page;
> > > > > +
> > > > >     	/* ring buffer pages to update, > 0 to add, < 0 to remove */
> > > > >     	long				nr_pages_to_update;
> > > > >     	struct list_head		new_pages; /* new pages to add */
> > > > > @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
> > > > >     	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
> > > > >     	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
> > > > >     	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
> > > > > +	mutex_init(&cpu_buffer->mapping_lock);
> > > > >     	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> > > > >     			    GFP_KERNEL, cpu_to_node(cpu));
> > > > > @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
> > > > >     	return buffer->time_stamp_abs;
> > > > >     }
> > > > > -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
> > > > > -
> > > > >     static inline unsigned long rb_page_entries(struct buffer_page *bpage)
> > > > >     {
> > > > >     	return local_read(&bpage->entries) & RB_WRITE_MASK;
> > > > > @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
> > > > >     	page->read = 0;
> > > > >     }
> > > > > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > > > +{
> > > > > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > > > +
> > > > > +	meta->reader.read = cpu_buffer->reader_page->read;
> > > > > +	meta->reader.id = cpu_buffer->reader_page->id;
> > > > > +	meta->reader.lost_events = cpu_buffer->lost_events;
> > > > > +
> > > > > +	meta->entries = local_read(&cpu_buffer->entries);
> > > > > +	meta->overrun = local_read(&cpu_buffer->overrun);
> > > > > +	meta->read = cpu_buffer->read;
> > > > > +
> > > > > +	/* Some archs do not have data cache coherency between kernel and user-space */
> > > > > +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
> > > > > +}
> > > > > +
> > > > >     static void
> > > > >     rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
> > > > >     {
> > > > > @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
> > > > >     	cpu_buffer->lost_events = 0;
> > > > >     	cpu_buffer->last_overrun = 0;
> > > > > +	if (cpu_buffer->mapped)
> > > > > +		rb_update_meta_page(cpu_buffer);
> > > > > +
> > > > >     	rb_head_page_activate(cpu_buffer);
> > > > >     	cpu_buffer->pages_removed = 0;
> > > > >     }
> > > > > @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
> > > > >     	cpu_buffer_a = buffer_a->buffers[cpu];
> > > > >     	cpu_buffer_b = buffer_b->buffers[cpu];
> > > > > +	/* It's up to the callers to not try to swap mapped buffers */
> > > > > +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
> > > > > +		ret = -EBUSY;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > >     	/* At least make sure the two buffers are somewhat the same */
> > > > >     	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
> > > > >     		goto out;
> > > > > @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
> > > > >     	 * Otherwise, we can simply swap the page with the one passed in.
> > > > >     	 */
> > > > >     	if (read || (len < (commit - read)) ||
> > > > > -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
> > > > > +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
> > > > > +	    cpu_buffer->mapped) {
> > > > >     		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
> > > > >     		unsigned int rpos = read;
> > > > >     		unsigned int pos = 0;
> > > > > @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> > > > >     		cpu_buffer = buffer->buffers[cpu];
> > > > > +		if (cpu_buffer->mapped) {
> > > > > +			err = -EBUSY;
> > > > > +			goto error;
> > > > > +		}
> > > > > +
> > > > >     		/* Update the number of pages to match the new size */
> > > > >     		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
> > > > >     		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
> > > > > @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> > > > >     }
> > > > >     EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
> > > > > +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > > > +{
> > > > > +	struct page *page;
> > > > > +
> > > > > +	if (cpu_buffer->meta_page)
> > > > > +		return 0;
> > > > > +
> > > > > +	page = alloc_page(GFP_USER | __GFP_ZERO);
> > > > > +	if (!page)
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	cpu_buffer->meta_page = page_to_virt(page);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > > > +{
> > > > > +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
> > > > > +
> > > > > +	free_page(addr);
> > > > > +	cpu_buffer->meta_page = NULL;
> > > > > +}
> > > > > +
> > > > > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
> > > > > +				   unsigned long *subbuf_ids)
> > > > > +{
> > > > > +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > > > +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
> > > > > +	struct buffer_page *first_subbuf, *subbuf;
> > > > > +	int id = 0;
> > > > > +
> > > > > +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
> > > > > +	cpu_buffer->reader_page->id = id++;
> > > > > +
> > > > > +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
> > > > > +	do {
> > > > > +		if (WARN_ON(id >= nr_subbufs))
> > > > > +			break;
> > > > > +
> > > > > +		subbuf_ids[id] = (unsigned long)subbuf->page;
> > > > > +		subbuf->id = id;
> > > > > +
> > > > > +		rb_inc_page(&subbuf);
> > > > > +		id++;
> > > > > +	} while (subbuf != first_subbuf);
> > > > > +
> > > > > +	/* install subbuf ID to kern VA translation */
> > > > > +	cpu_buffer->subbuf_ids = subbuf_ids;
> > > > > +
> > > > > +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
> > > > > +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
> > > > > +	meta->meta_struct_len = sizeof(*meta);
> > > > > +	meta->nr_subbufs = nr_subbufs;
> > > > > +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
> > > > > +
> > > > > +	rb_update_meta_page(cpu_buffer);
> > > > > +}
> > > > > +
> > > > > +static struct ring_buffer_per_cpu *
> > > > > +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
> > > > > +{
> > > > > +	struct ring_buffer_per_cpu *cpu_buffer;
> > > > > +
> > > > > +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> > > > > +		return ERR_PTR(-EINVAL);
> > > > > +
> > > > > +	cpu_buffer = buffer->buffers[cpu];
> > > > > +
> > > > > +	mutex_lock(&cpu_buffer->mapping_lock);
> > > > > +
> > > > > +	if (!cpu_buffer->mapped) {
> > > > > +		mutex_unlock(&cpu_buffer->mapping_lock);
> > > > > +		return ERR_PTR(-ENODEV);
> > > > > +	}
> > > > > +
> > > > > +	return cpu_buffer;
> > > > > +}
> > > > > +
> > > > > +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
> > > > > +{
> > > > > +	mutex_unlock(&cpu_buffer->mapping_lock);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
> > > > > + * to be set-up or torn-down.
> > > > > + */
> > > > > +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
> > > > > +			       bool inc)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +
> > > > > +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> > > > > +
> > > > > +	if (inc && cpu_buffer->mapped == UINT_MAX)
> > > > > +		return -EBUSY;
> > > > > +
> > > > > +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	mutex_lock(&cpu_buffer->buffer->mutex);
> > > > > +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > > > > +
> > > > > +	if (inc)
> > > > > +		cpu_buffer->mapped++;
> > > > > +	else
> > > > > +		cpu_buffer->mapped--;
> > > > > +
> > > > > +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > > > > +	mutex_unlock(&cpu_buffer->buffer->mutex);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +#define subbuf_page(off, start) \
> > > > > +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
> > > > > +
> > > > > +#define foreach_subbuf_page(sub_order, start, page)		\
> > > > > +	page = subbuf_page(0, (start));				\
> > > > > +	for (int __off = 0; __off < (1 << (sub_order));		\
> > > > > +	     __off++, page = subbuf_page(__off, (start)))
> > > > > +
> > > > > +/*
> > > > > + *   +--------------+  pgoff == 0
> > > > > + *   |   meta page  |
> > > > > + *   +--------------+  pgoff == 1
> > > > > + *   |   000000000  |
> > > > > + *   +--------------+  pgoff == (1 << subbuf_order)
> > > > > + *   | subbuffer 0  |
> > > > > + *   |              |
> > > > > + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
> > > > > + *   | subbuffer 1  |
> > > > > + *   |              |
> > > > > + *         ...
> > > > > + */
> > > > > +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> > > > > +			struct vm_area_struct *vma)
> > > > > +{
> > > > > +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> > > > > +	unsigned int subbuf_pages, subbuf_order;
> > > > > +	struct page **pages;
> > > > > +	int p = 0, s = 0;
> > > > > +	int err;
> > > > > +
> > > > > +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> > > > > +
> > > > > +	subbuf_order = cpu_buffer->buffer->subbuf_order;
> > > > > +	subbuf_pages = 1 << subbuf_order;
> > > > > +
> > > > > +	if (subbuf_order && pgoff % subbuf_pages)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	nr_subbufs = cpu_buffer->nr_pages + 1;
> > > > > +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
> > > > > +
> > > > > +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > > > > +	if (!vma_pages || vma_pages > nr_pages)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	nr_pages = vma_pages;
> > > > > +
> > > > > +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> > > > > +	if (!pages)
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	if (!pgoff) {
> > > > > +		unsigned long meta_page_padding;
> > > > > +
> > > > > +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
> > > > > +
> > > > > +		/*
> > > > > +		 * Pad with the zero-page to align the meta-page with the
> > > > > +		 * sub-buffers.
> > > > > +		 */
> > > > > +		meta_page_padding = subbuf_pages - 1;
> > > > > +		while (meta_page_padding-- && p < nr_pages)
> > > > > +			pages[p++] = ZERO_PAGE(0);
> > > > 
> > > > Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO
> > > > nor VM_PFNMAP can be problematic. So we really need patch #3 logic to
> > > > use VM_PFNMAP.
> > > > 
> > > > It would be cleaner/more obvious if these VMA flag settings would reside
> > 
> > tracing_buffers_mmap() sets both VM_IO and VM_PFNMAP. But yeah, as the vma is
> > already passed to ring_buffer_map(), the flags could be set here, as this
> > function is what really sets the requirements.
> > 
> > > > here, if possible.
> > > > 
> > > > > +	} else {
> > > > > +		/* Skip the meta-page */
> > > > > +		pgoff -= subbuf_pages;
> > > > > +
> > > > > +		s += pgoff / subbuf_pages;
> > > > > +	}
> > > > > +
> > > > > +	while (s < nr_subbufs && p < nr_pages) {
> > > > > +		struct page *page;
> > > > > +
> > > > > +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
> > > > > +			if (p >= nr_pages)
> > > > > +				break;
> > > > > +
> > > > > +			pages[p++] = page;
> > > > > +		}
> > > > > +		s++;
> > > > > +	}
> > > > > +
> > > > > +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> > > > 
> > > > I think Linus suggested it ("avoid all the sub-page ref-counts entirely
> > > > by using VM_PFNMAP, and use vm_insert_pages()"), but ...
> > > > vm_insert_pages() will:
> > > > * Mess with mapcounts
> > > > * Mess with refcounts
> > > > 
> > > > See
> > > > insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().
> > > > 
> > > > So we'll mess with the mapcount and refcount of the shared zeropage ...
> > > > hmmmm
> > > > 
> > > > If I am not wrong, vm_normal_page() would not return the shared zeropage
> > > > in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so
> > > > unmap()->...->zap_present_ptes() would not decrement the refcount and we
> > > > could overflow it. ... we also shouldn't ever mess with the mapcount of
> > > > the shared zeropage in the first place.
> > > > 
> > > > vm_insert_page() is clearer on that: "This allows drivers to insert
> > > > individual pages they've allocated into a user vma". You didn't allocate
> > > > the shared zeropage.
> > > > 
> > > > I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at
> > > > the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is
> > > > already set and it would set VM_MIXEDMAP ... similarly
> > > > vmf_insert_pfn_prot() would not be happy about that at all ...
> > > > 
> > > > BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
> > > > (VM_PFNMAP|VM_MIXEDMAP));
> > > > 
> > > > ... remap_pfn_range() is used by io_uring_mmap for a similar purpose.
> > > > But it only supports a single PFN range at a time and requires the
> > > > caller to handle refcounting of pages.
> > > > 
> > > > It's getting late in Germany, so I might be missing something; but using
> > > > the shared zeropage here might be a problem.
> > > > 
> > > 
> > > I was just about to write code to cleanly support vm_insert_pages() to
> > > consume zeropages ... when I started to wonder about something else:
> > 
> > Alternatively, we could for now allocate a meta-page of the same size as
> > the subbufs? (of course the downside is wasting a bit of memory)
> 
> Supporting the shared zeropage should be possible. We just have to be
> careful that no other code can accidentally set it writable. I'll have to
> further think about that (not just your user, but making it sane to use by
> new code as well).
> 
> So far, we primarily only have shared zeropages in the MAP_PRIVATE mappings.
> Dax is a corner case where we have them in MAP_SHARED mappings. PFNMAP would
> not be problematic, but MIXEDMAP is. (again, I am not 100% sure about
> combining both ...)
> 
> I'll further think about that one, and try crafting a reasonable patch.
> 
> Could we start with no shared zeropage and implement that as an optimization
> on top?

Of course, I'm happy to update this series without the zero-page and to bring it
back later, in a separate patch, as soon as vm_insert_pages() is compatible.

Thanks!
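
(Whatever shape the next revision takes, here is a sketch of how __rb_map_vma()
could avoid ZERO_PAGE() -- assuming a sub-buffer-sized meta-page as discussed
above, so vm_insert_pages() only ever sees pages this code allocated:)

	if (!pgoff) {
		unsigned long meta_off;

		/* Map the meta-page's own tail pages instead of ZERO_PAGE() */
		for (meta_off = 0; meta_off < subbuf_pages && p < nr_pages; meta_off++)
			pages[p++] = virt_to_page((void *)cpu_buffer->meta_page +
						  (meta_off << PAGE_SHIFT));
	}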

> 
> > 
> > > 
> > > 
> > > +	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
> > > +	    !(vma->vm_flags & VM_MAYSHARE))
> > > +		return -EPERM;
> > > +
> > > 
> > > You disallow writable mappings. But what about using mprotect() afterwards
> > > to allow for write permissions?
> > > 
> > > Likely you'd want to remove VM_MAYWRITE from the flags, otherwise mprotect()
> > > might later succeed.
> > 
> > The vm_flags_mod below is clearing VM_MAYWRITE already. Isn't it enough?
> 
> Oh, maybe I missed that!
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-22 20:31           ` Vincent Donnefort
@ 2024-04-23 15:18             ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2024-04-23 15:18 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On 22.04.24 22:31, Vincent Donnefort wrote:
> On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote:
>> On 22.04.24 20:20, Vincent Donnefort wrote:
>>> Hi David,
>>>
>>> Thanks for having a look, very much appreciated!
>>>
>>> On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:
>>>> On 19.04.24 20:25, David Hildenbrand wrote:
>>>>> On 06.04.24 19:36, Vincent Donnefort wrote:
>>>>>> In preparation for allowing the user-space to map a ring-buffer, add
>>>>>> a set of mapping functions:
>>>>>>
>>>>>>       ring_buffer_{map,unmap}()
>>>>>>
>>>>>> And controls on the ring-buffer:
>>>>>>
>>>>>>       ring_buffer_map_get_reader()  /* swap reader and head */
>>>>>>
>>>>>> Mapping the ring-buffer also involves:
>>>>>>
>>>>>>       A unique ID for each subbuf of the ring-buffer, currently they are
>>>>>>       only identified through their in-kernel VA.
>>>>>>
>>>>>>       A meta-page, where ring-buffer statistics and a
>>>>>>       description for the current reader are stored
>>>>>>
>>>>>> The linear mapping exposes the meta-page, and each subbuf of the
>>>>>> ring-buffer, ordered following their unique ID, assigned during the
>>>>>> first mapping.
>>>>>>
>>>>>> Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
>>>>>> size will remain unmodified and the splice enabling functions will in
>>>>>> reality simply memcpy the data instead of swapping subbufs.
>>>>>>
>>>>>> CC: <linux-mm@kvack.org>
>>>>>> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
>>>>>>
>>>>>> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
>>>>>> index dc5ae4e96aee..96d2140b471e 100644
>>>>>> --- a/include/linux/ring_buffer.h
>>>>>> +++ b/include/linux/ring_buffer.h
>>>>>> @@ -6,6 +6,8 @@
>>>>>>      #include <linux/seq_file.h>
>>>>>>      #include <linux/poll.h>
>>>>>> +#include <uapi/linux/trace_mmap.h>
>>>>>> +
>>>>>>      struct trace_buffer;
>>>>>>      struct ring_buffer_iter;
>>>>>> @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
>>>>>>      #define trace_rb_cpu_prepare	NULL
>>>>>>      #endif
>>>>>> +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
>>>>>> +		    struct vm_area_struct *vma);
>>>>>> +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
>>>>>> +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
>>>>>>      #endif /* _LINUX_RING_BUFFER_H */
>>>>>> diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..ffcd8dfcaa4f
>>>>>> --- /dev/null
>>>>>> +++ b/include/uapi/linux/trace_mmap.h
>>>>>> @@ -0,0 +1,46 @@
>>>>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>>>>> +#ifndef _TRACE_MMAP_H_
>>>>>> +#define _TRACE_MMAP_H_
>>>>>> +
>>>>>> +#include <linux/types.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * struct trace_buffer_meta - Ring-buffer Meta-page description
>>>>>> + * @meta_page_size:	Size of this meta-page.
>>>>>> + * @meta_struct_len:	Size of this structure.
>>>>>> + * @subbuf_size:	Size of each sub-buffer.
>>>>>> + * @nr_subbufs:		Number of subbufs in the ring-buffer, including the reader.
>>>>>> + * @reader.lost_events:	Number of events lost at the time of the reader swap.
>>>>>> + * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
>>>>>> + * @reader.read:	Number of bytes read on the reader subbuf.
>>>>>> + * @flags:		Placeholder for now, 0 until new features are supported.
>>>>>> + * @entries:		Number of entries in the ring-buffer.
>>>>>> + * @overrun:		Number of entries lost in the ring-buffer.
>>>>>> + * @read:		Number of entries that have been read.
>>>>>> + * @Reserved1:		Reserved for future use.
>>>>>> + * @Reserved2:		Reserved for future use.
>>>>>> + */
>>>>>> +struct trace_buffer_meta {
>>>>>> +	__u32		meta_page_size;
>>>>>> +	__u32		meta_struct_len;
>>>>>> +
>>>>>> +	__u32		subbuf_size;
>>>>>> +	__u32		nr_subbufs;
>>>>>> +
>>>>>> +	struct {
>>>>>> +		__u64	lost_events;
>>>>>> +		__u32	id;
>>>>>> +		__u32	read;
>>>>>> +	} reader;
>>>>>> +
>>>>>> +	__u64	flags;
>>>>>> +
>>>>>> +	__u64	entries;
>>>>>> +	__u64	overrun;
>>>>>> +	__u64	read;
>>>>>> +
>>>>>> +	__u64	Reserved1;
>>>>>> +	__u64	Reserved2;
>>>>>> +};
>>>>>> +
>>>>>> +#endif /* _TRACE_MMAP_H_ */
>>>>>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>>>>>> index cc9ebe593571..793ecc454039 100644
>>>>>> --- a/kernel/trace/ring_buffer.c
>>>>>> +++ b/kernel/trace/ring_buffer.c
>>>>>> @@ -9,6 +9,7 @@
>>>>>>      #include <linux/ring_buffer.h>
>>>>>>      #include <linux/trace_clock.h>
>>>>>>      #include <linux/sched/clock.h>
>>>>>> +#include <linux/cacheflush.h>
>>>>>>      #include <linux/trace_seq.h>
>>>>>>      #include <linux/spinlock.h>
>>>>>>      #include <linux/irq_work.h>
>>>>>> @@ -26,6 +27,7 @@
>>>>>>      #include <linux/list.h>
>>>>>>      #include <linux/cpu.h>
>>>>>>      #include <linux/oom.h>
>>>>>> +#include <linux/mm.h>
>>>>>>      #include <asm/local64.h>
>>>>>>      #include <asm/local.h>
>>>>>> @@ -338,6 +340,7 @@ struct buffer_page {
>>>>>>      	local_t		 entries;	/* entries on this page */
>>>>>>      	unsigned long	 real_end;	/* real end of data */
>>>>>>      	unsigned	 order;		/* order of the page */
>>>>>> +	u32		 id;		/* ID for external mapping */
>>>>>>      	struct buffer_data_page *page;	/* Actual data page */
>>>>>>      };
>>>>>> @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
>>>>>>      	u64				read_stamp;
>>>>>>      	/* pages removed since last reset */
>>>>>>      	unsigned long			pages_removed;
>>>>>> +
>>>>>> +	unsigned int			mapped;
>>>>>> +	struct mutex			mapping_lock;
>>>>>> +	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
>>>>>> +	struct trace_buffer_meta	*meta_page;
>>>>>> +
>>>>>>      	/* ring buffer pages to update, > 0 to add, < 0 to remove */
>>>>>>      	long				nr_pages_to_update;
>>>>>>      	struct list_head		new_pages; /* new pages to add */
>>>>>> @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
>>>>>>      	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
>>>>>>      	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>>>>>>      	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
>>>>>> +	mutex_init(&cpu_buffer->mapping_lock);
>>>>>>      	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>>>>>>      			    GFP_KERNEL, cpu_to_node(cpu));
>>>>>> @@ -1789,8 +1799,6 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer *buffer)
>>>>>>      	return buffer->time_stamp_abs;
>>>>>>      }
>>>>>> -static void rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer);
>>>>>> -
>>>>>>      static inline unsigned long rb_page_entries(struct buffer_page *bpage)
>>>>>>      {
>>>>>>      	return local_read(&bpage->entries) & RB_WRITE_MASK;
>>>>>> @@ -5211,6 +5219,22 @@ static void rb_clear_buffer_page(struct buffer_page *page)
>>>>>>      	page->read = 0;
>>>>>>      }
>>>>>> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>> +{
>>>>>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>>>>>> +
>>>>>> +	meta->reader.read = cpu_buffer->reader_page->read;
>>>>>> +	meta->reader.id = cpu_buffer->reader_page->id;
>>>>>> +	meta->reader.lost_events = cpu_buffer->lost_events;
>>>>>> +
>>>>>> +	meta->entries = local_read(&cpu_buffer->entries);
>>>>>> +	meta->overrun = local_read(&cpu_buffer->overrun);
>>>>>> +	meta->read = cpu_buffer->read;
>>>>>> +
>>>>>> +	/* Some archs do not have data cache coherency between kernel and user-space */
>>>>>> +	flush_dcache_folio(virt_to_folio(cpu_buffer->meta_page));
>>>>>> +}
>>>>>> +
>>>>>>      static void
>>>>>>      rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>>      {
>>>>>> @@ -5255,6 +5279,9 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>>      	cpu_buffer->lost_events = 0;
>>>>>>      	cpu_buffer->last_overrun = 0;
>>>>>> +	if (cpu_buffer->mapped)
>>>>>> +		rb_update_meta_page(cpu_buffer);
>>>>>> +
>>>>>>      	rb_head_page_activate(cpu_buffer);
>>>>>>      	cpu_buffer->pages_removed = 0;
>>>>>>      }
>>>>>> @@ -5469,6 +5496,12 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
>>>>>>      	cpu_buffer_a = buffer_a->buffers[cpu];
>>>>>>      	cpu_buffer_b = buffer_b->buffers[cpu];
>>>>>> +	/* It's up to the callers to not try to swap mapped buffers */
>>>>>> +	if (WARN_ON_ONCE(cpu_buffer_a->mapped || cpu_buffer_b->mapped)) {
>>>>>> +		ret = -EBUSY;
>>>>>> +		goto out;
>>>>>> +	}
>>>>>> +
>>>>>>      	/* At least make sure the two buffers are somewhat the same */
>>>>>>      	if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
>>>>>>      		goto out;
>>>>>> @@ -5733,7 +5766,8 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
>>>>>>      	 * Otherwise, we can simply swap the page with the one passed in.
>>>>>>      	 */
>>>>>>      	if (read || (len < (commit - read)) ||
>>>>>> -	    cpu_buffer->reader_page == cpu_buffer->commit_page) {
>>>>>> +	    cpu_buffer->reader_page == cpu_buffer->commit_page ||
>>>>>> +	    cpu_buffer->mapped) {
>>>>>>      		struct buffer_data_page *rpage = cpu_buffer->reader_page->page;
>>>>>>      		unsigned int rpos = read;
>>>>>>      		unsigned int pos = 0;
>>>>>> @@ -5956,6 +5990,11 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>>>>>      		cpu_buffer = buffer->buffers[cpu];
>>>>>> +		if (cpu_buffer->mapped) {
>>>>>> +			err = -EBUSY;
>>>>>> +			goto error;
>>>>>> +		}
>>>>>> +
>>>>>>      		/* Update the number of pages to match the new size */
>>>>>>      		nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
>>>>>>      		nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
>>>>>> @@ -6057,6 +6096,358 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
>>>>>>      }
>>>>>>      EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>>>>>> +static int rb_alloc_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>> +{
>>>>>> +	struct page *page;
>>>>>> +
>>>>>> +	if (cpu_buffer->meta_page)
>>>>>> +		return 0;
>>>>>> +
>>>>>> +	page = alloc_page(GFP_USER | __GFP_ZERO);
>>>>>> +	if (!page)
>>>>>> +		return -ENOMEM;
>>>>>> +
>>>>>> +	cpu_buffer->meta_page = page_to_virt(page);
>>>>>> +
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void rb_free_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>> +{
>>>>>> +	unsigned long addr = (unsigned long)cpu_buffer->meta_page;
>>>>>> +
>>>>>> +	free_page(addr);
>>>>>> +	cpu_buffer->meta_page = NULL;
>>>>>> +}
>>>>>> +
>>>>>> +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer,
>>>>>> +				   unsigned long *subbuf_ids)
>>>>>> +{
>>>>>> +	struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>>>>>> +	unsigned int nr_subbufs = cpu_buffer->nr_pages + 1;
>>>>>> +	struct buffer_page *first_subbuf, *subbuf;
>>>>>> +	int id = 0;
>>>>>> +
>>>>>> +	subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page;
>>>>>> +	cpu_buffer->reader_page->id = id++;
>>>>>> +
>>>>>> +	first_subbuf = subbuf = rb_set_head_page(cpu_buffer);
>>>>>> +	do {
>>>>>> +		if (WARN_ON(id >= nr_subbufs))
>>>>>> +			break;
>>>>>> +
>>>>>> +		subbuf_ids[id] = (unsigned long)subbuf->page;
>>>>>> +		subbuf->id = id;
>>>>>> +
>>>>>> +		rb_inc_page(&subbuf);
>>>>>> +		id++;
>>>>>> +	} while (subbuf != first_subbuf);
>>>>>> +
>>>>>> +	/* install subbuf ID to kern VA translation */
>>>>>> +	cpu_buffer->subbuf_ids = subbuf_ids;
>>>>>> +
>>>>>> +	/* __rb_map_vma() pads the meta-page to align it with the sub-buffers */
>>>>>> +	meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order;
>>>>>> +	meta->meta_struct_len = sizeof(*meta);
>>>>>> +	meta->nr_subbufs = nr_subbufs;
>>>>>> +	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
>>>>>> +
>>>>>> +	rb_update_meta_page(cpu_buffer);
>>>>>> +}
>>>>>> +
>>>>>> +static struct ring_buffer_per_cpu *
>>>>>> +rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
>>>>>> +{
>>>>>> +	struct ring_buffer_per_cpu *cpu_buffer;
>>>>>> +
>>>>>> +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
>>>>>> +		return ERR_PTR(-EINVAL);
>>>>>> +
>>>>>> +	cpu_buffer = buffer->buffers[cpu];
>>>>>> +
>>>>>> +	mutex_lock(&cpu_buffer->mapping_lock);
>>>>>> +
>>>>>> +	if (!cpu_buffer->mapped) {
>>>>>> +		mutex_unlock(&cpu_buffer->mapping_lock);
>>>>>> +		return ERR_PTR(-ENODEV);
>>>>>> +	}
>>>>>> +
>>>>>> +	return cpu_buffer;
>>>>>> +}
>>>>>> +
>>>>>> +static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
>>>>>> +{
>>>>>> +	mutex_unlock(&cpu_buffer->mapping_lock);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
>>>>>> + * to be set-up or torn-down.
>>>>>> + */
>>>>>> +static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
>>>>>> +			       bool inc)
>>>>>> +{
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>>>>>> +
>>>>>> +	if (inc && cpu_buffer->mapped == UINT_MAX)
>>>>>> +		return -EBUSY;
>>>>>> +
>>>>>> +	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	mutex_lock(&cpu_buffer->buffer->mutex);
>>>>>> +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>>>>>> +
>>>>>> +	if (inc)
>>>>>> +		cpu_buffer->mapped++;
>>>>>> +	else
>>>>>> +		cpu_buffer->mapped--;
>>>>>> +
>>>>>> +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>>>>>> +	mutex_unlock(&cpu_buffer->buffer->mutex);
>>>>>> +
>>>>>> +	return 0;
>>>>>> +}
>>>>>> +
>>>>>> +#define subbuf_page(off, start) \
>>>>>> +	virt_to_page((void *)((start) + ((off) << PAGE_SHIFT)))
>>>>>> +
>>>>>> +#define foreach_subbuf_page(sub_order, start, page)		\
>>>>>> +	page = subbuf_page(0, (start));				\
>>>>>> +	for (int __off = 0; __off < (1 << (sub_order));		\
>>>>>> +	     __off++, page = subbuf_page(__off, (start)))
>>>>>> +
>>>>>> +/*
>>>>>> + *   +--------------+  pgoff == 0
>>>>>> + *   |   meta page  |
>>>>>> + *   +--------------+  pgoff == 1
>>>>>> + *   |   000000000  |
>>>>>> + *   +--------------+  pgoff == (1 << subbuf_order)
>>>>>> + *   | subbuffer 0  |
>>>>>> + *   |              |
>>>>>> + *   +--------------+  pgoff == (2 * (1 << subbuf_order))
>>>>>> + *   | subbuffer 1  |
>>>>>> + *   |              |
>>>>>> + *         ...
>>>>>> + */
>>>>>> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
>>>>>> +			struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
>>>>>> +	unsigned int subbuf_pages, subbuf_order;
>>>>>> +	struct page **pages;
>>>>>> +	int p = 0, s = 0;
>>>>>> +	int err;
>>>>>> +
>>>>>> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
>>>>>> +
>>>>>> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
>>>>>> +	subbuf_pages = 1 << subbuf_order;
>>>>>> +
>>>>>> +	if (subbuf_order && pgoff % subbuf_pages)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	nr_subbufs = cpu_buffer->nr_pages + 1;
>>>>>> +	nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff;
>>>>>> +
>>>>>> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>> +	if (!vma_pages || vma_pages > nr_pages)
>>>>>> +		return -EINVAL;
>>>>>> +
>>>>>> +	nr_pages = vma_pages;
>>>>>> +
>>>>>> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
>>>>>> +	if (!pages)
>>>>>> +		return -ENOMEM;
>>>>>> +
>>>>>> +	if (!pgoff) {
>>>>>> +		unsigned long meta_page_padding;
>>>>>> +
>>>>>> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * Pad with the zero-page to align the meta-page with the
>>>>>> +		 * sub-buffers.
>>>>>> +		 */
>>>>>> +		meta_page_padding = subbuf_pages - 1;
>>>>>> +		while (meta_page_padding-- && p < nr_pages)
>>>>>> +			pages[p++] = ZERO_PAGE(0);
>>>>>
>>>>> Using the shared zeropage in a MAP_SHARED mapping that is neither VM_IO
>>>>> nor VM_PFNMAP can be problematic. So we really need patch #3 logic to
>>>>> use VM_PFNMAP.
>>>>>
>>>>> It would be cleaner/more obvious if these VMA flag settings would reside
>>>
>>> tracing_buffers_mmap() sets both VM_IO and VM_PFNMAP. But yeah, as the vma is
>>> already passed to ring_buffer_map(), the flags could be set here, as this
>>> function is what really sets the requirements.
>>>
>>>>> here, if possible.
>>>>>
>>>>>> +	} else {
>>>>>> +		/* Skip the meta-page */
>>>>>> +		pgoff -= subbuf_pages;
>>>>>> +
>>>>>> +		s += pgoff / subbuf_pages;
>>>>>> +	}
>>>>>> +
>>>>>> +	while (s < nr_subbufs && p < nr_pages) {
>>>>>> +		struct page *page;
>>>>>> +
>>>>>> +		foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) {
>>>>>> +			if (p >= nr_pages)
>>>>>> +				break;
>>>>>> +
>>>>>> +			pages[p++] = page;
>>>>>> +		}
>>>>>> +		s++;
>>>>>> +	}
>>>>>> +
>>>>>> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
>>>>>
>>>>> I think Linus suggested it ("avoid all the sub-page ref-counts entirely
>>>>> by using VM_PFNMAP, and use vm_insert_pages()"), but ...
>>>>> vm_insert_pages() will:
>>>>> * Mess with mapcounts
>>>>> * Mess with refcounts
>>>>>
>>>>> See
>>>>> insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().
>>>>>
>>>>> So we'll mess with the mapcount and refcount of the shared zeropage ...
>>>>> hmmmm
>>>>>
>>>>> If I am not wrong, vm_normal_page() would not return the shared zeropage
>>>>> in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so
>>>>> unmap()->...->zap_present_ptes() would not decrement the refcount and we
>>>>> could overflow it. ... we also shouldn't ever mess with the mapcount of
>>>>> the shared zeropage in the first place.
>>>>>
>>>>> vm_insert_page() is clearer on that: "This allows drivers to insert
>>>>> individual pages they've allocated into a user vma". You didn't allocate
>>>>> the shared zeropage.
>>>>>
>>>>> I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at
>>>>> the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is
>>>>> already set and it would set VM_MIXEDMAP ... similarly
>>>>> vmf_insert_pfn_prot() would not be happy about that at all ...
>>>>>
>>>>> BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
>>>>> (VM_PFNMAP|VM_MIXEDMAP));
>>>>>
>>>>> ... remap_pfn_range() is used by io_uring_mmap for a similar purpose.
>>>>> But it only supports a single PFN range at a time and requires the
>>>>> caller to handle refcounting of pages.
>>>>>
>>>>> It's getting late in Germany, so I might be missing something; but using
>>>>> the shared zeropage here might be a problem.
>>>>>
>>>>
>>>> I was just about to write code to cleanly support vm_insert_pages() to
>>>> consume zeropages ... when I started to wonder about something else:
>>>
>>> Alternatively, we could for now allocate a meta-page of the same size as
>>> the subbufs? (of course the downside is wasting a bit of memory)
>>
>> Supporting the shared zeropage should be possible. We just have to be
>> careful that no other code can accidentally set it writable. I'll have to
>> further think about that (not just your user, but making it sane to use by
>> new code as well).
>>
>> So far, we primarily only have shared zeropages in the MAP_PRIVATE mappings.
>> Dax is a corner case where we have them in MAP_SHARED mappings. PFNMAP would
>> not be problematic, but MIXEDMAP is. (again, I am not 100% sure about
>> combining both ...)
>>
>> I'll further think about that one, and try crafting a reasonable patch.
>>
>> Could we start with no shared zeropage and implement that as an optimization
>> on top?
> 
> Of course, I'm happy to update this series without the zero-page and to bring it
> back later, in a separate patch, as soon as vm_insert_pages() is compatible.
> 

Good, let's explore that path first!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-10 17:43   ` Steven Rostedt
@ 2024-04-23 16:04     ` Liam R. Howlett
  2024-04-28 21:24       ` Steven Rostedt
  0 siblings, 1 reply; 23+ messages in thread
From: Liam R. Howlett @ 2024-04-23 16:04 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vincent Donnefort, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

* Steven Rostedt <rostedt@goodmis.org> [240410 13:41]:
> On Sat,  6 Apr 2024 18:36:46 +0100
> Vincent Donnefort <vdonnefort@google.com> wrote:
> 
> > +int ring_buffer_map(struct trace_buffer *buffer, int cpu,
> > +		    struct vm_area_struct *vma)
> > +{
> > +	struct ring_buffer_per_cpu *cpu_buffer;
> > +	unsigned long flags, *subbuf_ids;
> > +	int err = 0;
> > +
> > +	if (!cpumask_test_cpu(cpu, buffer->cpumask))
> > +		return -EINVAL;
> > +
> > +	cpu_buffer = buffer->buffers[cpu];
> > +
> > +	mutex_lock(&cpu_buffer->mapping_lock);
> > +
> > +	if (cpu_buffer->mapped) {
> > +		err = __rb_map_vma(cpu_buffer, vma);
> > +		if (!err)
> > +			err = __rb_inc_dec_mapped(cpu_buffer, true);
> > +		mutex_unlock(&cpu_buffer->mapping_lock);
> > +		return err;
> > +	}
> > +
> > +	/* prevent another thread from changing buffer/sub-buffer sizes */
> > +	mutex_lock(&buffer->mutex);
> > +
> > +	err = rb_alloc_meta_page(cpu_buffer);
> > +	if (err)
> > +		goto unlock;
> > +
> > +	/* subbuf_ids include the reader while nr_pages does not */
> > +	subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), GFP_KERNEL);
> > +	if (!subbuf_ids) {
> > +		rb_free_meta_page(cpu_buffer);
> > +		err = -ENOMEM;
> > +		goto unlock;
> > +	}
> > +
> > +	atomic_inc(&cpu_buffer->resize_disabled);
> > +
> > +	/*
> > +	 * Lock all readers to block any subbuf swap until the subbuf IDs are
> > +	 * assigned.
> > +	 */
> > +	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > +	rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
> > +	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > +
> > +	err = __rb_map_vma(cpu_buffer, vma);
> > +	if (!err) {
> > +		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > +		cpu_buffer->mapped = 1;
> > +		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > +	} else {
> > +		kfree(cpu_buffer->subbuf_ids);
> > +		cpu_buffer->subbuf_ids = NULL;
> > +		rb_free_meta_page(cpu_buffer);
> > +	}
> > +unlock:
> 
> Nit: For all labels, please add a space before them. Otherwise, diffs will
> show "unlock" as the function and not "ring_buffer_map", making it harder
> to find where the change is.
> 

Isn't the inclusion of a space before labels outdated since 'git diff'
supersedes 'diff' and fixes this issue?

Thanks,
Liam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
  2024-04-23 16:04     ` Liam R. Howlett
@ 2024-04-28 21:24       ` Steven Rostedt
  0 siblings, 0 replies; 23+ messages in thread
From: Steven Rostedt @ 2024-04-28 21:24 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Vincent Donnefort, mhiramat, linux-kernel, linux-trace-kernel,
	mathieu.desnoyers, kernel-team, rdunlap, linux-mm

On Tue, 23 Apr 2024 12:04:15 -0400
"Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> > Nit: For all labels, please add a space before them. Otherwise, diffs will
> > show "unlock" as the function and not "ring_buffer_map", making it harder
> > to find where the change is.
> >   
> 
> Isn't the inclusion of a space before labels outdated since 'git diff'
> supersedes 'diff' and fixes this issue?

I still use patch and quilt ;-) and that still suffers from that on
reject files (used for backporting and such).
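
(Concretely, the convention being discussed is a single space before the
label, so that diff/patch hunk headers keep showing the enclosing function;
the label body below is only illustrative:)

 unlock:
	mutex_unlock(&buffer->mutex);
	mutex_unlock(&cpu_buffer->mapping_lock);

	return err;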

-- Steve

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-04-28 21:24 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-06 17:36 [PATCH v20 0/5] Introducing trace buffer mapping by user-space Vincent Donnefort
2024-04-06 17:36 ` [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP Vincent Donnefort
2024-04-10 17:36   ` Steven Rostedt
2024-04-06 17:36 ` [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions Vincent Donnefort
2024-04-09 20:12   ` kernel test robot
2024-04-10 17:43   ` Steven Rostedt
2024-04-23 16:04     ` Liam R. Howlett
2024-04-28 21:24       ` Steven Rostedt
2024-04-18  6:55   ` Mike Rapoport
2024-04-19  3:43     ` Steven Rostedt
2024-04-22 18:01       ` Vincent Donnefort
2024-04-19 18:25   ` David Hildenbrand
2024-04-22  9:27     ` David Hildenbrand
2024-04-22 18:20       ` Vincent Donnefort
2024-04-22 18:27         ` David Hildenbrand
2024-04-22 20:31           ` Vincent Donnefort
2024-04-23 15:18             ` David Hildenbrand
2024-04-06 17:36 ` [PATCH v20 3/5] tracing: Allow user-space mapping of the ring-buffer Vincent Donnefort
2024-04-06 17:36 ` [PATCH v20 4/5] Documentation: tracing: Add ring-buffer mapping Vincent Donnefort
2024-04-06 17:36 ` [PATCH v20 5/5] ring-buffer/selftest: Add ring-buffer mapping test Vincent Donnefort
2024-04-06 20:46   ` Muhammad Usama Anjum
2024-04-10 17:56 ` [PATCH v20 0/5] Introducing trace buffer mapping by user-space Steven Rostedt
2024-04-17  4:55   ` Mike Rapoport
