* [PATCH bpf-next v1 0/8]  Introduce BPF iterators for io_uring and epoll
@ 2021-11-16  5:42 Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

 The CRIU [0] project developers are exploring potential uses of the BPF
 subsystem to do complicated tasks that are difficult to support in the
 kernel using existing interfaces. Even when such tasks are implemented
 using procfs or kcmp, it is difficult to make them perform well without
 some kind of programmable introspection into kernel data structures.
 Moreover, for procfs based state inspection, the output format, once
 agreed upon, is set in stone and hard to extend, and at the same time
 inefficient to consume from programs (state is first converted from
 machine readable form to human readable form, only to be converted back
 again to machine readable form). In addition, the kcmp based file set
 matching algorithm performs poorly, since each file in one set needs to
 be compared to each file in the other set to determine struct file
 equivalence.

 This set adds an io_uring file iterator (for registered files), an
 io_uring ubuf iterator (for registered buffers), and an epoll iterator
 (for registered items, i.e. files registered using EPOLL_CTL_ADD) to
 overcome these limitations. Combined with the existing task, task_file,
 and task_vma iterators, these can be used to significantly enhance and
 speed up the task dumping procedure, as sketched below.
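
 As an illustration of such pairing, the sketch below first records each
 struct file in a task's fdtable using the existing task_file iterator,
 then resolves io_uring's registered files against that set in a single
 pass. This is only a sketch assuming this series is applied; the map
 layout and program names are ours, not part of any patch:

   #include "bpf_iter.h"
   #include <bpf/bpf_helpers.h>

   struct {
   	__uint(type, BPF_MAP_TYPE_HASH);
   	__uint(max_entries, 16384);
   	__type(key, __u64);   /* struct file * */
   	__type(value, __u32); /* fd number in the task */
   } file_to_fd SEC(".maps");

   /* Pass 1: record every struct file in the task's fdtable */
   SEC("iter/task_file")
   int record_task_file(struct bpf_iter__task_file *ctx)
   {
   	__u64 key = (__u64)ctx->file;
   	__u32 fd = ctx->fd;

   	if (!ctx->file)
   		return 0;
   	bpf_map_update_elem(&file_to_fd, &key, &fd, BPF_ANY);
   	return 0;
   }

   /* Pass 2: match io_uring's registered files against the recorded set */
   SEC("iter/io_uring_file")
   int match_io_uring_file(struct bpf_iter__io_uring_file *ctx)
   {
   	__u64 key = (__u64)ctx->file;
   	__u32 *fd;

   	if (!ctx->file)
   		return 0;
   	fd = bpf_map_lookup_elem(&file_to_fd, &key);
   	if (fd)
   		BPF_SEQ_PRINTF(ctx->meta->seq, "index %lu -> task fd %u\n",
   			       (unsigned long)ctx->index, *fd);
   	return 0;
   }

   char _license[] SEC("license") = "GPL";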

 The two immediate use cases are io_uring checkpoint/restore support and
 epoll checkpoint/restore support. The first is currently unimplemented,
 and the second is being expedited using the new epoll iterator. In the
 future, more stages of the checkpointing sequence can be offloaded to
 eBPF programs to reduce process downtime, e.g. in the pre-dump stage,
 before the task is seized.

 The io_uring file iterator is even more important now due to the advent
 of descriptorless files in io_uring [1], which makes dumping a task's
 files much harder for CRIU, since there is no visibility into these
 hidden descriptors that the task depends upon for operation. Similarly,
 the io_uring buf iterator is useful in case the original VMA used to
 register a buffer has been destroyed.

 Please see the individual patches for more details.

   [0]: https://criu.org/Main_Page
   [1]: https://lwn.net/Articles/863071

Kumar Kartikeya Dwivedi (8):
  io_uring: Implement eBPF iterator for registered buffers
  bpf: Add bpf_page_to_pfn helper
  io_uring: Implement eBPF iterator for registered files
  epoll: Implement eBPF iterator for registered items
  selftests/bpf: Add test for io_uring BPF iterators
  selftests/bpf: Add test for epoll BPF iterator
  selftests/bpf: Test partial reads for io_uring, epoll iterators
  selftests/bpf: Fix btf_dump test for bpf_iter_link_info

 fs/eventpoll.c                                | 196 +++++++++-
 fs/io_uring.c                                 | 334 ++++++++++++++++
 include/linux/bpf.h                           |   6 +
 include/uapi/linux/bpf.h                      |  15 +
 kernel/trace/bpf_trace.c                      |   2 +
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |  15 +
 .../selftests/bpf/prog_tests/bpf_iter.c       | 362 +++++++++++++++++-
 .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
 .../selftests/bpf/progs/bpf_iter_epoll.c      |  33 ++
 .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 +++
 11 files changed, 1015 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c

-- 
2.33.1



* [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-18 17:21   ` Yonghong Song
  2021-11-18 22:02   ` Alexei Starovoitov
  2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

This change adds an eBPF iterator for buffers registered in an io_uring
ctx. It gives access to the ctx, the index of the registered buffer, and
a pointer to the io_mapped_ubuf itself. This allows the iterator to save
info related to buffers added to an io_uring instance that isn't easy to
export using the fdinfo interface (like the exact struct pages composing
the registered buffer).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start function, as the lock ordering is the opposite of what it
would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

      CPU 0				CPU 1

      vfs_read
      mutex_lock(&seq_file->lock)	io_read
					mutex_lock(&ctx->uring_lock)
      mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
					mutex_lock(&seq_file->lock)

The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (the same ctx), where the order of locks
would be:
 io_uring_enter
  mutex_lock(&ctx->uring_lock) <-----------.
  io_read				    \
   seq_read				     \
    mutex_lock(&seq_file->lock)		     /
    mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.
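
For the contended uring_lock case, a userspace reader can treat
-EDEADLK as a transient condition and simply retry. A minimal sketch
(variable names assumed, not part of this patch):

      for (;;) {
              ret = read(iter_fd, buf, sizeof(buf));
              if (ret < 0 && errno == EDEADLK) {
                      /* ctx->uring_lock was contended; back off, retry */
                      usleep(100);
                      continue;
              }
              break;
      }

The recursive read case cannot succeed by retrying, since the reading
task itself already holds ctx->uring_lock.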

In the future, this iterator will be extended to directly support
iteration of the bvec flexible array member, so that when there is no
corresponding VMA that maps to the registered buffer (e.g. if the VMA is
destroyed after pinning pages), we are able to reconstruct the
registration on restore by dumping the page contents and then replaying
them into a temporary mapping used for registration later. All of this
is out of scope for the current series, but builds upon this iterator.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   3 +
 tools/include/uapi/linux/bpf.h |   3 +
 4 files changed, 187 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b07196b4511c..46a110989155 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -81,6 +81,7 @@
 #include <linux/tracehook.h>
 #include <linux/audit.h>
 #include <linux/security.h>
+#include <linux/btf_ids.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
 	return 0;
 };
 __initcall(io_uring_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_io_uring_ids)
+BTF_ID(struct, io_ring_ctx)
+BTF_ID(struct, io_mapped_ubuf)
+
+struct bpf_io_uring_seq_info {
+	struct io_ring_ctx *ctx;
+	unsigned long index;
+};
+
+static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_io_uring_seq_info *info = priv_data;
+	struct io_ring_ctx *ctx = aux->ctx;
+
+	info->ctx = ctx;
+	return 0;
+}
+
+static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
+				    union bpf_iter_link_info *linfo,
+				    struct bpf_iter_aux_info *aux)
+{
+	struct io_ring_ctx *ctx;
+	struct fd f;
+	int ret;
+
+	f = fdget(linfo->io_uring.io_uring_fd);
+	if (unlikely(!f.file))
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(f.file->f_op != &io_uring_fops))
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
+		goto out_fput;
+
+	ret = 0;
+	aux->ctx = ctx;
+
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	percpu_ref_put(&aux->ctx->refs);
+}
+
+/* io_uring iterator for registered buffers */
+
+struct bpf_iter__io_uring_buf {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
+	unsigned long index;
+};
+
+static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	if (info->index < info->ctx->nr_user_bufs)
+		return info->ctx->user_bufs[info->index++];
+	return NULL;
+}
+
+static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct io_mapped_ubuf *ubuf;
+
+	/* Indicate to userspace that the uring lock is contended */
+	if (!mutex_trylock(&info->ctx->uring_lock))
+		return ERR_PTR(-EDEADLK);
+
+	ubuf = __bpf_io_uring_buf_seq_get_next(info);
+	if (!ubuf)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return ubuf;
+}
+
+static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	++*pos;
+	return __bpf_io_uring_buf_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
+		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
+		     unsigned long index)
+
+static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct bpf_iter__io_uring_buf ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ctx = info->ctx;
+	ctx.ubuf = v;
+	ctx.index = info->index ? info->index - !in_stop : 0;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_io_uring_buf_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+	if (IS_ERR(v))
+		return;
+	if (!v)
+		__bpf_io_uring_buf_seq_show(seq, v, true);
+	else if (info->index) /* restart from index */
+		info->index--;
+	mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_buf_seq_ops = {
+	.start = bpf_io_uring_buf_seq_start,
+	.next  = bpf_io_uring_buf_seq_next,
+	.stop  = bpf_io_uring_buf_seq_stop,
+	.show  = bpf_io_uring_buf_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
+	.seq_ops          = &bpf_io_uring_buf_seq_ops,
+	.init_seq_private = bpf_io_uring_init_seq,
+	.fini_seq_private = NULL,
+	.seq_priv_size    = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+	.target            = "io_uring_buf",
+	.feature	   = BPF_ITER_RESCHED,
+	.attach_target     = bpf_io_uring_iter_attach,
+	.detach_target     = bpf_io_uring_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_io_uring_buf_seq_info,
+};
+
+static int __init io_uring_iter_init(void)
+{
+	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 56098c866704..ddb9d4520a3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 	extern int bpf_iter_ ## target(args);			\
 	int __init bpf_iter_ ## target(args) { return 0; }
 
+struct io_ring_ctx;
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct io_ring_ctx *ctx;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
-- 
2.33.1



* [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-17 12:35     ` kernel test robot
                     ` (2 more replies)
  2021-11-16  5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

In CRIU, we need to be able to determine whether the page pinned by
io_uring is still present at the same range in the process VMA.
/proc/<pid>/pagemap gives us the PFN, hence using this helper we can
easily establish this mapping from the iterator side.

It is a simple wrapper over the in-kernel page_to_pfn helper, and
ensures the passed-in pointer is a struct page PTR_TO_BTF_ID. For the
CRIU use case, this pointer is obtained from the bvec of an
io_mapped_ubuf.
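
As a rough userspace sketch (the helper name and error handling are
ours, not part of this patch), a checkpointing tool could compare the
PFN emitted by the iterator with the one /proc/<pid>/pagemap reports
for the buffer's virtual address:

	/* Return 1 if the page at vaddr still has PFN iter_pfn, else 0 */
	static int pfn_matches(int pagemap_fd, unsigned long vaddr,
			       unsigned long long iter_pfn)
	{
		long psize = sysconf(_SC_PAGESIZE);
		unsigned long long ent;

		if (pread(pagemap_fd, &ent, sizeof(ent),
			  (vaddr / psize) * sizeof(ent)) != sizeof(ent))
			return 0;
		/* Bits 0-54 hold the PFN of a present, non-swapped page */
		return (ent & ((1ULL << 55) - 1)) == iter_pfn;
	}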

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 fs/io_uring.c                  | 17 +++++++++++++++++
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       |  9 +++++++++
 kernel/trace/bpf_trace.c       |  2 ++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h |  9 +++++++++
 6 files changed, 40 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 46a110989155..9e9df6767e29 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11295,6 +11295,23 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
 	.seq_info	   = &bpf_io_uring_buf_seq_info,
 };
 
+BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
+{
+	/* PTR_TO_BTF_ID can be NULL */
+	if (!page)
+		return U64_MAX;
+	return page_to_pfn(page);
+}
+
+BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
+
+const struct bpf_func_proto bpf_page_to_pfn_proto = {
+	.func		= bpf_page_to_pfn,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
+};
+
 static int __init io_uring_iter_init(void)
 {
 	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ddb9d4520a3f..fe7b499da781 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2166,6 +2166,7 @@ extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
 extern const struct bpf_func_proto bpf_find_vma_proto;
+extern const struct bpf_func_proto bpf_page_to_pfn_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
   enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3323defa99a1..b70e9da3d722 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5143,6 +5151,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock),		\
 	FN(kallsyms_lookup_name),	\
 	FN(find_vma),			\
+	FN(page_to_pfn),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 25ea521fb8f1..f68a8433be1a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1212,6 +1212,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_find_vma_proto;
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
+	case BPF_FUNC_page_to_pfn:
+		return &bpf_page_to_pfn_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index a6403ddf5de7..ae68ca794980 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -549,6 +549,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct page',
     ]
     known_types = {
             '...',
@@ -598,6 +599,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct page',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3323defa99a1..b70e9da3d722 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4960,6 +4960,14 @@ union bpf_attr {
  *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
  *		**-EBUSY** if failed to try lock mmap_lock.
  *		**-EINVAL** for invalid **flags**.
+ *
+ * long bpf_page_to_pfn(struct page *page)
+ *	Description
+ *		Obtain the page frame number (PFN) for the given *struct page*
+ *		pointer.
+ *	Return
+ *		Page Frame Number corresponding to the page pointed to by the
+ *		*struct page* pointer, or U64_MAX if pointer is NULL.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5143,6 +5151,7 @@ union bpf_attr {
 	FN(skc_to_unix_sock),		\
 	FN(kallsyms_lookup_name),	\
 	FN(find_vma),			\
+	FN(page_to_pfn),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.33.1



* [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-18 17:33   ` Yonghong Song
  2021-11-16  5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

This change adds an eBPF iterator for files registered in an io_uring
ctx. It gives access to the ctx, the index of the registered file, and a
pointer to the struct file itself. This allows the iterator to save info
related to files added to an io_uring instance that isn't easy to export
using the fdinfo interface (like being able to match registered files to
a task's file set). Getting access to the underlying struct file allows
deduplication and efficient pairing with the task's file set (obtained
using the task_file iterator).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start function, as the lock ordering is the opposite of what it
would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

      CPU 0                             CPU 1

      vfs_read
      mutex_lock(&seq_file->lock)       io_read
					mutex_lock(&ctx->uring_lock)
      mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
					mutex_lock(&seq_file->lock)

The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (the same ctx), where the order of locks
would be:
 io_uring_enter
  mutex_lock(&ctx->uring_lock) <-----------.
  io_read                                   \
   seq_read                                  \
    mutex_lock(&seq_file->lock)              /
    mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.

With the advent of descriptorless files supported by io_uring, this
iterator provides the required visibility and introspection of the
io_uring instance for the purposes of dumping and restoring it.

In the future, this iterator will be extended to support direct
inspection of a lot of file state (currently, descriptorless files are
limited to those obtained using openat2 and socket) to dump file state
for these hidden files. Later, we can explore filling in the gaps for
dumping file state for more file types (those not hidden in the
io_uring ctx). All of this is out of scope for the current series, but
builds upon this iterator.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 fs/io_uring.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 139 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9e9df6767e29..7ac479c95d4e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11132,6 +11132,7 @@ __initcall(io_uring_init);
 BTF_ID_LIST(btf_io_uring_ids)
 BTF_ID(struct, io_ring_ctx)
 BTF_ID(struct, io_mapped_ubuf)
+BTF_ID(struct, file)
 
 struct bpf_io_uring_seq_info {
 	struct io_ring_ctx *ctx;
@@ -11312,11 +11313,148 @@ const struct bpf_func_proto bpf_page_to_pfn_proto = {
 	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
 };
 
+/* io_uring iterator for registered files */
+
+struct bpf_iter__io_uring_file {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct io_ring_ctx *, ctx);
+	__bpf_md_ptr(struct file *, file);
+	unsigned long index;
+};
+
+static void *__bpf_io_uring_file_seq_get_next(struct bpf_io_uring_seq_info *info)
+{
+	struct file *file = NULL;
+
+	if (info->index < info->ctx->nr_user_files) {
+		/* file set can be sparse */
+		file = io_file_from_index(info->ctx, info->index++);
+		/* use info as a distinct pointer to distinguish between empty
+		 * slot and valid file, since we cannot return NULL for this
+		 * case if we want iter prog to still be invoked with file ==
+		 * NULL.
+		 */
+		if (!file)
+			return info;
+	}
+
+	return file;
+}
+
+static void *bpf_io_uring_file_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct file *file;
+
+	/* Indicate to userspace that the uring lock is contended */
+	if (!mutex_trylock(&info->ctx->uring_lock))
+		return ERR_PTR(-EDEADLK);
+
+	file = __bpf_io_uring_file_seq_get_next(info);
+	if (!file)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return file;
+}
+
+static void *bpf_io_uring_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	++*pos;
+	return __bpf_io_uring_file_seq_get_next(info);
+}
+
+DEFINE_BPF_ITER_FUNC(io_uring_file, struct bpf_iter_meta *meta,
+		     struct io_ring_ctx *ctx, struct file *file,
+		     unsigned long index)
+
+static int __bpf_io_uring_file_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+	struct bpf_iter__io_uring_file ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ctx = info->ctx;
+	/* when we encounter empty slot, v will point to info */
+	ctx.file = v == info ? NULL : v;
+	ctx.index = info->index ? info->index - !in_stop : 0;
+
+	return bpf_iter_run_prog(prog, &ctx);
+}
+
+static int bpf_io_uring_file_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_io_uring_file_seq_show(seq, v, false);
+}
+
+static void bpf_io_uring_file_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_io_uring_seq_info *info = seq->private;
+
+	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
+	if (IS_ERR(v))
+		return;
+	if (!v)
+		__bpf_io_uring_file_seq_show(seq, v, true);
+	else if (info->index) /* restart from index */
+		info->index--;
+	mutex_unlock(&info->ctx->uring_lock);
+}
+
+static const struct seq_operations bpf_io_uring_file_seq_ops = {
+	.start = bpf_io_uring_file_seq_start,
+	.next  = bpf_io_uring_file_seq_next,
+	.stop  = bpf_io_uring_file_seq_stop,
+	.show  = bpf_io_uring_file_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_io_uring_file_seq_info = {
+	.seq_ops          = &bpf_io_uring_file_seq_ops,
+	.init_seq_private = bpf_io_uring_init_seq,
+	.fini_seq_private = NULL,
+	.seq_priv_size    = sizeof(struct bpf_io_uring_seq_info),
+};
+
+static struct bpf_iter_reg io_uring_file_reg_info = {
+	.target            = "io_uring_file",
+	.feature           = BPF_ITER_RESCHED,
+	.attach_target     = bpf_io_uring_iter_attach,
+	.detach_target     = bpf_io_uring_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_file, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_file, file),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_io_uring_file_seq_info,
+};
+
 static int __init io_uring_iter_init(void)
 {
+	int ret;
+
 	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
 	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
-	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+	io_uring_file_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_file_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[2];
+	ret = bpf_iter_reg_target(&io_uring_buf_reg_info);
+	if (ret)
+		return ret;
+	ret = bpf_iter_reg_target(&io_uring_file_reg_info);
+	if (ret)
+		bpf_iter_unreg_target(&io_uring_buf_reg_info);
+	return ret;
 }
 late_initcall(io_uring_iter_init);
 
-- 
2.33.1



* [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
                   ` (2 preceding siblings ...)
  2021-11-16  5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-18 17:50   ` Yonghong Song
  2021-11-16  5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexander Viro, linux-fsdevel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, io-uring

This patch adds an eBPF iterator for epoll items (epitems) registered in
an epoll instance. It gives access to the eventpoll ctx and the
registered epoll item (struct epitem). This allows the iterator to
inspect the registered file and to use other iterators to associate it
with a task's fdtable.

The primary use case this enables is expediting existing eventpoll
checkpoint/restore support in the CRIU project. This iterator allows us
to switch from a worst case O(n^2) algorithm to a single O(n) pass over
the task's and the epoll instance's registered descriptors, as sketched
below.
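
An illustrative fragment of such a pass (the file_to_fd map is assumed
to be filled by a prior task_file iteration, similar to the sketch in
the cover letter; the names are ours, not part of this patch):

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 16384);
		__type(key, __u64);   /* struct file * */
		__type(value, __u32); /* fd number in the task */
	} file_to_fd SEC(".maps");

	SEC("iter/epoll")
	int pair_epitem(struct bpf_iter__epoll *ctx)
	{
		struct epitem *epi = ctx->epi;
		__u64 key;
		__u32 *fd;

		if (!epi)
			return 0;
		key = (__u64)epi->ffd.file;
		fd = bpf_map_lookup_elem(&file_to_fd, &key);
		if (fd)
			BPF_SEQ_PRINTF(ctx->meta->seq,
				       "epoll fd %d -> task fd %u\n",
				       epi->ffd.fd, *fd);
		return 0;
	}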

We also make sure we're iterating over a live file, one that is not
going away. The case we're concerned about is a file that has its
f_count at zero, but is waiting for the iterator's bpf_seq_read to
release ep->mtx, so that it can remove its epitem. Since such a file
will disappear once iteration is done, and it is being destructed, we
use get_file_rcu to ensure it is alive when invoking the BPF program.

Getting access to a file that is going to disappear after iteration is
not useful anyway. This does have a performance overhead, however,
since a file reference will be raised and dropped for each file.

The rcu_read_lock around get_file_rcu isn't strictly required for
lifetime management, since the fput path is serialized on ep->mtx to
call ep_remove; hence the epi->ffd.file pointer remains stable during
our seq_start/seq_stop bracketing.

To be able to continue from the position we were iterating at, we store
the epi->ffd.fd and use ep_find_tfd to find the target file again. It
would be more appropriate to use both the struct file pointer and the
fd number to find the last file, but see below for why that cannot be
done.

Taking a reference to the struct file and walking the RB-Tree to find
it again would lead to a reference cycle issue if the iterator, after a
partial read, takes a reference to a socket that is later used to
create a descriptor cycle using SCM_RIGHTS. An example that was
encountered when working on this is described below.

  Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
  ITER.
  Let SK1 be registered in EP, then on a partial read it is possible
  that ITER returns from read and takes reference to SK1 to be able to
  find it later in RB-Tree and continue the iteration.  If SK1 sends
  ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using
  SCM_RIGHTS, and both fds are not consumed on the corresponding receive
  ends, a cycle is created.  When all of SK1, SK2, EP, and ITER are
  closed, SK1's receive queue holds reference to SK2, and SK2's receive
  queue holds reference to ITER, which holds a reference to SK1.
  All file descriptors except EP leak.

To resolve it, we would need to hook into the Unix socket GC mechanism,
but the alternative of using ep_find_tfd is much simpler. Finding the
last position in the face of concurrent modification of the epoll set
is at best an approximation anyway. For the case of CRIU, the epoll set
remains stable.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
 include/linux/bpf.h            |   5 +-
 include/uapi/linux/bpf.h       |   3 +
 tools/include/uapi/linux/bpf.h |   3 +
 4 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 06f4c5ae1451..aa21628b6307 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -37,6 +37,7 @@
 #include <linux/seq_file.h>
 #include <linux/compat.h>
 #include <linux/rculist.h>
+#include <linux/btf_ids.h>
 #include <net/busy_poll.h>
 
 /*
@@ -985,7 +986,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
 	return epir;
 }
 
-#ifdef CONFIG_KCMP
 static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long toff)
 {
 	struct rb_node *rbp;
@@ -1005,6 +1005,7 @@ static struct epitem *ep_find_tfd(struct eventpoll *ep, int tfd, unsigned long t
 	return NULL;
 }
 
+#ifdef CONFIG_KCMP
 struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
 				     unsigned long toff)
 {
@@ -2385,3 +2386,196 @@ static int __init eventpoll_init(void)
 	return 0;
 }
 fs_initcall(eventpoll_init);
+
+#ifdef CONFIG_BPF_SYSCALL
+
+BTF_ID_LIST(btf_epoll_ids)
+BTF_ID(struct, eventpoll)
+BTF_ID(struct, epitem)
+
+struct bpf_epoll_iter_seq_info {
+	struct eventpoll *ep;
+	struct rb_node *rbp;
+	int tfd;
+};
+
+static int bpf_epoll_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_epoll_iter_seq_info *info = priv_data;
+
+	info->ep = aux->ep->private_data;
+	info->tfd = -1;
+	return 0;
+}
+
+static int bpf_epoll_iter_attach(struct bpf_prog *prog,
+				 union bpf_iter_link_info *linfo,
+				 struct bpf_iter_aux_info *aux)
+{
+	struct file *file;
+	int ret;
+
+	file = fget(linfo->epoll.epoll_fd);
+	if (!file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(!is_file_epoll(file)))
+		goto out_fput;
+
+	aux->ep = file;
+	return 0;
+out_fput:
+	fput(file);
+	return ret;
+}
+
+static void bpf_epoll_iter_detach(struct bpf_iter_aux_info *aux)
+{
+	fput(aux->ep);
+}
+
+struct bpf_iter__epoll {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct eventpoll *, ep);
+	__bpf_md_ptr(struct epitem *, epi);
+};
+
+static void *bpf_epoll_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	mutex_lock(&info->ep->mtx);
+	/* already iterated? */
+	if (info->tfd == -2)
+		return NULL;
+	/* partially iterated */
+	if (info->tfd >= 0) {
+		epi = ep_find_tfd(info->ep, info->tfd, 0);
+		if (!epi)
+			return NULL;
+		info->rbp = &epi->rbn;
+		return epi;
+	}
+	WARN_ON(info->tfd != -1);
+	/* first iteration */
+	info->rbp = rb_first_cached(&info->ep->rbr);
+	if (!info->rbp)
+		return NULL;
+	if (*pos == 0)
+		++*pos;
+	return rb_entry(info->rbp, struct epitem, rbn);
+}
+
+static void *bpf_epoll_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+
+	++*pos;
+	info->rbp = rb_next(info->rbp);
+	return info->rbp ? rb_entry(info->rbp, struct epitem, rbn) : NULL;
+}
+
+DEFINE_BPF_ITER_FUNC(epoll, struct bpf_iter_meta *meta, struct eventpoll *ep,
+		     struct epitem *epi)
+
+static int __bpf_epoll_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct bpf_iter__epoll ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, in_stop);
+	if (!prog)
+		return 0;
+
+	ctx.meta = &meta;
+	ctx.ep = info->ep;
+	ctx.epi = v;
+	if (ctx.epi) {
+		/* The file we are going to pass to prog may already have its f_count as
+		 * 0, hence before invoking the prog, we always try to get the reference
+		 * if it isn't zero, failing which we skip the file. This is usually the
+		 * case for files that are closed before calling EPOLL_CTL_DEL for them,
+		 * which would wait for us to release ep->mtx before doing ep_remove.
+		 */
+		rcu_read_lock();
+		ret = get_file_rcu(ctx.epi->ffd.file);
+		rcu_read_unlock();
+		if (!ret)
+			return 0;
+	}
+	ret = bpf_iter_run_prog(prog, &ctx);
+	/* fput queues the final put asynchronously: via task_work for a
+	 * non-exiting task, and via delayed_fput otherwise. Hence, holding
+	 * ep->mtx and calling fput (whose release path takes the same
+	 * lock) in this context will not deadlock us, in case f_count
+	 * is 1 at this point.
+	 */
+	if (ctx.epi)
+		fput(ctx.epi->ffd.file);
+	return ret;
+}
+
+static int bpf_epoll_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_epoll_seq_show(seq, v, false);
+}
+
+static void bpf_epoll_seq_stop(struct seq_file *seq, void *v)
+{
+	struct bpf_epoll_iter_seq_info *info = seq->private;
+	struct epitem *epi;
+
+	if (!v) {
+		__bpf_epoll_seq_show(seq, v, true);
+		/* done iterating */
+		info->tfd = -2;
+	} else {
+		epi = rb_entry(info->rbp, struct epitem, rbn);
+		info->tfd = epi->ffd.fd;
+	}
+	mutex_unlock(&info->ep->mtx);
+}
+
+static const struct seq_operations bpf_epoll_seq_ops = {
+	.start = bpf_epoll_seq_start,
+	.next  = bpf_epoll_seq_next,
+	.stop  = bpf_epoll_seq_stop,
+	.show  = bpf_epoll_seq_show,
+};
+
+static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
+	.seq_ops          = &bpf_epoll_seq_ops,
+	.init_seq_private = bpf_epoll_init_seq,
+	.seq_priv_size    = sizeof(struct bpf_epoll_iter_seq_info),
+};
+
+static struct bpf_iter_reg epoll_reg_info = {
+	.target            = "epoll",
+	.feature           = BPF_ITER_RESCHED,
+	.attach_target     = bpf_epoll_iter_attach,
+	.detach_target     = bpf_epoll_iter_detach,
+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__epoll, ep),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__epoll, epi),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_epoll_seq_info,
+};
+
+static int __init epoll_iter_init(void)
+{
+	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
+	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
+	return bpf_iter_reg_target(&epoll_reg_info);
+}
+late_initcall(epoll_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fe7b499da781..eb1c9acdc40b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
 struct io_ring_ctx;
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
-	struct io_ring_ctx *ctx;
+	union {
+		struct io_ring_ctx *ctx;
+		struct file *ep;
+	};
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b70e9da3d722..64e18c1dcfca 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32   io_uring_fd;
 	} io_uring;
+	struct {
+		__u32   epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b70e9da3d722..64e18c1dcfca 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
 	struct {
 		__u32   io_uring_fd;
 	} io_uring;
+	struct {
+		__u32   epoll_fd;
+	} epoll;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
-- 
2.33.1



* [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
                   ` (3 preceding siblings ...)
  2021-11-16  5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-18 17:54   ` Yonghong Song
  2021-11-16  5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

This exercises the io_uring_buf and io_uring_file iterators, and tests
sparse file sets as well.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 226 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
 2 files changed, 276 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 3e10abce3e5a..baf11fe2f88d 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -1,6 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2020 Facebook */
+#include <sys/mman.h>
 #include <test_progs.h>
+#include <linux/io_uring.h>
+
 #include "bpf_iter_ipv6_route.skel.h"
 #include "bpf_iter_netlink.skel.h"
 #include "bpf_iter_bpf_map.skel.h"
@@ -26,6 +29,7 @@
 #include "bpf_iter_bpf_sk_storage_map.skel.h"
 #include "bpf_iter_test_kern5.skel.h"
 #include "bpf_iter_test_kern6.skel.h"
+#include "bpf_iter_io_uring.skel.h"
 
 static int duration;
 
@@ -1239,6 +1243,224 @@ static void test_task_vma(void)
 	bpf_iter_task_vma__destroy(skel);
 }
 
+static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
+{
+	return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_BUFFERS, iovs, nr);
+}
+
+static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
+{
+	return syscall(__NR_io_uring_register, io_uring_fd,
+		       IORING_REGISTER_FILES, fds, nr);
+}
+
+static unsigned long long page_addr_to_pfn(unsigned long addr)
+{
+	int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
+	unsigned long long pfn;
+
+	if (page_size < 0)
+		return 0;
+	fd = open("/proc/self/pagemap", O_RDONLY);
+	if (fd < 0)
+		return 0;
+
+	ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
+	close(fd);
+	if (ret < 0)
+		return 0;
+	/* Bits 0-54 have PFN for non-swapped page */
+	return pfn & 0x7fffffffffffff;
+}
+
+void test_io_uring_buf(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char rbuf[4096], buf[4096] = "B\n";
+	union bpf_iter_link_info linfo;
+	struct bpf_iter_io_uring *skel;
+	int ret, fd, i, len = 128;
+	struct io_uring_params p;
+	struct iovec iovs[8];
+	int iter_fd;
+	char *str;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_io_uring__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(iovs); i++) {
+		iovs[i].iov_len	 = len;
+		iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
+					MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+		if (iovs[i].iov_base == MAP_FAILED)
+			goto end;
+		len *= 2;
+	}
+
+	memset(&p, 0, sizeof(p));
+	fd = sys_io_uring_setup(1, &p);
+	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+		goto end;
+
+	linfo.io_uring.io_uring_fd = fd;
+	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
+								 &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
+		goto end_close_fd;
+
+	ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
+	if (!ASSERT_OK(ret, "io_uring_register_bufs"))
+		goto end_close_fd;
+
+	/* "B\n" */
+	len = 2;
+	str = buf + len;
+	for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
+		ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
+			       (unsigned long)iovs[j].iov_base,
+			       iovs[j].iov_len);
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+
+		ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
+			       page_addr_to_pfn((unsigned long)iovs[j].iov_base));
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_fd;
+		len += ret;
+		str += ret;
+	}
+
+	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+		goto end_close_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end_close_fd;
+
+	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+		goto end_close_iter;
+
+	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
+
+	puts("=== Expected Output ===");
+	printf("%s", buf);
+	puts("==== Actual Output ====");
+	printf("%s", rbuf);
+	puts("=======================");
+
+end_close_iter:
+	close(iter_fd);
+end_close_fd:
+	close(fd);
+end:
+	while (i--)
+		munmap(iovs[i].iov_base, iovs[i].iov_len);
+	bpf_iter_io_uring__destroy(skel);
+}
+
+void test_io_uring_file(void)
+{
+	int reg_files[] = { [0 ... 7] = -1 };
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char buf[4096] = "B\n", rbuf[4096] = {}, *str;
+	union bpf_iter_link_info linfo = {};
+	struct bpf_iter_io_uring *skel;
+	int iter_fd, fd, len = 0, ret;
+	struct io_uring_params p;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_io_uring__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
+		return;
+
+	/* "B\n" */
+	len = 2;
+	str = buf + len;
+	ret = snprintf(str, sizeof(buf) - len, "B\n");
+	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
+		char templ[] = "/tmp/io_uringXXXXXX";
+		const char *name, *def = "<none>";
+
+		/* create sparse set */
+		if (i & 1) {
+			name = def;
+		} else {
+			reg_files[i] = mkstemp(templ);
+			if (!ASSERT_GE(reg_files[i], 0, templ))
+				goto end_close_reg_files;
+			name = templ;
+			ASSERT_OK(unlink(name), "unlink");
+		}
+		ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? " (deleted)" : "");
+		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+			goto end_close_reg_files;
+		len += ret;
+		str += ret;
+	}
+
+	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files));
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
+		goto end_close_reg_files;
+
+	memset(&p, 0, sizeof(p));
+	fd = sys_io_uring_setup(1, &p);
+	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
+		goto end_close_reg_files;
+
+	linfo.io_uring.io_uring_fd = fd;
+	skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
+								  &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter"))
+		goto end_close_fd;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end_close_fd;
+
+	ret = io_uring_register_files(fd, reg_files, ARRAY_SIZE(reg_files));
+	if (!ASSERT_OK(ret, "io_uring_register_files"))
+		goto end_iter_fd;
+
+	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
+		goto end_iter_fd;
+
+	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
+
+	puts("=== Expected Output ===");
+	printf("%s", buf);
+	puts("==== Actual Output ====");
+	printf("%s", rbuf);
+	puts("=======================");
+end_iter_fd:
+	close(iter_fd);
+end_close_fd:
+	close(fd);
+end_close_reg_files:
+	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
+		if (reg_files[i] != -1)
+			close(reg_files[i]);
+	}
+end:
+	bpf_iter_io_uring__destroy(skel);
+}
+
 void test_bpf_iter(void)
 {
 	if (test__start_subtest("btf_id_or_null"))
@@ -1299,4 +1521,8 @@ void test_bpf_iter(void)
 		test_rdonly_buf_out_of_bound();
 	if (test__start_subtest("buf-neg-offset"))
 		test_buf_neg_offset();
+	if (test__start_subtest("io_uring_buf"))
+		test_io_uring_buf();
+	if (test__start_subtest("io_uring_file"))
+		test_io_uring_file();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
new file mode 100644
index 000000000000..caf8bd0bf8d4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+SEC("iter/io_uring_buf")
+int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
+{
+	struct io_mapped_ubuf *ubuf = ctx->ubuf;
+	struct seq_file *seq = ctx->meta->seq;
+	unsigned int index = ctx->index;
+
+	if (!ctx->meta->seq_num)
+		BPF_SEQ_PRINTF(seq, "B\n");
+
+	if (ubuf) {
+		BPF_SEQ_PRINTF(seq, "%u:0x%lx:%lu\n", index, (unsigned long)ubuf->ubuf,
+			       (unsigned long)ubuf->ubuf_end - ubuf->ubuf);
+		BPF_SEQ_PRINTF(seq, "`-PFN for bvec[0]=%lu\n",
+			       (unsigned long)bpf_page_to_pfn(ubuf->bvec[0].bv_page));
+	} else {
+		BPF_SEQ_PRINTF(seq, "E:%u\n", index);
+	}
+	return 0;
+}
+
+SEC("iter/io_uring_file")
+int dump_io_uring_file(struct bpf_iter__io_uring_file *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	unsigned int index = ctx->index;
+	struct file *file = ctx->file;
+	char buf[256] = "";
+
+	if (!ctx->meta->seq_num)
+		BPF_SEQ_PRINTF(seq, "B\n");
+	/* for io_uring_file iterator, this is the terminating condition */
+	if (ctx->ctx->nr_user_files == index) {
+		BPF_SEQ_PRINTF(seq, "E:%u\n", index);
+		return 0;
+	}
+	if (file) {
+		bpf_d_path(&file->f_path, buf, sizeof(buf));
+		BPF_SEQ_PRINTF(seq, "%u:%s\n", index, buf);
+	} else {
+		BPF_SEQ_PRINTF(seq, "%u:<none>\n", index);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.33.1



* [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
                   ` (4 preceding siblings ...)
  2021-11-16  5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
  7 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexander Viro, linux-fsdevel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, io-uring

This tests the epoll iterator, including peeking into the epitem to
inspect the registered file and fd number, and verifying that in
userspace.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 121 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_epoll.c      |  33 +++++
 2 files changed, 154 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_epoll.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index baf11fe2f88d..45e186385f9a 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2020 Facebook */
 #include <sys/mman.h>
+#include <sys/epoll.h>
 #include <test_progs.h>
 #include <linux/io_uring.h>
 
@@ -30,6 +31,7 @@
 #include "bpf_iter_test_kern5.skel.h"
 #include "bpf_iter_test_kern6.skel.h"
 #include "bpf_iter_io_uring.skel.h"
+#include "bpf_iter_epoll.skel.h"
 
 static int duration;
 
@@ -1461,6 +1463,123 @@ void test_io_uring_file(void)
 	bpf_iter_io_uring__destroy(skel);
 }
 
+void test_epoll(void)
+{
+	const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n";
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	char buf[4096] = {}, rbuf[4096] = {};
+	union bpf_iter_link_info linfo;
+	int fds[2], sk[2], epfd, ret;
+	struct bpf_iter_epoll *skel;
+	struct epoll_event ev = {};
+	int iter_fd, set[4] = {};
+	char *s, *t;
+
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	skel = bpf_iter_epoll__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_epoll__open_and_load"))
+		return;
+
+	epfd = epoll_create1(EPOLL_CLOEXEC);
+	if (!ASSERT_GE(epfd, 0, "epoll_create1"))
+		goto end;
+
+	ret = pipe(fds);
+	if (!ASSERT_OK(ret, "pipe(fds)"))
+		goto end_epfd;
+
+	ret = socketpair(AF_UNIX, SOCK_STREAM, 0, sk);
+	if (!ASSERT_OK(ret, "socketpair"))
+		goto end_pipe;
+
+	ev.events = EPOLLIN;
+
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
+	if (!ASSERT_OK(ret, "epoll_ctl"))
+		goto end_sk;
+
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[0], &ev);
+	if (!ASSERT_OK(ret, "epoll_ctl"))
+		goto end_sk;
+
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[1], &ev);
+	if (!ASSERT_OK(ret, "epoll_ctl"))
+		goto end_sk;
+
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, sk[1], &ev);
+	if (!ASSERT_OK(ret, "epoll_ctl"))
+		goto end_sk;
+
+	linfo.epoll.epoll_fd = epfd;
+	skel->links.dump_epoll = bpf_program__attach_iter(skel->progs.dump_epoll, &opts);
+	if (!ASSERT_OK_PTR(skel->links.dump_epoll, "bpf_program__attach_iter"))
+		goto end_sk;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_epoll));
+	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
+		goto end_sk;
+
+	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, iter_fd, &ev);
+	if (!ASSERT_EQ(ret, -1, "epoll_ctl add for iter_fd"))
+		goto end_iter_fd;
+
+	ret = snprintf(buf, sizeof(buf), fmt, fds[0], sk[0], fds[1], sk[1]);
+	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf"))
+		goto end_iter_fd;
+
+	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
+		goto end_iter_fd;
+
+	puts("=== Expected Output ===");
+	printf("%s", buf);
+	puts("==== Actual Output ====");
+	printf("%s", rbuf);
+	puts("=======================");
+
+	s = rbuf;
+	while ((s = strtok_r(s, "\n", &t))) {
+		int fd = -1;
+
+		if (s[0] == 'B' || s[0] == 'E')
+			goto next;
+		ASSERT_EQ(sscanf(s, s[0] == 'p' ? "pipe:%d" : "socket:%d", &fd), 1, s);
+		if (fd == fds[0]) {
+			ASSERT_NEQ(set[0], 1, "pipe[0]");
+			set[0] = 1;
+		} else if (fd == fds[1]) {
+			ASSERT_NEQ(set[1], 1, "pipe[1]");
+			set[1] = 1;
+		} else if (fd == sk[0]) {
+			ASSERT_NEQ(set[2], 1, "sk[0]");
+			set[2] = 1;
+		} else if (fd == sk[1]) {
+			ASSERT_NEQ(set[3], 1, "sk[1]");
+			set[3] = 1;
+		} else {
+			ASSERT_TRUE(0, "Incorrect fd in iterator output");
+		}
+next:
+		s = NULL;
+	}
+	for (int i = 0; i < ARRAY_SIZE(set); i++)
+		ASSERT_EQ(set[i], 1, "fd found");
+end_iter_fd:
+	close(iter_fd);
+end_sk:
+	close(sk[1]);
+	close(sk[0]);
+end_pipe:
+	close(fds[1]);
+	close(fds[0]);
+end_epfd:
+	close(epfd);
+end:
+	bpf_iter_epoll__destroy(skel);
+}
+
 void test_bpf_iter(void)
 {
 	if (test__start_subtest("btf_id_or_null"))
@@ -1525,4 +1644,6 @@ void test_bpf_iter(void)
 		test_io_uring_buf();
 	if (test__start_subtest("io_uring_file"))
 		test_io_uring_file();
+	if (test__start_subtest("epoll"))
+		test_epoll();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
new file mode 100644
index 000000000000..0afc74d154a1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_epoll.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+
+extern void pipefifo_fops __ksym;
+
+SEC("iter/epoll")
+int dump_epoll(struct bpf_iter__epoll *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct epitem *epi = ctx->epi;
+	char sstr[] = "socket";
+	char pstr[] = "pipe";
+
+	if (!ctx->meta->seq_num) {
+		BPF_SEQ_PRINTF(seq, "B\n");
+	}
+	if (epi) {
+		struct file *f = epi->ffd.file;
+		char *str;
+
+		if (f->f_op == &pipefifo_fops)
+			str = pstr;
+		else
+			str = sstr;
+		BPF_SEQ_PRINTF(seq, "%s:%d\n", str, epi->ffd.fd);
+	} else {
+		BPF_SEQ_PRINTF(seq, "E\n");
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.33.1



* [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
                   ` (5 preceding siblings ...)
  2021-11-16  5:42 ` [PATCH bpf-next v1 6/8] selftests/bpf: Add test for epoll BPF iterator Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  2021-11-16  5:42 ` [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info Kumar Kartikeya Dwivedi
  7 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

Ensure that the output is consistent in the face of partial reads that
return to userspace and then resume later. To this end, we do reads in
1-byte chunks, which is a bit stupid in real life, but works well to
simulate interrupted iteration. This also tests the case where the
seq_file buffer is consumed (after seq_printf) on an interrupted read
before the iterator invokes the BPF prog again.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 .../selftests/bpf/prog_tests/bpf_iter.c       | 33 ++++++++++++-------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 45e186385f9a..c27f3e10211c 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -72,13 +72,13 @@ static void do_dummy_read(struct bpf_program *prog)
 	bpf_link__destroy(link);
 }
 
-static int read_fd_into_buffer(int fd, char *buf, int size)
+static int __read_fd_into_buffer(int fd, char *buf, int size, size_t chunks)
 {
 	int bufleft = size;
 	int len;
 
 	do {
-		len = read(fd, buf, bufleft);
+		len = read(fd, buf, chunks ?: bufleft);
 		if (len > 0) {
 			buf += len;
 			bufleft -= len;
@@ -88,6 +88,11 @@ static int read_fd_into_buffer(int fd, char *buf, int size)
 	return len < 0 ? len : size - bufleft;
 }
 
+static int read_fd_into_buffer(int fd, char *buf, int size)
+{
+	return __read_fd_into_buffer(fd, buf, size, 0);
+}
+
 static void test_ipv6_route(void)
 {
 	struct bpf_iter_ipv6_route *skel;
@@ -1281,7 +1286,7 @@ static unsigned long long page_addr_to_pfn(unsigned long addr)
 	return pfn & 0x7fffffffffffff;
 }
 
-void test_io_uring_buf(void)
+void test_io_uring_buf(bool partial)
 {
 	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
 	char rbuf[4096], buf[4096] = "B\n";
@@ -1352,7 +1357,7 @@ void test_io_uring_buf(void)
 	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
 		goto end_close_fd;
 
-	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
 	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
 		goto end_close_iter;
 
@@ -1374,7 +1379,7 @@ void test_io_uring_buf(void)
 	bpf_iter_io_uring__destroy(skel);
 }
 
-void test_io_uring_file(void)
+void test_io_uring_file(bool partial)
 {
 	int reg_files[] = { [0 ... 7] = -1 };
 	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
@@ -1439,7 +1444,7 @@ void test_io_uring_file(void)
 	if (!ASSERT_OK(ret, "io_uring_register_files"))
 		goto end_iter_fd;
 
-	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
 	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
 		goto end_iter_fd;
 
@@ -1463,7 +1468,7 @@ void test_io_uring_file(void)
 	bpf_iter_io_uring__destroy(skel);
 }
 
-void test_epoll(void)
+void test_epoll(bool partial)
 {
 	const char *fmt = "B\npipe:%d\nsocket:%d\npipe:%d\nsocket:%d\nE\n";
 	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
@@ -1529,7 +1534,7 @@ void test_epoll(void)
 	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf), "snprintf"))
 		goto end_iter_fd;
 
-	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
+	ret = __read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf), partial);
 	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
 		goto end_iter_fd;
 
@@ -1641,9 +1646,15 @@ void test_bpf_iter(void)
 	if (test__start_subtest("buf-neg-offset"))
 		test_buf_neg_offset();
 	if (test__start_subtest("io_uring_buf"))
-		test_io_uring_buf();
+		test_io_uring_buf(false);
 	if (test__start_subtest("io_uring_file"))
-		test_io_uring_file();
+		test_io_uring_file(false);
 	if (test__start_subtest("epoll"))
-		test_epoll();
+		test_epoll(false);
+	if (test__start_subtest("io_uring_buf-partial"))
+		test_io_uring_buf(true);
+	if (test__start_subtest("io_uring_file-partial"))
+		test_io_uring_file(true);
+	if (test__start_subtest("epoll-partial"))
+		test_epoll(true);
 }
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH bpf-next v1 8/8] selftests/bpf: Fix btf_dump test for bpf_iter_link_info
  2021-11-16  5:42 [PATCH bpf-next v1 0/8] Introduce BPF iterators for io_uring and epoll Kumar Kartikeya Dwivedi
                   ` (6 preceding siblings ...)
  2021-11-16  5:42 ` [PATCH bpf-next v1 7/8] selftests/bpf: Test partial reads for io_uring, epoll iterators Kumar Kartikeya Dwivedi
@ 2021-11-16  5:42 ` Kumar Kartikeya Dwivedi
  7 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-16  5:42 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

Since the definition of union bpf_iter_link_info changed while adding
io_uring and epoll iterator support, adjust the selftest to check
against the updated definition.
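
For reference, union bpf_iter_link_info as it stands after the earlier
patches in this series:

	union bpf_iter_link_info {
		struct {
			__u32	map_fd;
		} map;
		struct {
			__u32	io_uring_fd;
		} io_uring;
		struct {
			__u32	epoll_fd;
		} epoll;
	};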

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 tools/testing/selftests/bpf/prog_tests/btf_dump.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
index d6272013a5a3..a2fc006e074a 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
@@ -736,7 +736,9 @@ static void test_btf_dump_struct_data(struct btf *btf, struct btf_dump *d,
 
 	/* union with nested struct */
 	TEST_BTF_DUMP_DATA(btf, d, "union", str, union bpf_iter_link_info, BTF_F_COMPACT,
-			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},}",
+			   "(union bpf_iter_link_info){.map = (struct){.map_fd = (__u32)1,},"
+			   ".io_uring = (struct){.io_uring_fd = (__u32)1,},"
+			   ".epoll = (struct){.epoll_fd = (__u32)1,},}",
 			   { .map = { .map_fd = 1 }});
 
 	/* struct skb with nested structs/unions; because type output is so
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
@ 2021-11-17 12:35     ` kernel test robot
  2021-11-17 13:39     ` kernel test robot
  2021-11-18 17:27   ` Yonghong Song
  2 siblings, 0 replies; 34+ messages in thread
From: kernel test robot @ 2021-11-17 12:35 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

Hi Kumar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Kumar-Kartikeya-Dwivedi/Introduce-BPF-iterators-for-io_uring-and-epoll/20211116-140234
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-randconfig-r006-20211116 (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/567c9b678d5ade1a63c993cfa10394902d4671ca
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Kumar-Kartikeya-Dwivedi/Introduce-BPF-iterators-for-io_uring-and-epoll/20211116-140234
        git checkout 567c9b678d5ade1a63c993cfa10394902d4671ca
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: kernel/trace/bpf_trace.o: in function `bpf_tracing_func_proto':
>> bpf_trace.c:(.text+0x1930): undefined reference to `bpf_page_to_pfn_proto'
>> arm-linux-gnueabi-ld: bpf_trace.c:(.text+0x1934): undefined reference to `bpf_page_to_pfn_proto'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45725 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
@ 2021-11-17 13:39     ` kernel test robot
  2021-11-17 13:39     ` kernel test robot
  2021-11-18 17:27   ` Yonghong Song
  2 siblings, 0 replies; 34+ messages in thread
From: kernel test robot @ 2021-11-17 13:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6825 bytes --]

Hi Kumar,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Kumar-Kartikeya-Dwivedi/Introduce-BPF-iterators-for-io_uring-and-epoll/20211116-140234
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-a014-20211116 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/567c9b678d5ade1a63c993cfa10394902d4671ca
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Kumar-Kartikeya-Dwivedi/Introduce-BPF-iterators-for-io_uring-and-epoll/20211116-140234
        git checkout 567c9b678d5ade1a63c993cfa10394902d4671ca
        # save the attached .config to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: kernel/trace/bpf_trace.o: in function `bpf_tracing_func_proto':
>> kernel/trace/bpf_trace.c:1216: undefined reference to `bpf_page_to_pfn_proto'


vim +1216 kernel/trace/bpf_trace.c

  1093	
  1094	static const struct bpf_func_proto *
  1095	bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
  1096	{
  1097		switch (func_id) {
  1098		case BPF_FUNC_map_lookup_elem:
  1099			return &bpf_map_lookup_elem_proto;
  1100		case BPF_FUNC_map_update_elem:
  1101			return &bpf_map_update_elem_proto;
  1102		case BPF_FUNC_map_delete_elem:
  1103			return &bpf_map_delete_elem_proto;
  1104		case BPF_FUNC_map_push_elem:
  1105			return &bpf_map_push_elem_proto;
  1106		case BPF_FUNC_map_pop_elem:
  1107			return &bpf_map_pop_elem_proto;
  1108		case BPF_FUNC_map_peek_elem:
  1109			return &bpf_map_peek_elem_proto;
  1110		case BPF_FUNC_ktime_get_ns:
  1111			return &bpf_ktime_get_ns_proto;
  1112		case BPF_FUNC_ktime_get_boot_ns:
  1113			return &bpf_ktime_get_boot_ns_proto;
  1114		case BPF_FUNC_ktime_get_coarse_ns:
  1115			return &bpf_ktime_get_coarse_ns_proto;
  1116		case BPF_FUNC_tail_call:
  1117			return &bpf_tail_call_proto;
  1118		case BPF_FUNC_get_current_pid_tgid:
  1119			return &bpf_get_current_pid_tgid_proto;
  1120		case BPF_FUNC_get_current_task:
  1121			return &bpf_get_current_task_proto;
  1122		case BPF_FUNC_get_current_task_btf:
  1123			return &bpf_get_current_task_btf_proto;
  1124		case BPF_FUNC_task_pt_regs:
  1125			return &bpf_task_pt_regs_proto;
  1126		case BPF_FUNC_get_current_uid_gid:
  1127			return &bpf_get_current_uid_gid_proto;
  1128		case BPF_FUNC_get_current_comm:
  1129			return &bpf_get_current_comm_proto;
  1130		case BPF_FUNC_trace_printk:
  1131			return bpf_get_trace_printk_proto();
  1132		case BPF_FUNC_get_smp_processor_id:
  1133			return &bpf_get_smp_processor_id_proto;
  1134		case BPF_FUNC_get_numa_node_id:
  1135			return &bpf_get_numa_node_id_proto;
  1136		case BPF_FUNC_perf_event_read:
  1137			return &bpf_perf_event_read_proto;
  1138		case BPF_FUNC_current_task_under_cgroup:
  1139			return &bpf_current_task_under_cgroup_proto;
  1140		case BPF_FUNC_get_prandom_u32:
  1141			return &bpf_get_prandom_u32_proto;
  1142		case BPF_FUNC_probe_write_user:
  1143			return security_locked_down(LOCKDOWN_BPF_WRITE_USER) < 0 ?
  1144			       NULL : bpf_get_probe_write_proto();
  1145		case BPF_FUNC_probe_read_user:
  1146			return &bpf_probe_read_user_proto;
  1147		case BPF_FUNC_probe_read_kernel:
  1148			return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
  1149			       NULL : &bpf_probe_read_kernel_proto;
  1150		case BPF_FUNC_probe_read_user_str:
  1151			return &bpf_probe_read_user_str_proto;
  1152		case BPF_FUNC_probe_read_kernel_str:
  1153			return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
  1154			       NULL : &bpf_probe_read_kernel_str_proto;
  1155	#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
  1156		case BPF_FUNC_probe_read:
  1157			return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
  1158			       NULL : &bpf_probe_read_compat_proto;
  1159		case BPF_FUNC_probe_read_str:
  1160			return security_locked_down(LOCKDOWN_BPF_READ_KERNEL) < 0 ?
  1161			       NULL : &bpf_probe_read_compat_str_proto;
  1162	#endif
  1163	#ifdef CONFIG_CGROUPS
  1164		case BPF_FUNC_get_current_cgroup_id:
  1165			return &bpf_get_current_cgroup_id_proto;
  1166		case BPF_FUNC_get_current_ancestor_cgroup_id:
  1167			return &bpf_get_current_ancestor_cgroup_id_proto;
  1168	#endif
  1169		case BPF_FUNC_send_signal:
  1170			return &bpf_send_signal_proto;
  1171		case BPF_FUNC_send_signal_thread:
  1172			return &bpf_send_signal_thread_proto;
  1173		case BPF_FUNC_perf_event_read_value:
  1174			return &bpf_perf_event_read_value_proto;
  1175		case BPF_FUNC_get_ns_current_pid_tgid:
  1176			return &bpf_get_ns_current_pid_tgid_proto;
  1177		case BPF_FUNC_ringbuf_output:
  1178			return &bpf_ringbuf_output_proto;
  1179		case BPF_FUNC_ringbuf_reserve:
  1180			return &bpf_ringbuf_reserve_proto;
  1181		case BPF_FUNC_ringbuf_submit:
  1182			return &bpf_ringbuf_submit_proto;
  1183		case BPF_FUNC_ringbuf_discard:
  1184			return &bpf_ringbuf_discard_proto;
  1185		case BPF_FUNC_ringbuf_query:
  1186			return &bpf_ringbuf_query_proto;
  1187		case BPF_FUNC_jiffies64:
  1188			return &bpf_jiffies64_proto;
  1189		case BPF_FUNC_get_task_stack:
  1190			return &bpf_get_task_stack_proto;
  1191		case BPF_FUNC_copy_from_user:
  1192			return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
  1193		case BPF_FUNC_snprintf_btf:
  1194			return &bpf_snprintf_btf_proto;
  1195		case BPF_FUNC_per_cpu_ptr:
  1196			return &bpf_per_cpu_ptr_proto;
  1197		case BPF_FUNC_this_cpu_ptr:
  1198			return &bpf_this_cpu_ptr_proto;
  1199		case BPF_FUNC_task_storage_get:
  1200			return &bpf_task_storage_get_proto;
  1201		case BPF_FUNC_task_storage_delete:
  1202			return &bpf_task_storage_delete_proto;
  1203		case BPF_FUNC_for_each_map_elem:
  1204			return &bpf_for_each_map_elem_proto;
  1205		case BPF_FUNC_snprintf:
  1206			return &bpf_snprintf_proto;
  1207		case BPF_FUNC_get_func_ip:
  1208			return &bpf_get_func_ip_proto_tracing;
  1209		case BPF_FUNC_get_branch_snapshot:
  1210			return &bpf_get_branch_snapshot_proto;
  1211		case BPF_FUNC_find_vma:
  1212			return &bpf_find_vma_proto;
  1213		case BPF_FUNC_trace_vprintk:
  1214			return bpf_get_trace_vprintk_proto();
  1215		case BPF_FUNC_page_to_pfn:
> 1216			return &bpf_page_to_pfn_proto;
  1217		default:
  1218			return bpf_base_func_proto(func_id);
  1219		}
  1220	}
  1221	
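
bpf_page_to_pfn_proto is defined in fs/io_uring.c, so any config that
does not build that file cannot resolve the reference from bpf_trace.c.
Assuming the failing randconfigs have CONFIG_IO_URING=n, one possible
fix is to make the reference conditional (sketch only):

	case BPF_FUNC_page_to_pfn:
#ifdef CONFIG_IO_URING
		return &bpf_page_to_pfn_proto;
#else
		return NULL;
#endif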

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46321 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
@ 2021-11-18 17:21   ` Yonghong Song
  2021-11-18 18:28     ` Kumar Kartikeya Dwivedi
  2021-11-18 22:02   ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 17:21 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for buffers registered in io_uring ctx.
> It gives access to the ctx, the index of the registered buffer, and a
> pointer to the io_mapped_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like exact struct page composing
> the registered buffer).
> 
> The primary usecase this is enabling is checkpoint/restore support.
> 
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file.  We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
> 
> This can lead to a deadlock scenario described below:
> 
>        CPU 0				CPU 1
> 
>        vfs_read
>        mutex_lock(&seq_file->lock)	io_read
> 					mutex_lock(&ctx->uring_lock)
>        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> 					mutex_lock(&seq_file->lock)

It is not clear which mutex_lock switched to mutex_trylock.
From the example below, it looks like &ctx->uring_lock. But if this is
the case, we could still have a deadlock, right?

> 
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>   io_uring_enter
>    mutex_lock(&ctx->uring_lock) <-----------.
>    io_read				    \
>     seq_read				     \
>      mutex_lock(&seq_file->lock)		     /
>      mutex_lock(&ctx->uring_lock) # deadlock-`
> 
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
> 
> In the future, this iterator will be extended to directly support
> iteration of bvec Flexible Array Member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if VMA is
> destroyed after pinning pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All this is
> out of scope for the current series however, but builds upon this
> iterator.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>   include/linux/bpf.h            |   2 +
>   include/uapi/linux/bpf.h       |   3 +
>   tools/include/uapi/linux/bpf.h |   3 +
>   4 files changed, 187 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index b07196b4511c..46a110989155 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -81,6 +81,7 @@
>   #include <linux/tracehook.h>
>   #include <linux/audit.h>
>   #include <linux/security.h>
> +#include <linux/btf_ids.h>
>   
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/io_uring.h>
> @@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
>   	return 0;
>   };
>   __initcall(io_uring_init);
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +
> +BTF_ID_LIST(btf_io_uring_ids)
> +BTF_ID(struct, io_ring_ctx)
> +BTF_ID(struct, io_mapped_ubuf)
> +
> +struct bpf_io_uring_seq_info {
> +	struct io_ring_ctx *ctx;
> +	unsigned long index;
> +};
> +
> +static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
> +{
> +	struct bpf_io_uring_seq_info *info = priv_data;
> +	struct io_ring_ctx *ctx = aux->ctx;
> +
> +	info->ctx = ctx;
> +	return 0;
> +}
> +
> +static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
> +				    union bpf_iter_link_info *linfo,
> +				    struct bpf_iter_aux_info *aux)
> +{
> +	struct io_ring_ctx *ctx;
> +	struct fd f;
> +	int ret;
> +
> +	f = fdget(linfo->io_uring.io_uring_fd);
> +	if (unlikely(!f.file))
> +		return -EBADF;
> +
> +	ret = -EOPNOTSUPP;
> +	if (unlikely(f.file->f_op != &io_uring_fops))
> +		goto out_fput;
> +
> +	ret = -ENXIO;
> +	ctx = f.file->private_data;
> +	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
> +		goto out_fput;
> +
> +	ret = 0;
> +	aux->ctx = ctx;
> +
> +out_fput:
> +	fdput(f);
> +	return ret;
> +}
> +
> +static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
> +{
> +	percpu_ref_put(&aux->ctx->refs);
> +}
> +
> +/* io_uring iterator for registered buffers */
> +
> +struct bpf_iter__io_uring_buf {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct io_ring_ctx *, ctx);
> +	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
> +	unsigned long index;
> +};

I would suggest changing "unsigned long index" to either u32 or u64.
This structure is also the bpf program context, and in the bpf program
context "index" will be u64. Then on 32-bit systems we could
potentially have issues.
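
A minimal sketch of the suggested change (assuming u64 is chosen; the
seq_info's index field and DEFINE_BPF_ITER_FUNC would change to match):

	struct bpf_iter__io_uring_buf {
		__bpf_md_ptr(struct bpf_iter_meta *, meta);
		__bpf_md_ptr(struct io_ring_ctx *, ctx);
		__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
		u64 index;
	};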

> +
> +static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
> +{
> +	if (info->index < info->ctx->nr_user_bufs)
> +		return info->ctx->user_bufs[info->index++];
> +	return NULL;
> +}
> +
> +static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct io_mapped_ubuf *ubuf;
> +
> +	/* Indicate to userspace that the uring lock is contended */
> +	if (!mutex_trylock(&info->ctx->uring_lock))
> +		return ERR_PTR(-EDEADLK);
> +
> +	ubuf = __bpf_io_uring_buf_seq_get_next(info);
> +	if (!ubuf)
> +		return NULL;
> +
> +	if (*pos == 0)
> +		++*pos;
> +	return ubuf;
> +}
> +
> +static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +
> +	++*pos;
> +	return __bpf_io_uring_buf_seq_get_next(info);
> +}
> +
> +DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
> +		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
> +		     unsigned long index)

Again, change "unsigned long" to "u32" or "u64".

> +
> +static int __bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v, bool in_stop)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct bpf_iter__io_uring_buf ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, in_stop);
> +	if (!prog)
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.ctx = info->ctx;
> +	ctx.ubuf = v;
> +	ctx.index = info->index ? info->index - !in_stop : 0;
> +
> +	return bpf_iter_run_prog(prog, &ctx);
> +}
> +
> +static int bpf_io_uring_buf_seq_show(struct seq_file *seq, void *v)
> +{
> +	return __bpf_io_uring_buf_seq_show(seq, v, false);
> +}
> +
> +static void bpf_io_uring_buf_seq_stop(struct seq_file *seq, void *v)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +
> +	/* If IS_ERR(v) is true, then ctx->uring_lock wasn't taken */
> +	if (IS_ERR(v))
> +		return;
> +	if (!v)
> +		__bpf_io_uring_buf_seq_show(seq, v, true);
> +	else if (info->index) /* restart from index */
> +		info->index--;
> +	mutex_unlock(&info->ctx->uring_lock);
> +}
> +
> +static const struct seq_operations bpf_io_uring_buf_seq_ops = {
> +	.start = bpf_io_uring_buf_seq_start,
> +	.next  = bpf_io_uring_buf_seq_next,
> +	.stop  = bpf_io_uring_buf_seq_stop,
> +	.show  = bpf_io_uring_buf_seq_show,
> +};
> +
> +static const struct bpf_iter_seq_info bpf_io_uring_buf_seq_info = {
> +	.seq_ops          = &bpf_io_uring_buf_seq_ops,
> +	.init_seq_private = bpf_io_uring_init_seq,
> +	.fini_seq_private = NULL,
> +	.seq_priv_size    = sizeof(struct bpf_io_uring_seq_info),
> +};
> +
> +static struct bpf_iter_reg io_uring_buf_reg_info = {
> +	.target            = "io_uring_buf",
> +	.feature	   = BPF_ITER_RESCHED,
> +	.attach_target     = bpf_io_uring_iter_attach,
> +	.detach_target     = bpf_io_uring_iter_detach,

Since you have this extra `io_uring_fd` for the iterator, you may want
to implement show_fdinfo and fill_link_info callback functions here.
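
For reference, those callbacks have the following signatures in struct
bpf_iter_reg (the function names here are hypothetical) and would be
wired up in io_uring_buf_reg_info next to attach_target/detach_target:

	static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
						  struct seq_file *seq);
	static int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
						    struct bpf_link_info *info);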

> +	.ctx_arg_info_size = 2,
> +	.ctx_arg_info = {
> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
> +		  PTR_TO_BTF_ID },
> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
> +		  PTR_TO_BTF_ID_OR_NULL },
> +	},
> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
> +};
> +
> +static int __init io_uring_iter_init(void)
> +{
> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
> +}
> +late_initcall(io_uring_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 56098c866704..ddb9d4520a3f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   	extern int bpf_iter_ ## target(args);			\
>   	int __init bpf_iter_ ## target(args) { return 0; }
>   
> +struct io_ring_ctx;
>   struct bpf_iter_aux_info {
>   	struct bpf_map *map;
> +	struct io_ring_ctx *ctx;
>   };

Can we use a union here? Note that in union bpf_iter_link_info below
in uapi/linux/bpf.h, map_fd and io_uring_fd are also in a union.
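
i.e., something along these lines (sketch):

	struct bpf_iter_aux_info {
		union {
			struct bpf_map *map;
			struct io_ring_ctx *ctx;
		};
	};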

>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6297eafdc40f..3323defa99a1 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32	map_fd;
>   	} map;
> +	struct {
> +		__u32   io_uring_fd;
> +	} io_uring;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 6297eafdc40f..3323defa99a1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32	map_fd;
>   	} map;
> +	struct {
> +		__u32   io_uring_fd;
> +	} io_uring;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-16  5:42 ` [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper Kumar Kartikeya Dwivedi
  2021-11-17 12:35     ` kernel test robot
  2021-11-17 13:39     ` kernel test robot
@ 2021-11-18 17:27   ` Yonghong Song
  2021-11-18 18:30     ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 17:27 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> In CRIU, we need to be able to determine whether the page pinned by
> io_uring is still present in the same range in the process VMA.
> /proc/<pid>/pagemap gives us the PFN, hence using this helper we can
> establish this mapping easily from the iterator side.
> 
> It is a simple wrapper over the in-kernel page_to_pfn helper, and
> ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is
> obtained from the bvec of io_uring_ubuf for the CRIU usecase.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   fs/io_uring.c                  | 17 +++++++++++++++++
>   include/linux/bpf.h            |  1 +
>   include/uapi/linux/bpf.h       |  9 +++++++++
>   kernel/trace/bpf_trace.c       |  2 ++
>   scripts/bpf_doc.py             |  2 ++
>   tools/include/uapi/linux/bpf.h |  9 +++++++++
>   6 files changed, 40 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 46a110989155..9e9df6767e29 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -11295,6 +11295,23 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
>   	.seq_info	   = &bpf_io_uring_buf_seq_info,
>   };
>   
> +BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
> +{
> +	/* PTR_TO_BTF_ID can be NULL */
> +	if (!page)
> +		return U64_MAX;
> +	return page_to_pfn(page);
> +}
> +
> +BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
> +
> +const struct bpf_func_proto bpf_page_to_pfn_proto = {
> +	.func		= bpf_page_to_pfn,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_BTF_ID,
> +	.arg1_btf_id	= &btf_page_to_pfn_ids[0],

Does this helper need to be gpl_only?
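
If so, that would presumably be a one-line addition to the proto
(sketch, assuming the answer is yes):

	const struct bpf_func_proto bpf_page_to_pfn_proto = {
		.func		= bpf_page_to_pfn,
		.gpl_only	= true,
		.ret_type	= RET_INTEGER,
		.arg1_type	= ARG_PTR_TO_BTF_ID,
		.arg1_btf_id	= &btf_page_to_pfn_ids[0],
	};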

> +};
> +
>   static int __init io_uring_iter_init(void)
>   {
>   	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files
  2021-11-16  5:42 ` [PATCH bpf-next v1 3/8] io_uring: Implement eBPF iterator for registered files Kumar Kartikeya Dwivedi
@ 2021-11-18 17:33   ` Yonghong Song
  0 siblings, 0 replies; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 17:33 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for files registered in io_uring ctx.
> It gives access to the ctx, the index of the registered file, and a
> pointer to the struct file itself. This allows the iterator to save
> info related to the file added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like being able to match
> registered files to a task's file set). Getting access to underlying
> struct file allows deduplication and efficient pairing with task file
> set (obtained using task_file iterator).
> 
> The primary usecase this is enabling is checkpoint/restore support.
> 
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file.  We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
> 
> This can lead to a deadlock scenario described below:
> 
>        CPU 0                             CPU 1
> 
>        vfs_read
>        mutex_lock(&seq_file->lock)       io_read
> 					mutex_lock(&ctx->uring_lock)
>        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> 					mutex_lock(&seq_file->lock)
> 
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>   io_uring_enter
>    mutex_lock(&ctx->uring_lock) <-----------.
>    io_read                                   \
>     seq_read                                  \
>      mutex_lock(&seq_file->lock)              /
>      mutex_lock(&ctx->uring_lock) # deadlock-`
> 
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
> 
> With the advent of descriptorless files supported by io_uring, this
> iterator provides the required visibility and introspection of io_uring
> instance for the purposes of dumping and restoring it.
> 
> In the future, this iterator will be extended to support direct
> inspection of a lot of file state (currently descriptorless files
> are obtained using openat2 and socket) to dump file state for these
> hidden files. Later, we can explore filling in the gaps for dumping
> file state for more file types (those not hidden in io_uring ctx).
> All this is out of scope for the current series however, but builds
> upon this iterator.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   fs/io_uring.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 139 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9e9df6767e29..7ac479c95d4e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -11132,6 +11132,7 @@ __initcall(io_uring_init);
>   BTF_ID_LIST(btf_io_uring_ids)
>   BTF_ID(struct, io_ring_ctx)
>   BTF_ID(struct, io_mapped_ubuf)
> +BTF_ID(struct, file)
>   
>   struct bpf_io_uring_seq_info {
>   	struct io_ring_ctx *ctx;
> @@ -11312,11 +11313,148 @@ const struct bpf_func_proto bpf_page_to_pfn_proto = {
>   	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
>   };
>   
> +/* io_uring iterator for registered files */
> +
> +struct bpf_iter__io_uring_file {
> +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> +	__bpf_md_ptr(struct io_ring_ctx *, ctx);
> +	__bpf_md_ptr(struct file *, file);
> +	unsigned long index;

change "unisnged long" to either u32 or u64, maybe just u64?

> +};
> +
> +static void *__bpf_io_uring_file_seq_get_next(struct bpf_io_uring_seq_info *info)
> +{
> +	struct file *file = NULL;
> +
> +	if (info->index < info->ctx->nr_user_files) {
> +		/* file set can be sparse */
> +		file = io_file_from_index(info->ctx, info->index++);
> +		/* use info as a distinct pointer to distinguish between empty
> +		 * slot and valid file, since we cannot return NULL for this
> +		 * case if we want iter prog to still be invoked with file ==
> +		 * NULL.
> +		 */
> +		if (!file)
> +			return info;
> +	}
> +
> +	return file;
> +}
> +
> +static void *bpf_io_uring_file_seq_start(struct seq_file *seq, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct file *file;
> +
> +	/* Indicate to userspace that the uring lock is contended */
> +	if (!mutex_trylock(&info->ctx->uring_lock))
> +		return ERR_PTR(-EDEADLK);
> +
> +	file = __bpf_io_uring_file_seq_get_next(info);
> +	if (!file)
> +		return NULL;
> +
> +	if (*pos == 0)
> +		++*pos;
> +	return file;
> +}
> +
> +static void *bpf_io_uring_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +
> +	++*pos;
> +	return __bpf_io_uring_file_seq_get_next(info);
> +}
> +
> +DEFINE_BPF_ITER_FUNC(io_uring_file, struct bpf_iter_meta *meta,
> +		     struct io_ring_ctx *ctx, struct file *file,
> +		     unsigned long index)

unsigned long => u64?

> +
> +static int __bpf_io_uring_file_seq_show(struct seq_file *seq, void *v, bool in_stop)
> +{
> +	struct bpf_io_uring_seq_info *info = seq->private;
> +	struct bpf_iter__io_uring_file ctx;
> +	struct bpf_iter_meta meta;
> +	struct bpf_prog *prog;
> +
> +	meta.seq = seq;
> +	prog = bpf_iter_get_info(&meta, in_stop);
> +	if (!prog)
> +		return 0;
> +
> +	ctx.meta = &meta;
> +	ctx.ctx = info->ctx;
> +	/* when we encounter empty slot, v will point to info */
> +	ctx.file = v == info ? NULL : v;
> +	ctx.index = info->index ? info->index - !in_stop : 0;
> +
> +	return bpf_iter_run_prog(prog, &ctx);
> +}
> +
> +static int bpf_io_uring_file_seq_show(struct seq_file *seq, void *v)
> +{
> +	return __bpf_io_uring_file_seq_show(seq, v, false);
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items
  2021-11-16  5:42 ` [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items Kumar Kartikeya Dwivedi
@ 2021-11-18 17:50   ` Yonghong Song
  0 siblings, 0 replies; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 17:50 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Alexander Viro, linux-fsdevel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, io-uring



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This patch adds eBPF iterator for epoll items (epitems) registered in an
> epoll instance. It gives access to the eventpoll ctx, and the registered
> epoll item (struct epitem). This allows the iterator to inspect the
> registered file and be able to use others iterators to associate it with
> a task's fdtable.
> 
> The primary usecase this is enabling is expediting existing eventpoll
> checkpoint/restore support in the CRIU project. This iterator allows us
> to switch from a worst case O(n^2) algorithm to a single O(n) pass over
> task and epoll registered descriptors.
> 
> We also make sure we're iterating over a live file, one that is not
> going away. The case we're concerned about is a file that has its
> f_count as zero, but is waiting for iterator bpf_seq_read to release
> ep->mtx, so that it can remove its epitem. Since such a file will
> disappear once iteration is done, and it is being destructed, we use
> get_file_rcu to ensure it is alive when invoking the BPF program.
> 
> Getting access to a file that is going to disappear after iteration
> is not useful anyway. This does have a performance overhead however
> (since file reference will be raised and dropped for each file).
> 
> The rcu_read_lock around get_file_rcu isn't strictly required for
> lifetime management since fput path is serialized on ep->mtx to call
> ep_remove, hence the epi->ffd.file pointer remains stable during our
> seq_start/seq_stop bracketing.
> 
> To be able to continue back from the position we were iterating, we
> store the epi->ffi.fd and use ep_find_tfd to find the target file again.
> It would be more appropriate to use both struct file pointer and fd
> number to find the last file, but see below for why that cannot be done.
> 
> Taking reference to struct file and walking RB-Tree to find it again
> will lead to reference cycle issue if the iterator after partial read
> takes reference to socket which later is used in creating a descriptor
> cycle using SCM_RIGHTS. An example that was encountered when working on
> this is mentioned below.
> 
>    Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
>    ITER.
>    Let SK1 be registered in EP, then on a partial read it is possible
>    that ITER returns from read and takes reference to SK1 to be able to
>    find it later in RB-Tree and continue the iteration.  If SK1 sends
>    ITER over to SK2 using SCM_RIGHTS, and SK2 sends SK2 over to SK1 using
>    SCM_RIGHTS, and both fds are not consumed on the corresponding receive
>    ends, a cycle is created.  When all of SK1, SK2, EP, and ITER are
>    closed, SK1's receive queue holds reference to SK2, and SK2's receive
>    queue holds reference to ITER, which holds a reference to SK1.
>    All file descriptors except EP leak.
> 
> To resolve it, we would need to hook into the Unix Socket GC mechanism,
> but the alternative of using ep_find_tfd is much more simpler. The
> finding of the last position in face of concurrent modification of the
> epoll set is at best an approximation anyway. For the case of CRIU, the
> epoll set remains stable.
> 
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
>   include/linux/bpf.h            |   5 +-
>   include/uapi/linux/bpf.h       |   3 +
>   tools/include/uapi/linux/bpf.h |   3 +
>   4 files changed, 205 insertions(+), 2 deletions(-)
> 
[...]
> +
> +static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
> +	.seq_ops          = &bpf_epoll_seq_ops,
> +	.init_seq_private = bpf_epoll_init_seq,
> +	.seq_priv_size    = sizeof(struct bpf_epoll_iter_seq_info),
> +};
> +
> +static struct bpf_iter_reg epoll_reg_info = {
> +	.target            = "epoll",
> +	.feature           = BPF_ITER_RESCHED,
> +	.attach_target     = bpf_epoll_iter_attach,
> +	.detach_target     = bpf_epoll_iter_detach,

implement show_fdinfo and fill_link_info?

There is some bpftool work needed as well to show the information
in user space. An example is e60495eafdba ("bpftool: Implement 
link_query for bpf iterators").


> +	.ctx_arg_info_size = 2,
> +	.ctx_arg_info = {
> +		{ offsetof(struct bpf_iter__epoll, ep),
> +		  PTR_TO_BTF_ID },
> +		{ offsetof(struct bpf_iter__epoll, epi),
> +		  PTR_TO_BTF_ID_OR_NULL },
> +	},
> +	.seq_info	   = &bpf_epoll_seq_info,
> +};
> +
> +static int __init epoll_iter_init(void)
> +{
> +	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
> +	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
> +	return bpf_iter_reg_target(&epoll_reg_info);
> +}
> +late_initcall(epoll_iter_init);
> +
> +#endif
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index fe7b499da781..eb1c9acdc40b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>   struct io_ring_ctx;
>   struct bpf_iter_aux_info {
>   	struct bpf_map *map;
> -	struct io_ring_ctx *ctx;
> +	union {
> +		struct io_ring_ctx *ctx;
> +		struct file *ep;
> +	};

You changed this to a union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info".
This should make the code simpler and easier to understand.
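
i.e., as a sketch (all existing users of the struct would need to be
adjusted as well):

	union bpf_iter_aux_info {
		struct bpf_map *map;
		struct io_ring_ctx *ctx;
		struct file *ep;
	};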

>   };
>   
>   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index b70e9da3d722..64e18c1dcfca 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -94,6 +94,9 @@ union bpf_iter_link_info {
>   	struct {
>   		__u32   io_uring_fd;
>   	} io_uring;
> +	struct {
> +		__u32   epoll_fd;
> +	} epoll;
>   };
>   
>   /* BPF syscall commands, see bpf(2) man-page for more details. */
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators
  2021-11-16  5:42 ` [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators Kumar Kartikeya Dwivedi
@ 2021-11-18 17:54   ` Yonghong Song
  2021-11-18 18:33     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 17:54 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, bpf
  Cc: Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel



On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> This exercises the io_uring_buf and io_uring_file iterators, and tests
> sparse file sets as well.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   .../selftests/bpf/prog_tests/bpf_iter.c       | 226 ++++++++++++++++++
>   .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
>   2 files changed, 276 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> index 3e10abce3e5a..baf11fe2f88d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> @@ -1,6 +1,9 @@
>   // SPDX-License-Identifier: GPL-2.0
>   /* Copyright (c) 2020 Facebook */
> +#include <sys/mman.h>
>   #include <test_progs.h>
> +#include <linux/io_uring.h>
> +
>   #include "bpf_iter_ipv6_route.skel.h"
>   #include "bpf_iter_netlink.skel.h"
>   #include "bpf_iter_bpf_map.skel.h"
> @@ -26,6 +29,7 @@
>   #include "bpf_iter_bpf_sk_storage_map.skel.h"
>   #include "bpf_iter_test_kern5.skel.h"
>   #include "bpf_iter_test_kern6.skel.h"
> +#include "bpf_iter_io_uring.skel.h"
>   
>   static int duration;
>   
> @@ -1239,6 +1243,224 @@ static void test_task_vma(void)
>   	bpf_iter_task_vma__destroy(skel);
>   }
>   
> +static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
> +{
> +	return syscall(__NR_io_uring_setup, entries, p);
> +}
> +
> +static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
> +{
> +	return syscall(__NR_io_uring_register, io_uring_fd,
> +		       IORING_REGISTER_BUFFERS, iovs, nr);
> +}
> +
> +static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
> +{
> +	return syscall(__NR_io_uring_register, io_uring_fd,
> +		       IORING_REGISTER_FILES, fds, nr);
> +}
> +
> +static unsigned long long page_addr_to_pfn(unsigned long addr)
> +{
> +	int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
> +	unsigned long long pfn;
> +
> +	if (page_size < 0)
> +		return 0;
> +	fd = open("/proc/self/pagemap", O_RDONLY);
> +	if (fd < 0)
> +		return 0;
> +
> +	ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
> +	close(fd);
> +	if (ret < 0)
> +		return 0;
> +	/* Bits 0-54 have PFN for non-swapped page */
> +	return pfn & 0x7fffffffffffff;
> +}
> +
> +void test_io_uring_buf(void)
> +{
> +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +	char rbuf[4096], buf[4096] = "B\n";
> +	union bpf_iter_link_info linfo;
> +	struct bpf_iter_io_uring *skel;
> +	int ret, fd, i, len = 128;
> +	struct io_uring_params p;
> +	struct iovec iovs[8];
> +	int iter_fd;
> +	char *str;
> +
> +	opts.link_info = &linfo;
> +	opts.link_info_len = sizeof(linfo);
> +
> +	skel = bpf_iter_io_uring__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
> +		return;
> +
> +	for (i = 0; i < ARRAY_SIZE(iovs); i++) {
> +		iovs[i].iov_len	 = len;
> +		iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +					MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> +		if (iovs[i].iov_base == MAP_FAILED)
> +			goto end;
> +		len *= 2;
> +	}
> +
> +	memset(&p, 0, sizeof(p));
> +	fd = sys_io_uring_setup(1, &p);
> +	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
> +		goto end;
> +
> +	linfo.io_uring.io_uring_fd = fd;
> +	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
> +								 &opts);
> +	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
> +		goto end_close_fd;
> +
> +	ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
> +	if (!ASSERT_OK(ret, "io_uring_register_bufs"))
> +		goto end_close_fd;
> +
> +	/* "B\n" */
> +	len = 2;
> +	str = buf + len;
> +	for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
> +		ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
> +			       (unsigned long)iovs[j].iov_base,
> +			       iovs[j].iov_len);
> +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> +			goto end_close_fd;
> +		len += ret;
> +		str += ret;
> +
> +		ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
> +			       page_addr_to_pfn((unsigned long)iovs[j].iov_base));
> +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> +			goto end_close_fd;
> +		len += ret;
> +		str += ret;
> +	}
> +
> +	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
> +	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> +		goto end_close_fd;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
> +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
> +		goto end_close_fd;
> +
> +	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
> +	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
> +		goto end_close_iter;
> +
> +	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
> +
> +	puts("=== Expected Output ===");
> +	printf("%s", buf);
> +	puts("==== Actual Output ====");
> +	printf("%s", rbuf);
> +	puts("=======================");

Maybe you can do an actual comparison and use ASSERT_* macros to check 
result?

> +
> +end_close_iter:
> +	close(iter_fd);
> +end_close_fd:
> +	close(fd);
> +end:
> +	while (i--)
> +		munmap(iovs[i].iov_base, iovs[i].iov_len);
> +	bpf_iter_io_uring__destroy(skel);
> +}
> +
> +void test_io_uring_file(void)
> +{
> +	int reg_files[] = { [0 ... 7] = -1 };
> +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +	char buf[4096] = "B\n", rbuf[4096] = {}, *str;
> +	union bpf_iter_link_info linfo = {};
> +	struct bpf_iter_io_uring *skel;
> +	int iter_fd, fd, len = 0, ret;
> +	struct io_uring_params p;
> +
> +	opts.link_info = &linfo;
> +	opts.link_info_len = sizeof(linfo);
> +
> +	skel = bpf_iter_io_uring__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
> +		return;
> +
> +	/* "B\n" */
> +	len = 2;
> +	str = buf + len;
> +	ret = snprintf(str, sizeof(buf) - len, "B\n");
> +	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
> +		char templ[] = "/tmp/io_uringXXXXXX";
> +		const char *name, *def = "<none>";
> +
> +		/* create sparse set */
> +		if (i & 1) {
> +			name = def;
> +		} else {
> +			reg_files[i] = mkstemp(templ);
> +			if (!ASSERT_GE(reg_files[i], 0, templ))
> +				goto end_close_reg_files;
> +			name = templ;
> +			ASSERT_OK(unlink(name), "unlink");
> +		}
> +		ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? " (deleted)" : "");
> +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> +			goto end_close_reg_files;
> +		len += ret;
> +		str += ret;
> +	}
> +
> +	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files));
> +	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> +		goto end_close_reg_files;
> +
> +	memset(&p, 0, sizeof(p));
> +	fd = sys_io_uring_setup(1, &p);
> +	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
> +		goto end_close_reg_files;
> +
> +	linfo.io_uring.io_uring_fd = fd;
> +	skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
> +								  &opts);
> +	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter"))
> +		goto end_close_fd;
> +
> +	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
> +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
> +		goto end;
> +
> +	ret = io_uring_register_files(fd, reg_files, ARRAY_SIZE(reg_files));
> +	if (!ASSERT_OK(ret, "io_uring_register_files"))
> +		goto end_iter_fd;
> +
> +	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
> +	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
> +		goto end_iter_fd;
> +
> +	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
> +
> +	puts("=== Expected Output ===");
> +	printf("%s", buf);
> +	puts("==== Actual Output ====");
> +	printf("%s", rbuf);
> +	puts("=======================");

The same as above.

> +end_iter_fd:
> +	close(iter_fd);
> +end_close_fd:
> +	close(fd);
> +end_close_reg_files:
> +	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
> +		if (reg_files[i] != -1)
> +			close(reg_files[i]);
> +	}
> +end:
> +	bpf_iter_io_uring__destroy(skel);
> +}
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-18 17:21   ` Yonghong Song
@ 2021-11-18 18:28     ` Kumar Kartikeya Dwivedi
  2021-11-18 19:13       ` Yonghong Song
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-18 18:28 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>
>
> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > This change adds eBPF iterator for buffers registered in io_uring ctx.
> > It gives access to the ctx, the index of the registered buffer, and a
> > pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance, that isn't easy
> > to export using the fdinfo interface (like exact struct page composing
> > the registered buffer).
> >
> > The primary usecase this is enabling is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > seq_start functions, as the order of lock taken is opposite of what it
> > would be when io_uring operation reads the same file.  We take
> > seq_file->lock, then ctx->uring_lock, while io_uring would first take
> > ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> >        CPU 0				CPU 1
> >
> >        vfs_read
> >        mutex_lock(&seq_file->lock)	io_read
> > 					mutex_lock(&ctx->uring_lock)
> >        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> > 					mutex_lock(&seq_file->lock)
>
> It is not clear which mutex_lock switched to mutex_trylock.

The one in vfs_read.

> From below example, it looks like &ctx->uring_lock. But if this is
> the case, we could have deadlock, right?
>

Sorry, I will make the commit message clearer in the next version.

The sequence on CPU 0 is for a normal read(2) on the iterator.
For CPU 1, it is an io_uring instance trying to do the same on an iterator
attached to itself.

So CPU 0 does

sys_read
vfs_read
 bpf_seq_read
 mutex_lock(&seq_file->lock)    # A
  io_uring_buf_seq_start
  mutex_lock(&ctx->uring_lock)  # B

and CPU 1 does

io_uring_enter
mutex_lock(&ctx->uring_lock)    # B
 io_read
  bpf_seq_read
  mutex_lock(&seq_file->lock)   # A
  ...

Since the order of locks is opposite, it can deadlock. So I switched the
mutex_lock in io_uring_buf_seq_start to a trylock, which returns an error in
this case; the read path then releases seq_file->lock and CPU 1 can make
progress.

The second problem without use of trylock is described below (for the same
case of io_uring reading from an iterator attached to itself).

Let me know if I missed something.
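
As an aside, from userspace the contended (or recursive) case simply surfaces
as a failed read(2) on the iterator fd. A minimal sketch of how a consumer
might handle it; the retry policy is hypothetical:

	char buf[4096];
	ssize_t n;

	n = read(iter_fd, buf, sizeof(buf));
	if (n < 0 && errno == EDEADLK) {
		/* ctx->uring_lock was contended or the read was recursive;
		 * seq_file->lock has been dropped, so back off and retry.
		 */
	}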

> >
> > The trylock also protects the case where io_uring tries to read from
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> >   io_uring_enter
> >    mutex_lock(&ctx->uring_lock) <-----------.
> >    io_read				    \
> >     seq_read				     \
> >      mutex_lock(&seq_file->lock)		     /
> >      mutex_lock(&ctx->uring_lock) # deadlock-`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if VMA is
> > destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series however, but builds upon this
> > iterator.
> >
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Pavel Begunkov <asml.silence@gmail.com>
> > Cc: io-uring@vger.kernel.org
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
> >   include/linux/bpf.h            |   2 +
> >   include/uapi/linux/bpf.h       |   3 +
> >   tools/include/uapi/linux/bpf.h |   3 +
> >   4 files changed, 187 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index b07196b4511c..46a110989155 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -81,6 +81,7 @@
> >   #include <linux/tracehook.h>
> >   #include <linux/audit.h>
> >   #include <linux/security.h>
> > +#include <linux/btf_ids.h>
> >   #define CREATE_TRACE_POINTS
> >   #include <trace/events/io_uring.h>
> > @@ -11125,3 +11126,181 @@ static int __init io_uring_init(void)
> >   	return 0;
> >   };
> >   __initcall(io_uring_init);
> > +
> > +#ifdef CONFIG_BPF_SYSCALL
> > +
> > +BTF_ID_LIST(btf_io_uring_ids)
> > +BTF_ID(struct, io_ring_ctx)
> > +BTF_ID(struct, io_mapped_ubuf)
> > +
> > +struct bpf_io_uring_seq_info {
> > +	struct io_ring_ctx *ctx;
> > +	unsigned long index;
> > +};
> > +
> > +static int bpf_io_uring_init_seq(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > +	struct bpf_io_uring_seq_info *info = priv_data;
> > +	struct io_ring_ctx *ctx = aux->ctx;
> > +
> > +	info->ctx = ctx;
> > +	return 0;
> > +}
> > +
> > +static int bpf_io_uring_iter_attach(struct bpf_prog *prog,
> > +				    union bpf_iter_link_info *linfo,
> > +				    struct bpf_iter_aux_info *aux)
> > +{
> > +	struct io_ring_ctx *ctx;
> > +	struct fd f;
> > +	int ret;
> > +
> > +	f = fdget(linfo->io_uring.io_uring_fd);
> > +	if (unlikely(!f.file))
> > +		return -EBADF;
> > +
> > +	ret = -EOPNOTSUPP;
> > +	if (unlikely(f.file->f_op != &io_uring_fops))
> > +		goto out_fput;
> > +
> > +	ret = -ENXIO;
> > +	ctx = f.file->private_data;
> > +	if (unlikely(!percpu_ref_tryget(&ctx->refs)))
> > +		goto out_fput;
> > +
> > +	ret = 0;
> > +	aux->ctx = ctx;
> > +
> > +out_fput:
> > +	fdput(f);
> > +	return ret;
> > +}
> > +
> > +static void bpf_io_uring_iter_detach(struct bpf_iter_aux_info *aux)
> > +{
> > +	percpu_ref_put(&aux->ctx->refs);
> > +}
> > +
> > +/* io_uring iterator for registered buffers */
> > +
> > +struct bpf_iter__io_uring_buf {
> > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +	__bpf_md_ptr(struct io_ring_ctx *, ctx);
> > +	__bpf_md_ptr(struct io_mapped_ubuf *, ubuf);
> > +	unsigned long index;
> > +};
>
> I would suggest to change "unsigned long index" to either u32 or u64.
> This structure is also the bpf program context and in bpf program context,
> "index" will be u64. Then on 32bit system, we potentially
> could have issues.
>

Ack, will do.

> > +
> > +static void *__bpf_io_uring_buf_seq_get_next(struct bpf_io_uring_seq_info *info)
> > +{
> > +	if (info->index < info->ctx->nr_user_bufs)
> > +		return info->ctx->user_bufs[info->index++];
> > +	return NULL;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct bpf_io_uring_seq_info *info = seq->private;
> > +	struct io_mapped_ubuf *ubuf;
> > +
> > +	/* Indicate to userspace that the uring lock is contended */
> > +	if (!mutex_trylock(&info->ctx->uring_lock))
> > +		return ERR_PTR(-EDEADLK);
> > +
> > +	ubuf = __bpf_io_uring_buf_seq_get_next(info);
> > +	if (!ubuf)
> > +		return NULL;
> > +
> > +	if (*pos == 0)
> > +		++*pos;
> > +	return ubuf;
> > +}
> > +
> > +static void *bpf_io_uring_buf_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +	struct bpf_io_uring_seq_info *info = seq->private;
> > +
> > +	++*pos;
> > +	return __bpf_io_uring_buf_seq_get_next(info);
> > +}
> > +
> > +DEFINE_BPF_ITER_FUNC(io_uring_buf, struct bpf_iter_meta *meta,
> > +		     struct io_ring_ctx *ctx, struct io_mapped_ubuf *ubuf,
> > +		     unsigned long index)
>
> Again, change "unsigned long" to "u32" or "u64".
>

Ack.

> > [...]
> > +static struct bpf_iter_reg io_uring_buf_reg_info = {
> > +	.target            = "io_uring_buf",
> > +	.feature	   = BPF_ITER_RESCHED,
> > +	.attach_target     = bpf_io_uring_iter_attach,
> > +	.detach_target     = bpf_io_uring_iter_detach,
>
> Since you have this extra `io_uring_fd` for the iterator, you may want
> to implement show_fdinfo and fill_link_info callback functions here.
>

Ack, but some questions:

What should it have? E.g. it is easy to go from map_id to a map fd if one
wants access to the map attached to the iterator, but I am not sure how one
can obtain more information about the target fd from the io_uring or epoll
iterators.

Should I delegate to their show_fdinfo and dump using that?

The type/target is already available in link_info; I am not sure what other
useful information could be added there that would allow obtaining the
io_uring/epoll fd.

> > +	.ctx_arg_info_size = 2,
> > +	.ctx_arg_info = {
> > +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
> > +		  PTR_TO_BTF_ID },
> > +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
> > +		  PTR_TO_BTF_ID_OR_NULL },
> > +	},
> > +	.seq_info	   = &bpf_io_uring_buf_seq_info,
> > +};
> > +
> > +static int __init io_uring_iter_init(void)
> > +{
> > +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> > +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
> > +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
> > +}
> > +late_initcall(io_uring_iter_init);
> > +
> > +#endif
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 56098c866704..ddb9d4520a3f 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >   	extern int bpf_iter_ ## target(args);			\
> >   	int __init bpf_iter_ ## target(args) { return 0; }
> > +struct io_ring_ctx;
> >   struct bpf_iter_aux_info {
> >   	struct bpf_map *map;
> > +	struct io_ring_ctx *ctx;
> >   };
>
> Can we use union here? Note that below bpf_iter_link_info in
> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>

So the reason I didn't use a union was the link->aux.map check in
bpf_iter.c::__get_seq_info. Even if we switch to a union in bpf_iter_aux_info,
that code needs some way to determine whether the link is for a map type, so
maybe a string comparison there? Leaving it out of the union felt cleaner;
also, I move both io_ring_ctx and the eventpoll file into a union in a later
patch.
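
For context, __get_seq_info currently looks roughly like the below
(paraphrased from bpf_iter.c, so treat the details as approximate); the
aux.map check is what a union would break:

	static const struct bpf_iter_seq_info *
	__get_seq_info(struct bpf_iter_link *link)
	{
		const struct bpf_iter_seq_info *seq_info;

		/* map element iterators take their seq_info from the map
		 * ops, not from the iterator registration
		 */
		if (link->aux.map) {
			seq_info = link->aux.map->ops->iter_seq_info;
			if (seq_info)
				return seq_info;
		}

		return link->tinfo->reg_info->seq_info;
	}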

> >   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> >   	struct {
> >   		__u32	map_fd;
> >   	} map;
> > +	struct {
> > +		__u32   io_uring_fd;
> > +	} io_uring;
> >   };
> >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 6297eafdc40f..3323defa99a1 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -91,6 +91,9 @@ union bpf_iter_link_info {
> >   	struct {
> >   		__u32	map_fd;
> >   	} map;
> > +	struct {
> > +		__u32   io_uring_fd;
> > +	} io_uring;
> >   };
> >   /* BPF syscall commands, see bpf(2) man-page for more details. */
> >

--
Kartikeya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-18 17:27   ` Yonghong Song
@ 2021-11-18 18:30     ` Kumar Kartikeya Dwivedi
  2021-11-18 19:18       ` Yonghong Song
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-18 18:30 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

On Thu, Nov 18, 2021 at 10:57:15PM IST, Yonghong Song wrote:
>
>
> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > In CRIU, we need to be able to determine whether the page pinned by
> > io_uring is still present in the same range in the process VMA.
> > /proc/<pid>/pagemap gives us the PFN, hence using this helper we can
> > establish this mapping easily from the iterator side.
> >
> > It is a simple wrapper over the in-kernel page_to_pfn helper, and
> > ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is
> > obtained from the bvec of io_uring_ubuf for the CRIU usecase.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   fs/io_uring.c                  | 17 +++++++++++++++++
> >   include/linux/bpf.h            |  1 +
> >   include/uapi/linux/bpf.h       |  9 +++++++++
> >   kernel/trace/bpf_trace.c       |  2 ++
> >   scripts/bpf_doc.py             |  2 ++
> >   tools/include/uapi/linux/bpf.h |  9 +++++++++
> >   6 files changed, 40 insertions(+)
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 46a110989155..9e9df6767e29 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -11295,6 +11295,23 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
> >   	.seq_info	   = &bpf_io_uring_buf_seq_info,
> >   };
> > +BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
> > +{
> > +	/* PTR_TO_BTF_ID can be NULL */
> > +	if (!page)
> > +		return U64_MAX;
> > +	return page_to_pfn(page);
> > +}
> > +
> > +BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
> > +
> > +const struct bpf_func_proto bpf_page_to_pfn_proto = {
> > +	.func		= bpf_page_to_pfn,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_PTR_TO_BTF_ID,
> > +	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
>
> Does this helper need to be gpl_only?
>

Not sure about it, it wraps over a macro.

> > +};
> > +
> >   static int __init io_uring_iter_init(void)
> >   {
> >   	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
> [...]

--
Kartikeya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators
  2021-11-18 17:54   ` Yonghong Song
@ 2021-11-18 18:33     ` Kumar Kartikeya Dwivedi
  2021-11-18 19:21       ` Yonghong Song
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-18 18:33 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Thu, Nov 18, 2021 at 11:24:19PM IST, Yonghong Song wrote:
>
>
> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > This exercises the io_uring_buf and io_uring_file iterators, and tests
> > sparse file sets as well.
> >
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: Pavel Begunkov <asml.silence@gmail.com>
> > Cc: io-uring@vger.kernel.org
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   .../selftests/bpf/prog_tests/bpf_iter.c       | 226 ++++++++++++++++++
> >   .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
> >   2 files changed, 276 insertions(+)
> >   create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> > index 3e10abce3e5a..baf11fe2f88d 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> > @@ -1,6 +1,9 @@
> >   // SPDX-License-Identifier: GPL-2.0
> >   /* Copyright (c) 2020 Facebook */
> > +#include <sys/mman.h>
> >   #include <test_progs.h>
> > +#include <linux/io_uring.h>
> > +
> >   #include "bpf_iter_ipv6_route.skel.h"
> >   #include "bpf_iter_netlink.skel.h"
> >   #include "bpf_iter_bpf_map.skel.h"
> > @@ -26,6 +29,7 @@
> >   #include "bpf_iter_bpf_sk_storage_map.skel.h"
> >   #include "bpf_iter_test_kern5.skel.h"
> >   #include "bpf_iter_test_kern6.skel.h"
> > +#include "bpf_iter_io_uring.skel.h"
> >   static int duration;
> > @@ -1239,6 +1243,224 @@ static void test_task_vma(void)
> >   	bpf_iter_task_vma__destroy(skel);
> >   }
> > +static int sys_io_uring_setup(u32 entries, struct io_uring_params *p)
> > +{
> > +	return syscall(__NR_io_uring_setup, entries, p);
> > +}
> > +
> > +static int io_uring_register_bufs(int io_uring_fd, struct iovec *iovs, unsigned int nr)
> > +{
> > +	return syscall(__NR_io_uring_register, io_uring_fd,
> > +		       IORING_REGISTER_BUFFERS, iovs, nr);
> > +}
> > +
> > +static int io_uring_register_files(int io_uring_fd, int *fds, unsigned int nr)
> > +{
> > +	return syscall(__NR_io_uring_register, io_uring_fd,
> > +		       IORING_REGISTER_FILES, fds, nr);
> > +}
> > +
> > +static unsigned long long page_addr_to_pfn(unsigned long addr)
> > +{
> > +	int page_size = sysconf(_SC_PAGE_SIZE), fd, ret;
> > +	unsigned long long pfn;
> > +
> > +	if (page_size < 0)
> > +		return 0;
> > +	fd = open("/proc/self/pagemap", O_RDONLY);
> > +	if (fd < 0)
> > +		return 0;
> > +
> > +	ret = pread(fd, &pfn, sizeof(pfn), (addr / page_size) * 8);
> > +	close(fd);
> > +	if (ret < 0)
> > +		return 0;
> > +	/* Bits 0-54 have PFN for non-swapped page */
> > +	return pfn & 0x7fffffffffffff;
> > +}
> > +
> > +void test_io_uring_buf(void)
> > +{
> > +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > +	char rbuf[4096], buf[4096] = "B\n";
> > +	union bpf_iter_link_info linfo;
> > +	struct bpf_iter_io_uring *skel;
> > +	int ret, fd, i, len = 128;
> > +	struct io_uring_params p;
> > +	struct iovec iovs[8];
> > +	int iter_fd;
> > +	char *str;
> > +
> > +	opts.link_info = &linfo;
> > +	opts.link_info_len = sizeof(linfo);
> > +
> > +	skel = bpf_iter_io_uring__open_and_load();
> > +	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
> > +		return;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(iovs); i++) {
> > +		iovs[i].iov_len	 = len;
> > +		iovs[i].iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
> > +					MAP_ANONYMOUS | MAP_SHARED, -1, 0);
> > +		if (iovs[i].iov_base == MAP_FAILED)
> > +			goto end;
> > +		len *= 2;
> > +	}
> > +
> > +	memset(&p, 0, sizeof(p));
> > +	fd = sys_io_uring_setup(1, &p);
> > +	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
> > +		goto end;
> > +
> > +	linfo.io_uring.io_uring_fd = fd;
> > +	skel->links.dump_io_uring_buf = bpf_program__attach_iter(skel->progs.dump_io_uring_buf,
> > +								 &opts);
> > +	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_buf, "bpf_program__attach_iter"))
> > +		goto end_close_fd;
> > +
> > +	ret = io_uring_register_bufs(fd, iovs, ARRAY_SIZE(iovs));
> > +	if (!ASSERT_OK(ret, "io_uring_register_bufs"))
> > +		goto end_close_fd;
> > +
> > +	/* "B\n" */
> > +	len = 2;
> > +	str = buf + len;
> > +	for (int j = 0; j < ARRAY_SIZE(iovs); j++) {
> > +		ret = snprintf(str, sizeof(buf) - len, "%d:0x%lx:%zu\n", j,
> > +			       (unsigned long)iovs[j].iov_base,
> > +			       iovs[j].iov_len);
> > +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> > +			goto end_close_fd;
> > +		len += ret;
> > +		str += ret;
> > +
> > +		ret = snprintf(str, sizeof(buf) - len, "`-PFN for bvec[0]=%llu\n",
> > +			       page_addr_to_pfn((unsigned long)iovs[j].iov_base));
> > +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> > +			goto end_close_fd;
> > +		len += ret;
> > +		str += ret;
> > +	}
> > +
> > +	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
> > +	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> > +		goto end_close_fd;
> > +
> > +	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
> > +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
> > +		goto end_close_fd;
> > +
> > +	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
> > +	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
> > +		goto end_close_iter;
> > +
> > +	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
> > +
> > +	puts("=== Expected Output ===");
> > +	printf("%s", buf);
> > +	puts("==== Actual Output ====");
> > +	printf("%s", rbuf);
> > +	puts("=======================");
>
> Maybe you can do an actual comparison and use ASSERT_* macros to check
> result?
>

I already did that in the line above the first "puts". The printing is just
for better debugging, to show the incorrect output in case the test fails.
Also, in the epoll test in the next patch the order of entries is not fixed,
since they are sorted by struct file pointer.

> > +
> > +end_close_iter:
> > +	close(iter_fd);
> > +end_close_fd:
> > +	close(fd);
> > +end:
> > +	while (i--)
> > +		munmap(iovs[i].iov_base, iovs[i].iov_len);
> > +	bpf_iter_io_uring__destroy(skel);
> > +}
> > +
> > +void test_io_uring_file(void)
> > +{
> > +	int reg_files[] = { [0 ... 7] = -1 };
> > +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> > +	char buf[4096] = "B\n", rbuf[4096] = {}, *str;
> > +	union bpf_iter_link_info linfo = {};
> > +	struct bpf_iter_io_uring *skel;
> > +	int iter_fd, fd, len = 0, ret;
> > +	struct io_uring_params p;
> > +
> > +	opts.link_info = &linfo;
> > +	opts.link_info_len = sizeof(linfo);
> > +
> > +	skel = bpf_iter_io_uring__open_and_load();
> > +	if (!ASSERT_OK_PTR(skel, "bpf_iter_io_uring__open_and_load"))
> > +		return;
> > +
> > +	/* "B\n" */
> > +	len = 2;
> > +	str = buf + len;
> > +	ret = snprintf(str, sizeof(buf) - len, "B\n");
> > +	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
> > +		char templ[] = "/tmp/io_uringXXXXXX";
> > +		const char *name, *def = "<none>";
> > +
> > +		/* create sparse set */
> > +		if (i & 1) {
> > +			name = def;
> > +		} else {
> > +			reg_files[i] = mkstemp(templ);
> > +			if (!ASSERT_GE(reg_files[i], 0, templ))
> > +				goto end_close_reg_files;
> > +			name = templ;
> > +			ASSERT_OK(unlink(name), "unlink");
> > +		}
> > +		ret = snprintf(str, sizeof(buf) - len, "%d:%s%s\n", i, name, name != def ? " (deleted)" : "");
> > +		if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> > +			goto end_close_reg_files;
> > +		len += ret;
> > +		str += ret;
> > +	}
> > +
> > +	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(reg_files));
> > +	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
> > +		goto end_close_reg_files;
> > +
> > +	memset(&p, 0, sizeof(p));
> > +	fd = sys_io_uring_setup(1, &p);
> > +	if (!ASSERT_GE(fd, 0, "io_uring_setup"))
> > +		goto end_close_reg_files;
> > +
> > +	linfo.io_uring.io_uring_fd = fd;
> > +	skel->links.dump_io_uring_file = bpf_program__attach_iter(skel->progs.dump_io_uring_file,
> > +								  &opts);
> > +	if (!ASSERT_OK_PTR(skel->links.dump_io_uring_file, "bpf_program__attach_iter"))
> > +		goto end_close_fd;
> > +
> > +	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_file));
> > +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
> > +		goto end;
> > +
> > +	ret = io_uring_register_files(fd, reg_files, ARRAY_SIZE(reg_files));
> > +	if (!ASSERT_OK(ret, "io_uring_register_files"))
> > +		goto end_iter_fd;
> > +
> > +	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
> > +	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer(iterator_fd, buf)"))
> > +		goto end_iter_fd;
> > +
> > +	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
> > +
> > +	puts("=== Expected Output ===");
> > +	printf("%s", buf);
> > +	puts("==== Actual Output ====");
> > +	printf("%s", rbuf);
> > +	puts("=======================");
>
> The same as above.
>
> > +end_iter_fd:
> > +	close(iter_fd);
> > +end_close_fd:
> > +	close(fd);
> > +end_close_reg_files:
> > +	for (int i = 0; i < ARRAY_SIZE(reg_files); i++) {
> > +		if (reg_files[i] != -1)
> > +			close(reg_files[i]);
> > +	}
> > +end:
> > +	bpf_iter_io_uring__destroy(skel);
> > +}
> [...]

--
Kartikeya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-18 18:28     ` Kumar Kartikeya Dwivedi
@ 2021-11-18 19:13       ` Yonghong Song
  0 siblings, 0 replies; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 19:13 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel



On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This change adds eBPF iterator for buffers registered in io_uring ctx.
>>> It gives access to the ctx, the index of the registered buffer, and a
>>> pointer to the io_uring_ubuf itself. This allows the iterator to save
>>> info related to buffers added to an io_uring instance, that isn't easy
>>> to export using the fdinfo interface (like exact struct page composing
>>> the registered buffer).
>>>
>>> The primary usecase this is enabling is checkpoint/restore support.
>>>
>>> Note that we need to use mutex_trylock when the file is read from, in
>>> seq_start functions, as the order of lock taken is opposite of what it
>>> would be when io_uring operation reads the same file.  We take
>>> seq_file->lock, then ctx->uring_lock, while io_uring would first take
>>> ctx->uring_lock and then seq_file->lock for the same ctx.
>>>
>>> This can lead to a deadlock scenario described below:
>>>
>>>         CPU 0				CPU 1
>>>
>>>         vfs_read
>>>         mutex_lock(&seq_file->lock)	io_read
>>> 					mutex_lock(&ctx->uring_lock)
>>>         mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>>> 					mutex_lock(&seq_file->lock)
>>
>> It is not clear which mutex_lock switched to mutex_trylock.
> 
> The one in vfs_read.
> 
>>  From below example, it looks like &ctx->uring_lock. But if this is
>> the case, we could have deadlock, right?
>>
> 
> Sorry, I will make the commit message clearer in the next version.
> 
> The sequence on CPU 0 is for a normal read(2) on the iterator.
> For CPU 1, it is an io_uring instance trying to do the same on an iterator
> attached to itself.
> 
> So CPU 0 does
> 
> sys_read
> vfs_read
>   bpf_seq_read
>   mutex_lock(&seq_file->lock)    # A
>    io_uring_buf_seq_start
>    mutex_lock(&ctx->uring_lock)  # B
> 
> and CPU 1 does
> 
> io_uring_enter
> mutex_lock(&ctx->uring_lock)    # B
>   io_read
>    bpf_seq_read
>    mutex_lock(&seq_file->lock)   # A
>    ...
> 
> Since the order of locks is opposite, it can deadlock. So I switched the
> mutex_lock in io_uring_buf_seq_start to a trylock, which returns an error in
> this case; the read path then releases seq_file->lock and CPU 1 can make
> progress.
> 
> The second problem without use of trylock is described below (for the same
> case of io_uring reading from an iterator attached to itself).
> 
> Let me know if I missed something.

Thanks for the explanation. The above diagram is much better.

> 
>>>
>>> The trylock also protects the case where io_uring tries to read from
>>> iterator attached to itself (same ctx), where the order of locks would
>>> be:
>>>    io_uring_enter
>>>     mutex_lock(&ctx->uring_lock) <-----------.
>>>     io_read				    \
>>>      seq_read				     \
>>>       mutex_lock(&seq_file->lock)		     /
>>>       mutex_lock(&ctx->uring_lock) # deadlock-`
>>>
>>> In both these cases (recursive read and contended uring_lock), -EDEADLK
>>> is returned to userspace.
>>>
>>> In the future, this iterator will be extended to directly support
>>> iteration of bvec Flexible Array Member, so that when there is no
>>> corresponding VMA that maps to the registered buffer (e.g. if VMA is
>>> destroyed after pinning pages), we are able to reconstruct the
>>> registration on restore by dumping the page contents and then replaying
>>> them into a temporary mapping used for registration later. All this is
>>> out of scope for the current series however, but builds upon this
>>> iterator.
>>>
>>> Cc: Jens Axboe <axboe@kernel.dk>
>>> Cc: Pavel Begunkov <asml.silence@gmail.com>
>>> Cc: io-uring@vger.kernel.org
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> ---
>>>    fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
>>>    include/linux/bpf.h            |   2 +
>>>    include/uapi/linux/bpf.h       |   3 +
>>>    tools/include/uapi/linux/bpf.h |   3 +
>>>    4 files changed, 187 insertions(+)
>>>
[...]
> 
>>> [...]
>>> +static struct bpf_iter_reg io_uring_buf_reg_info = {
>>> +	.target            = "io_uring_buf",
>>> +	.feature	   = BPF_ITER_RESCHED,
>>> +	.attach_target     = bpf_io_uring_iter_attach,
>>> +	.detach_target     = bpf_io_uring_iter_detach,
>>
>> Since you have this extra `io_uring_fd` for the iterator, you may want
>> to implement show_fdinfo and fill_link_info callback functions here.
>>
> 
> Ack, but some questions:
> 
> What should it have? E.g. it is easy to go from map_id to a map fd if one
> wants access to the map attached to the iterator, but I am not sure how one
> can obtain more information about the target fd from the io_uring or epoll
> iterators.

Just to be clear, I am talking about uapi struct bpf_link_info.
I agree that fd is not really useful. So I guess it is up to you
whether you want to show fd to user or not. We can always
add it later if needed.
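
For reference, a rough sketch of how that could look in the iter section of
the uapi struct bpf_link_info (the io_uring member and its field name are
hypothetical, not something the posted series adds):

	/* inside struct bpf_link_info */
	struct {
		__aligned_u64	target_name;		/* in/out */
		__u32		target_name_len;
		union {
			struct {
				__u32	map_id;
			} map;
			struct {
				__u32	io_uring_id;	/* hypothetical */
			} io_uring;
		};
	} iter;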

> 
> Should I delegate to their show_fdinfo and dump using that?
> 
> The type/target is already available in link_info; I am not sure what other
> useful information could be added there that would allow obtaining the
> io_uring/epoll fd.
> 
>>> +	.ctx_arg_info_size = 2,
>>> +	.ctx_arg_info = {
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
>>> +		  PTR_TO_BTF_ID },
>>> +		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
>>> +		  PTR_TO_BTF_ID_OR_NULL },
>>> +	},
>>> +	.seq_info	   = &bpf_io_uring_buf_seq_info,
>>> +};
>>> +
>>> +static int __init io_uring_iter_init(void)
>>> +{
>>> +	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>>> +	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
>>> +	return bpf_iter_reg_target(&io_uring_buf_reg_info);
>>> +}
>>> +late_initcall(io_uring_iter_init);
>>> +
>>> +#endif
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index 56098c866704..ddb9d4520a3f 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
>>>    	extern int bpf_iter_ ## target(args);			\
>>>    	int __init bpf_iter_ ## target(args) { return 0; }
>>> +struct io_ring_ctx;
>>>    struct bpf_iter_aux_info {
>>>    	struct bpf_map *map;
>>> +	struct io_ring_ctx *ctx;
>>>    };
>>
>> Can we use union here? Note that below bpf_iter_link_info in
>> uapi/linux/bpf.h, map_fd and io_uring_fd is also an union.
>>
> 
> So the reason I didn't use a union was the link->aux.map check in
> bpf_iter.c::__get_seq_info. Even if we switch to a union in
> bpf_iter_aux_info, that code needs some way to determine whether the link is
> for a map type, so maybe a string comparison there? Leaving it out of the
> union felt cleaner; also, I move both io_ring_ctx and the eventpoll file
> into a union in a later patch.

I see. The seq_ops for the map element iterator is different from the others:
it comes not from the main iter registration but from the map_ops.

I think your change is okay. But maybe add a comment to explain why
map is different from the others in struct bpf_iter_aux_info.
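
Something along these lines, perhaps (a sketch combining such a comment with
the union from the later patch; the union member names are illustrative):

	struct io_ring_ctx;
	struct bpf_iter_aux_info {
		/* Kept outside the union below: map element iterators take
		 * their seq_info from map->ops rather than the iterator
		 * registration, so bpf_iter.c checks aux.map directly.
		 */
		struct bpf_map *map;
		union {
			struct io_ring_ctx *ctx;	/* io_uring iterators */
			struct file *ep;		/* epoll iterator */
		};
	};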

> 
>>>    typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>> index 6297eafdc40f..3323defa99a1 100644
>>> --- a/tools/include/uapi/linux/bpf.h
>>> +++ b/tools/include/uapi/linux/bpf.h
>>> @@ -91,6 +91,9 @@ union bpf_iter_link_info {
>>>    	struct {
>>>    		__u32	map_fd;
>>>    	} map;
>>> +	struct {
>>> +		__u32   io_uring_fd;
>>> +	} io_uring;
>>>    };
>>>    /* BPF syscall commands, see bpf(2) man-page for more details. */
>>>
> 
> --
> Kartikeya
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-18 18:30     ` Kumar Kartikeya Dwivedi
@ 2021-11-18 19:18       ` Yonghong Song
  2021-11-18 19:22         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 19:18 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel



On 11/18/21 10:30 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 10:57:15PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> In CRIU, we need to be able to determine whether the page pinned by
>>> io_uring is still present in the same range in the process VMA.
>>> /proc/<pid>/pagemap gives us the PFN, hence using this helper we can
>>> establish this mapping easily from the iterator side.
>>>
>>> It is a simple wrapper over the in-kernel page_to_pfn helper, and
>>> ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is
>>> obtained from the bvec of io_uring_ubuf for the CRIU usecase.
>>>
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> ---
>>>    fs/io_uring.c                  | 17 +++++++++++++++++
>>>    include/linux/bpf.h            |  1 +
>>>    include/uapi/linux/bpf.h       |  9 +++++++++
>>>    kernel/trace/bpf_trace.c       |  2 ++
>>>    scripts/bpf_doc.py             |  2 ++
>>>    tools/include/uapi/linux/bpf.h |  9 +++++++++
>>>    6 files changed, 40 insertions(+)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 46a110989155..9e9df6767e29 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -11295,6 +11295,23 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
>>>    	.seq_info	   = &bpf_io_uring_buf_seq_info,
>>>    };
>>> +BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
>>> +{
>>> +	/* PTR_TO_BTF_ID can be NULL */
>>> +	if (!page)
>>> +		return U64_MAX;
>>> +	return page_to_pfn(page);
>>> +}
>>> +
>>> +BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
>>> +
>>> +const struct bpf_func_proto bpf_page_to_pfn_proto = {
>>> +	.func		= bpf_page_to_pfn,
>>> +	.ret_type	= RET_INTEGER,
>>> +	.arg1_type	= ARG_PTR_TO_BTF_ID,
>>> +	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
>>
>> Does this helper need to be gpl_only?

The typical guideline is whether the same info can be retrieved from
userspace. If yes, no gpl is needed. Otherwise, it needs to be gpl.
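
If it did need to be restricted, that is a single flag on the proto; a sketch
based on the proto as posted:

	const struct bpf_func_proto bpf_page_to_pfn_proto = {
		.func		= bpf_page_to_pfn,
		.gpl_only	= true,	/* only if userspace cannot get this info */
		.ret_type	= RET_INTEGER,
		.arg1_type	= ARG_PTR_TO_BTF_ID,
		.arg1_btf_id	= &btf_page_to_pfn_ids[0],
	};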

Also, the helper is implemented in io_uring.c but used by tracing
programs. Maybe we can put it in bpf_trace.c? The helper itself is not
tied to io_uring, right?

>>
> 
> Not sure about it, it wraps over a macro.
> 
>>> +};
>>> +
>>>    static int __init io_uring_iter_init(void)
>>>    {
>>>    	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
>> [...]
> 
> --
> Kartikeya
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 5/8] selftests/bpf: Add test for io_uring BPF iterators
  2021-11-18 18:33     ` Kumar Kartikeya Dwivedi
@ 2021-11-18 19:21       ` Yonghong Song
  0 siblings, 0 replies; 34+ messages in thread
From: Yonghong Song @ 2021-11-18 19:21 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel



On 11/18/21 10:33 AM, Kumar Kartikeya Dwivedi wrote:
> On Thu, Nov 18, 2021 at 11:24:19PM IST, Yonghong Song wrote:
>>
>>
>> On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
>>> This exercises the io_uring_buf and io_uring_file iterators, and tests
>>> sparse file sets as well.
>>>
>>> Cc: Jens Axboe <axboe@kernel.dk>
>>> Cc: Pavel Begunkov <asml.silence@gmail.com>
>>> Cc: io-uring@vger.kernel.org
>>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> ---
>>>    .../selftests/bpf/prog_tests/bpf_iter.c       | 226 ++++++++++++++++++
>>>    .../selftests/bpf/progs/bpf_iter_io_uring.c   |  50 ++++
>>>    2 files changed, 276 insertions(+)
>>>    create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_io_uring.c
>>>
[...]
>>> +
>>> +	ret = snprintf(str, sizeof(buf) - len, "E:%zu\n", ARRAY_SIZE(iovs));
>>> +	if (!ASSERT_GE(ret, 0, "snprintf") || !ASSERT_LT(ret, sizeof(buf) - len, "snprintf"))
>>> +		goto end_close_fd;
>>> +
>>> +	iter_fd = bpf_iter_create(bpf_link__fd(skel->links.dump_io_uring_buf));
>>> +	if (!ASSERT_GE(iter_fd, 0, "bpf_iter_create"))
>>> +		goto end_close_fd;
>>> +
>>> +	ret = read_fd_into_buffer(iter_fd, rbuf, sizeof(rbuf));
>>> +	if (!ASSERT_GT(ret, 0, "read_fd_into_buffer"))
>>> +		goto end_close_iter;
>>> +
>>> +	ASSERT_OK(strcmp(rbuf, buf), "compare iterator output");
>>> +
>>> +	puts("=== Expected Output ===");
>>> +	printf("%s", buf);
>>> +	puts("==== Actual Output ====");
>>> +	printf("%s", rbuf);
>>> +	puts("=======================");
>>
>> Maybe you can do an actual comparison and use ASSERT_* macros to check
>> result?
>>
> 
> I already did that in the line above the first "puts". The printing is just
> for better debugging, to show the incorrect output in case the test fails.
> Also, in the epoll test in the next patch the order of entries is not fixed,
> since they are sorted by struct file pointer.

I see; maybe the following, which prints out the details only on failure?

	if (!ASSERT_OK(strcmp(rbuf, buf), "compare iterator output")) {
		puts("=== ...");
		...
	}

> 
>>> +
>>> +end_close_iter:
>>> +	close(iter_fd);
>>> +end_close_fd:
>>> +	close(fd);
>>> +end:
>>> +	while (i--)
>>> +		munmap(iovs[i].iov_base, iovs[i].iov_len);
>>> +	bpf_iter_io_uring__destroy(skel);
>>> +}
>>> +
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 2/8] bpf: Add bpf_page_to_pfn helper
  2021-11-18 19:18       ` Yonghong Song
@ 2021-11-18 19:22         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-18 19:22 UTC (permalink / raw)
  To: Yonghong Song
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	io-uring, linux-fsdevel

On Fri, Nov 19, 2021 at 12:48:39AM IST, Yonghong Song wrote:
>
>
> On 11/18/21 10:30 AM, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Nov 18, 2021 at 10:57:15PM IST, Yonghong Song wrote:
> > >
> > >
> > > On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
> > > > In CRIU, we need to be able to determine whether the page pinned by
> > > > io_uring is still present in the same range in the process VMA.
> > > > /proc/<pid>/pagemap gives us the PFN, hence using this helper we can
> > > > establish this mapping easily from the iterator side.
> > > >
> > > > It is a simple wrapper over the in-kernel page_to_pfn helper, and
> > > > ensures the passed in pointer is a struct page PTR_TO_BTF_ID. This is
> > > > obtained from the bvec of io_uring_ubuf for the CRIU usecase.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > >    fs/io_uring.c                  | 17 +++++++++++++++++
> > > >    include/linux/bpf.h            |  1 +
> > > >    include/uapi/linux/bpf.h       |  9 +++++++++
> > > >    kernel/trace/bpf_trace.c       |  2 ++
> > > >    scripts/bpf_doc.py             |  2 ++
> > > >    tools/include/uapi/linux/bpf.h |  9 +++++++++
> > > >    6 files changed, 40 insertions(+)
> > > >
> > > > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > > > index 46a110989155..9e9df6767e29 100644
> > > > --- a/fs/io_uring.c
> > > > +++ b/fs/io_uring.c
> > > > @@ -11295,6 +11295,23 @@ static struct bpf_iter_reg io_uring_buf_reg_info = {
> > > >    	.seq_info	   = &bpf_io_uring_buf_seq_info,
> > > >    };
> > > > +BPF_CALL_1(bpf_page_to_pfn, struct page *, page)
> > > > +{
> > > > +	/* PTR_TO_BTF_ID can be NULL */
> > > > +	if (!page)
> > > > +		return U64_MAX;
> > > > +	return page_to_pfn(page);
> > > > +}
> > > > +
> > > > +BTF_ID_LIST_SINGLE(btf_page_to_pfn_ids, struct, page)
> > > > +
> > > > +const struct bpf_func_proto bpf_page_to_pfn_proto = {
> > > > +	.func		= bpf_page_to_pfn,
> > > > +	.ret_type	= RET_INTEGER,
> > > > +	.arg1_type	= ARG_PTR_TO_BTF_ID,
> > > > +	.arg1_btf_id	= &btf_page_to_pfn_ids[0],
> > >
> > > Does this helper need to be gpl_only?
>
> The typical guideline is whether the same info can be retrieved from
> userspace. If yes, no gpl is needed. Otherwise, it needs to be gpl.
>

I think it can already be obtained using /proc/<PID>/pagemap, so not needed.

> Also, the helper is implemented in io_uring.c but used by tracing
> programs. Maybe we can put it in bpf_trace.c? The helper itself is not
> tied to io_uring, right?
>

Right, I'll move it outside CONFIG_IO_URING in v2. That's also why the build
failed on this patch.

--
Kartikeya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-16  5:42 ` [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Kumar Kartikeya Dwivedi
  2021-11-18 17:21   ` Yonghong Song
@ 2021-11-18 22:02   ` Alexei Starovoitov
  2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2021-11-18 22:02 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> This change adds eBPF iterator for buffers registered in io_uring ctx.
> It gives access to the ctx, the index of the registered buffer, and a
> pointer to the io_uring_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance, that isn't easy
> to export using the fdinfo interface (like exact struct page composing
> the registered buffer).
> 
> The primary usecase this is enabling is checkpoint/restore support.
> 
> Note that we need to use mutex_trylock when the file is read from, in
> seq_start functions, as the order of lock taken is opposite of what it
> would be when io_uring operation reads the same file.  We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
> 
> This can lead to a deadlock scenario described below:
> 
>       CPU 0				CPU 1
> 
>       vfs_read
>       mutex_lock(&seq_file->lock)	io_read
> 					mutex_lock(&ctx->uring_lock)
>       mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> 					mutex_lock(&seq_file->lock)
> 
> The trylock also protects the case where io_uring tries to read from
> iterator attached to itself (same ctx), where the order of locks would
> be:
>  io_uring_enter
>   mutex_lock(&ctx->uring_lock) <-----------.
>   io_read				    \
>    seq_read				     \
>     mutex_lock(&seq_file->lock)		     /
>     mutex_lock(&ctx->uring_lock) # deadlock-`
> 
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
> 
> In the future, this iterator will be extended to directly support
> iteration of bvec Flexible Array Member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if VMA is
> destroyed after pinning pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All this is
> out of scope for the current series however, but builds upon this
> iterator.

From BPF infra perspective these new iterators fit very well and
I don't see any issues maintaining this interface while kernel keeps
changing, but this commit log and shallowness of the selftests
makes me question feasibility of this approach in particular with io_uring.
Is it even possible to scan all internal bits of io_uring and reconstruct
it later? The bpf iter is only the read part. Don't you need the write part
for CRIU ? Even for reads only... io_uring has complex inner state.
Like bpf itself which cannot be realistically CRIU-ed.
I don't think we can merge this in pieces. We need to wait until there is
full working CRIU framework that uses these new iterators.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-18 22:02   ` Alexei Starovoitov
@ 2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
  2021-11-19  4:44       ` Kumar Kartikeya Dwivedi
  2021-11-19  4:56       ` Alexei Starovoitov
  0 siblings, 2 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-19  4:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Fri, Nov 19, 2021 at 03:32:26AM IST, Alexei Starovoitov wrote:
> On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > This change adds eBPF iterator for buffers registered in io_uring ctx.
> > It gives access to the ctx, the index of the registered buffer, and a
> > pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance, that isn't easy
> > to export using the fdinfo interface (like exact struct page composing
> > the registered buffer).
> >
> > The primary usecase this is enabling is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > seq_start functions, as the order of lock taken is opposite of what it
> > would be when io_uring operation reads the same file.  We take
> > seq_file->lock, then ctx->uring_lock, while io_uring would first take
> > ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> >       CPU 0				CPU 1
> >
> >       vfs_read
> >       mutex_lock(&seq_file->lock)	io_read
> > 					mutex_lock(&ctx->uring_lock)
> >       mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> > 					mutex_lock(&seq_file->lock)
> >
> > The trylock also protects the case where io_uring tries to read from
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> >  io_uring_enter
> >   mutex_lock(&ctx->uring_lock) <-----------.
> >   io_read				    \
> >    seq_read				     \
> >     mutex_lock(&seq_file->lock)		     /
> >     mutex_lock(&ctx->uring_lock) # deadlock-`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if VMA is
> > destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series however, but builds upon this
> > iterator.
>
> From BPF infra perspective these new iterators fit very well and
> I don't see any issues maintaining this interface while kernel keeps
> changing, but this commit log and shallowness of the selftests
> makes me question feasibility of this approach in particular with io_uring.
> Is it even possible to scan all internal bits of io_uring and reconstruct
> it later? The bpf iter is only the read part. Don't you need the write part
> for CRIU ? Even for reads only... io_uring has complex inner state.

Yes, the inner state is complex and often entangled with other task
attributes, like credentials, mm, file descriptors, other io_uring fds (wq_fd,
etc.), but for now these iterators are a good starting point to implement the
majority of the missing features in CRIU userspace. These iterators (including
the task* ones) and procfs allow us to collect enough state to correlate
various resources and form relationships (e.g. which fd or buffer of which
task(s) is registered, which io_uring was used for wq_fd registration, or
which eventfd was registered, etc.). Thanks to access to the io_ring_ctx, and
the iter being a tracing prog, we can do the usual pointer access to read some
of that data.
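
For instance, the dump side can be as small as the sketch below (the context
struct and field names are the ones from this series; vmlinux.h is assumed to
be generated from a kernel carrying these patches):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	SEC("iter/io_uring_buf")
	int dump_io_uring_buf(struct bpf_iter__io_uring_buf *ctx)
	{
		struct io_mapped_ubuf *ubuf = ctx->ubuf;

		if (!ubuf)	/* ubuf is PTR_TO_BTF_ID_OR_NULL */
			return 0;
		/* the usual pointer access into the ubuf works here */
		BPF_SEQ_PRINTF(ctx->meta->seq, "%lu:0x%lx:%lu\n", ctx->index,
			       (unsigned long)ubuf->ubuf,
			       (unsigned long)(ubuf->ubuf_end - ubuf->ubuf));
		return 0;
	}

	char _license[] SEC("license") = "GPL";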

> Like bpf itself which cannot be realistically CRIU-ed.
> I don't think we can merge this in pieces. We need to wait until there is
> full working CRIU framework that uses these new iterators.

I don't intend to add a 'write' framework. The usual process with CRIU is to
gather enough state to reconstruct the kernel resource by repeating steps
similar to what the task might have performed during its lifetime, e.g.
approximating a trace of execution leading to the same task state and then
repeating that on restore.

While a 'write' framework with first class kernel support for
checkpoint/restore would actually make CRIU's life a lot simpler, there has
usually been a lot of pushback against interfaces like that, hence the
approach has been to add features that can extract the relevant information
out of the kernel, and do the rest of the work in userspace.

E.g. if we find that fds 1 and 2 are registered in the io_uring, and a buffer
at 0xabcd with len 128, then at restore time we first restore fds 1 and 2
(which can be anything), and after that is complete, continue restoration of
the io_uring by executing the file registration step for both [0]. Then,
depending on whether the mapping for 0xabcd still exists in the task mm, we
restore the buffer once the mm of the task has been restored, or otherwise map
a temporary buffer, replay the data, register it, and destroy the temporary
mapping. This would be similar to what might have happened in the task itself.
The correct order of events for the restore will be managed by CRIU itself.
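
In code, the file registration leg of that replay is just the normal
registration step done again at restore time; a minimal sketch, assuming
ring_fd and the two target fds have already been restored by CRIU:

	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/io_uring.h>

	/* hypothetical restore-side replay for the "fds 1 and 2" example */
	static int restore_reg_files(int ring_fd)
	{
		int fds[2] = { 1, 2 };	/* fds restored earlier by CRIU */

		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_FILES, fds, 2);
	}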

It doesn't have to be exactly the same steps, just enough for the task to not
notice a difference and not break its assumptions, and to lead to the same end
result of task and resource state.

Even the task of determining which fdtable fds 1 and 2 belong to is
non-trivial and inefficient right now. You can look at [0] for a similar case
that was solved for epoll; it works, but we can do better (BPF didn't exist at
that time, so it was the best we could do back then).

Also, this work is part of GSoC. There is already code that is waiting for this
to fill in the missing pieces [0]. If you want me to add a sample/selftest that
demonstrates/tests how this can be used to reconstruct a task's io_uring, I can
certainly do that. We've already spent a few months contemplating a few
approaches, and this turned out to be the best/most powerful. At one point I
had to scrap some of my earlier patches completely because they couldn't work
with descriptorless io_uring. Iterators seem like the best solution so far, as
they can adapt gracefully to feature additions in something under development
as heavy as io_uring.

  [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
  [1]: https://github.com/checkpoint-restore/criu/pull/1597

--
Kartikeya

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
@ 2021-11-19  4:44       ` Kumar Kartikeya Dwivedi
  2021-11-19  4:56       ` Alexei Starovoitov
  1 sibling, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-19  4:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Fri, Nov 19, 2021 at 09:45:23AM IST, Kumar Kartikeya Dwivedi wrote:
> [...]
> E.g. if we find that fd 1 and 2 are registered in the io_uring, and a buffer
> of len 128 at 0xabcd, then at restore time we first restore fd 1 and 2 (which
> can be anything), and once that is complete, continue restoring the io_uring
> by executing the file registration step for both [0].

This was meant to link to https://criu.org/Fdinfo_engine.

> [...]

--
Kartikeya


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  4:15     ` Kumar Kartikeya Dwivedi
  2021-11-19  4:44       ` Kumar Kartikeya Dwivedi
@ 2021-11-19  4:56       ` Alexei Starovoitov
  2021-11-19  5:16         ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2021-11-19  4:56 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> 
> [...] There is already code that is waiting for this to fill in the
> missing pieces [0]. [...]
> 
>   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
>   [1]: https://github.com/checkpoint-restore/criu/pull/1597

Is that the main PR? 1095 changed files? Is it stale or something?
Is there a way to view the actual logic that exercises these bpf iterators?


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  4:56       ` Alexei Starovoitov
@ 2021-11-19  5:16         ` Kumar Kartikeya Dwivedi
  2021-11-19  5:24           ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-19  5:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, linux-fsdevel

On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
> On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> >
> > [...] There is already code that is waiting for this to fill in the
> > missing pieces [0]. [...]
> >
> >   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
> >   [1]: https://github.com/checkpoint-restore/criu/pull/1597
>
> Is that the main PR? 1095 changed files? Is it stale or something?
> Is there a way to view the actual logic that exercises these bpf iterators?

No, there is no code exercising the BPF iterators in that PR yet (since it
wouldn't build/run in CI). There's some code I have locally that uses these to
collect the necessary information; I can post that either as a sample or
selftest in the next version, or separately on GH for you to take a look.

I have rebased it anyway, so that you can see the rest of the actual code.

--
Kartikeya


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  5:16         ` Kumar Kartikeya Dwivedi
@ 2021-11-19  5:24           ` Alexei Starovoitov
  2021-11-19  6:12             ` Kumar Kartikeya Dwivedi
  2021-12-03 15:52             ` Pavel Begunkov
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2021-11-19  5:24 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, Linux-Fsdevel

On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, Nov 19, 2021 at 10:26:59AM IST, Alexei Starovoitov wrote:
> > On Fri, Nov 19, 2021 at 09:45:23AM +0530, Kumar Kartikeya Dwivedi wrote:
> > >
> > > [...] There is already code that is waiting for this to fill in the
> > > missing pieces [0]. [...]
> > >
> > >   [0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
> > >   [1]: https://github.com/checkpoint-restore/criu/pull/1597
> >
> > Is that the main PR? 1095 changed files? Is it stale or something?
> > Is there a way to view the actual logic that exercises these bpf iterators?
>
> No, there is no code exercising the BPF iterators in that PR yet (since it
> wouldn't build/run in CI). There's some code I have locally that uses these
> to collect the necessary information; I can post that either as a sample or
> selftest in the next version, or separately on GH for you to take a look.
>
> I have rebased it anyway, so that you can see the rest of the actual code.

I would like to see a working end to end solution.

Also I'd like to hear what Jens and Pavel have to say about
applicability of CRIU to io_uring in general.


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  5:24           ` Alexei Starovoitov
@ 2021-11-19  6:12             ` Kumar Kartikeya Dwivedi
  2021-12-03 15:52             ` Pavel Begunkov
  1 sibling, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-11-19  6:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Jens Axboe, Pavel Begunkov, io-uring, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Pavel Emelyanov,
	Alexander Mihalicyn, Andrei Vagin, criu, Linux-Fsdevel

On Fri, Nov 19, 2021 at 10:54:21AM IST, Alexei Starovoitov wrote:
> On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > [...]
> >
> > I have rebased it anyway, so that you can see the rest of the actual code.
>
> I would like to see a working end to end solution.
>

Understood, I'll address this in v2.

> Also I'd like to hear what Jens and Pavel have to say about
> applicability of CRIU to io_uring in general.

--
Kartikeya


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-11-19  5:24           ` Alexei Starovoitov
  2021-11-19  6:12             ` Kumar Kartikeya Dwivedi
@ 2021-12-03 15:52             ` Pavel Begunkov
  2021-12-03 23:16               ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 34+ messages in thread
From: Pavel Begunkov @ 2021-12-03 15:52 UTC (permalink / raw)
  To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
  Cc: bpf, Jens Axboe, io-uring, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Pavel Emelyanov, Alexander Mihalicyn,
	Andrei Vagin, criu, Linux-Fsdevel

On 11/19/21 05:24, Alexei Starovoitov wrote:
> On Thu, Nov 18, 2021 at 9:17 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>> [...]
> 
> I would like to see a working end to end solution.
> 
> Also I'd like to hear what Jens and Pavel have to say about
> applicability of CRIU to io_uring in general.

First, we have no way to know what requests are in flight, and without that
CR doesn't make much sense. The most compelling way for me is to add a
feature to fail all in-flight requests, as happens when the ring is closed.
But maybe you already solved it somehow?

There is probably a way to restore registered buffers and files, though
it may be tough considering that files may not have corresponding fds in
userspace, buffers may be unmapped, buffers may come from shmem, etc.,
and other corner cases.

There are also pieces of state not covered here: SELECT_BUFFER buffers,
personalities (aka creds), registered eventfd, io-wq configuration, etc.
I'm assuming you'll be checking for them and failing CR if any of them
is present.

And the last point: there will be some stuff for which CR is likely to
be a bad idea, e.g. registered dmabufs, pre-registered DMA mappings,
zerocopy contexts, and so on.

IOW, if the first point is solved, there may be a subset of ring
setups that can probably be CRed. That should cover a good amount
of cases. I don't have a strong opinion on the whole thing; I guess
it depends on how many problems implementing in-flight cancellation
brings.

-- 
Pavel Begunkov


* Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers
  2021-12-03 15:52             ` Pavel Begunkov
@ 2021-12-03 23:16               ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 34+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2021-12-03 23:16 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Alexei Starovoitov, bpf, Jens Axboe, io-uring,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Pavel Emelyanov, Alexander Mihalicyn, Andrei Vagin, criu,
	Linux-Fsdevel

On Fri, Dec 03, 2021 at 09:22:54PM IST, Pavel Begunkov wrote:
> On 11/19/21 05:24, Alexei Starovoitov wrote:
> > [...]
> >
> > Also I'd like to hear what Jens and Pavel have to say about
> > applicability of CRIU to io_uring in general.
>

Hi Pavel, thanks for taking a look!

> First, we have no way to know what requests are in flight, and without that
> CR doesn't make much sense. The most compelling way for me is to add a
> feature to fail all in-flight requests, as happens when the ring is closed.
> But maybe you already solved it somehow?

Indeed, as you note, there is currently no way to inspect in-flight requests;
what we can do is wait for a barrier operation to synchronize against all
previous requests.

So for now, my idea is to drain the ring (waiting for all in-flight requests
to complete) by submitting an IORING_OP_NOP with IOSQE_IO_DRAIN, and then
waiting with a fixed timeout, so that if forward progress depends on a blocked
task (or on ourselves) we can fail the dump. This is of course best effort,
but it has worked well for many of the cases I have tested so far.
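
In liburing terms the drain looks roughly like this (a best-effort sketch;
real code would also tag user_data to tell the NOP's CQE apart from other
completions):

  #include <errno.h>
  #include <liburing.h>

  /* Sketch: quiesce the ring by submitting a drained NOP, which only
   * completes after all previously submitted requests have. */
  static int drain_ring(struct io_uring *ring, unsigned int timeout_sec)
  {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          struct io_uring_cqe *cqe;
          struct __kernel_timespec ts = { .tv_sec = timeout_sec };
          int ret;

          if (!sqe)
                  return -EBUSY;
          io_uring_prep_nop(sqe);
          sqe->flags |= IOSQE_IO_DRAIN;
          ret = io_uring_submit(ring);
          if (ret < 0)
                  return ret;
          /* -ETIME means forward progress likely depends on a blocked task
           * (or on ourselves); the dump is failed in that case. */
          ret = io_uring_wait_cqe_timeout(ring, &cqe, &ts);
          if (!ret)
                  io_uring_cqe_seen(ring, cqe);
          return ret;
  }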

This might have some other issues, e.g. not being able to accommodate all
posted completions in the CQ ring due to unreaped completions from when it was
checkpointed. In that case we can simply give up, since recreating the ring as
it was becomes very hard if we let it trample over unread items (it is unclear
how I can send in completions at restore time that were in the overflow list
at dump time).

One idea I had in mind was to add support for posting a dummy CQE (by
submitting e.g. an IORING_OP_NOP) where the fields of the CQE are set at
submission time. This allows faking a completed request; at restore time we
can then push all of these into the overflow list and present the state as if
the CQ ring had been full. At dump time it allows us to continually reap
completion items. If we detect that the kernel doesn't support overflow, we
fail.

Adjusting the kernel-side tail is not as hard (we can use IORING_OP_NOP
completions to fill it up, then rewrite the entries).

There were other operations (like registering buffers) that had a similar
side effect of synchronizing ring state (waiting for it to become idle) before
returning to userspace, but that was pre-5.13.

Also, we have to do this ring synchronization fairly early during the dump,
since it can add other resources (e.g. fds) to the task that then need to be
dumped as well.

> There is probably a way to restore registered buffers and files, though
> it may be tough considering that files may not have corresponding fds in
> userspace, buffers may be unmapped, buffers may come from shmem, etc.,
> and other corner cases.

See [0] for some explanation of all that. CRIU also knows whether a given VMA
comes from shmem or not (and its restoration is already handled separately).

>
> There are also pieces of state not covered here: SELECT_BUFFER buffers,
> personalities (aka creds), registered eventfd, io-wq configuration, etc.
> I'm assuming you'll be checking for them and failing CR if any of them
> is present.

Personalities are not as hard (IIUC), because all the required state is
available through fdinfo. In the PR linked in this thread, there is code to
parse it and restore using the saved credentials (though we might want to
implement UID mapping options, or let the user do image rewriting for that,
which is a separate concern).

Ideally I'd like to be able to grab this state from the iterator as well, but
that needs at least a bpf_xa_for_each helper, since io_uring's show_fdinfo
skips some crucial data when it detects contention on uring_lock (and doesn't
indicate this at all) :(. See the conditional printing on 'has_lock'.

SELECT_BUFFER is indeed unhandled right now. I'm contemplating ways to extend
the iterator so that it can loop over all items of generic structures like
xarrays, while taking the locks appropriate for the specific hook. Both
personality registration and IORING_OP_PROVIDE_BUFFERS insertion use an
xarray, so it might make sense to add a bpf_xa_for_each helper rather than
introduce another iterator, and only mark it as safe for this iterator context
(where the appropriate locks, e.g. ctx->uring_lock, are held).
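
Roughly, the shape I have in mind (entirely hypothetical: neither the helper
nor the callback convention below exists yet, this just mirrors the style of
bpf_for_each_map_elem):

  /* Hypothetical helper, usable only in iterator contexts where the
   * right lock (e.g. ctx->uring_lock) is already held. */
  long bpf_xa_for_each(struct xarray *xa, void *callback_fn,
                       void *callback_ctx, u64 flags);

  /* Hypothetical callback, one invocation per xarray slot; returning 0
   * continues the walk, 1 stops it (as with existing BPF callbacks). */
  static long dump_personality(u64 index, void *entry, void *cb_ctx)
  {
          /* e.g. BPF_SEQ_PRINTF the creds stashed at this index */
          return 0;
  }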

For registered eventfd and io-wq, you can look at [0] to see how I am solving
that; TL;DR, I just map the underlying structure to the open fd in the task.
eBPF is flexible enough to also allow state inspection in case e.g. the
corresponding eventfd has been closed, so that we can recreate it, register
it, and then close it again when restoring. The same goes for files added
directly to the fixed file set; the whole idea of bringing in eBPF was to make
dumping these resources possible when they are "hidden" from normal view.

[0]: https://lore.kernel.org/bpf/20211201042333.2035153-11-memxor@gmail.com

>
> And the last point: there will be some stuff for which CR is likely to
> be a bad idea, e.g. registered dmabufs, pre-registered DMA mappings,
> zerocopy contexts, and so on.
>

Yes, we can just fail the dump in these cases. There are many other cases, in
general, where we just have to give up.

> IOW, if the first point is solved, there may be a subset of ring
> setups that can probably be CRed. That should cover a good amount
> of cases. I don't have a strong opinion on the whole thing; I guess
> it depends on how many problems implementing in-flight cancellation
> brings.
>
> --
> Pavel Begunkov

--
Kartikeya

