io-uring.vger.kernel.org archive mirror
* [PATCHSET 0/5] User mapped provided buffer rings
@ 2023-03-14 17:16 Jens Axboe
  2023-03-14 17:16 ` [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller

Hi,

One issue that became apparent when running io_uring code on parisc is
that for data shared between the application and the kernel, we must
ensure that it's placed correctly to avoid aliasing issues that render
it useless.

The first patch in this series is from Helge, and ensures that the
SQ/CQ rings are mapped appropriately. This makes io_uring actually work
there.

Patches 2..4 are prep patches for patch 5, which adds a variant of
ring mapped provided buffers that have the kernel allocate the memory
for them and the application mmap() it. This brings these mapped
buffers in line with how the SQ/CQ rings are managed too.

I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
of which there is only parisc, or if SHMLBA setting archs (of which
there are others) are impacted to any degree as well...

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
@ 2023-03-14 17:16 ` Jens Axboe
  2023-07-12  4:43   ` matoro
  2023-03-14 17:16 ` [PATCH 2/5] io_uring/kbuf: move pinning of provided buffer ring into helper Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller, Jens Axboe

From: Helge Deller <deller@gmx.de>

Some architectures have memory cache aliasing requirements (e.g. parisc)
if memory is shared between userspace and kernel. This patch fixes the
kernel to return an aliased address when asked by userspace via mmap().

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 722624b6d0dc..3adecebbac71 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -72,6 +72,7 @@
 #include <linux/io_uring.h>
 #include <linux/audit.h>
 #include <linux/security.h>
+#include <asm/shmparam.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
 }
 
+static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
+			unsigned long addr, unsigned long len,
+			unsigned long pgoff, unsigned long flags)
+{
+	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
+	struct vm_unmapped_area_info info;
+	void *ptr;
+
+	/*
+	 * Do not allow to map to user-provided address to avoid breaking the
+	 * aliasing rules. Userspace is not able to guess the offset address of
+	 * kernel kmalloc()ed memory area.
+	 */
+	if (addr)
+		return -EINVAL;
+
+	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
+	if (IS_ERR(ptr))
+		return -ENOMEM;
+
+	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+	info.length = len;
+	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
+	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
+#ifdef SHM_COLOUR
+	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
+#else
+	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
+#endif
+	info.align_offset = (unsigned long) ptr;
+
+	/*
+	 * A failed mmap() very likely causes application failure,
+	 * so fall back to the bottom-up function here. This scenario
+	 * can happen with large stack limits and large mmap()
+	 * allocations.
+	 */
+	addr = vm_unmapped_area(&info);
+	if (offset_in_page(addr)) {
+		info.flags = 0;
+		info.low_limit = TASK_UNMAPPED_BASE;
+		info.high_limit = mmap_end;
+		addr = vm_unmapped_area(&info);
+	}
+
+	return addr;
+}
+
 #else /* !CONFIG_MMU */
 
 static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
@@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops = {
 #ifndef CONFIG_MMU
 	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
 	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
+#else
+	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
 #endif
 	.poll		= io_uring_poll,
 #ifdef CONFIG_PROC_FS
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/5] io_uring/kbuf: move pinning of provided buffer ring into helper
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
  2023-03-14 17:16 ` [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements Jens Axboe
@ 2023-03-14 17:16 ` Jens Axboe
  2023-03-14 17:16 ` [PATCH 3/5] io_uring/kbuf: add buffer_list->is_mapped member Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller, Jens Axboe

In preparation for allowing the kernel to allocate the provided buffer
rings and have the application mmap it instead, abstract out the
current method of pinning and mapping the user allocated ring.

No functional changes intended in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/kbuf.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 3002dc827195..3adc08f90e41 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -463,14 +463,32 @@ int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
-int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
+			    struct io_buffer_list *bl)
 {
 	struct io_uring_buf_ring *br;
-	struct io_uring_buf_reg reg;
-	struct io_buffer_list *bl, *free_bl = NULL;
 	struct page **pages;
 	int nr_pages;
 
+	pages = io_pin_pages(reg->ring_addr,
+			     flex_array_size(br, bufs, reg->ring_entries),
+			     &nr_pages);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	br = page_address(pages[0]);
+	bl->buf_pages = pages;
+	bl->buf_nr_pages = nr_pages;
+	bl->buf_ring = br;
+	return 0;
+}
+
+int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
+{
+	struct io_uring_buf_reg reg;
+	struct io_buffer_list *bl, *free_bl = NULL;
+	int ret;
+
 	if (copy_from_user(&reg, arg, sizeof(reg)))
 		return -EFAULT;
 
@@ -504,20 +522,15 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 			return -ENOMEM;
 	}
 
-	pages = io_pin_pages(reg.ring_addr,
-			     flex_array_size(br, bufs, reg.ring_entries),
-			     &nr_pages);
-	if (IS_ERR(pages)) {
+	ret = io_pin_pbuf_ring(&reg, bl);
+	if (ret) {
 		kfree(free_bl);
-		return PTR_ERR(pages);
+		return ret;
 	}
 
-	br = page_address(pages[0]);
-	bl->buf_pages = pages;
-	bl->buf_nr_pages = nr_pages;
 	bl->nr_entries = reg.ring_entries;
-	bl->buf_ring = br;
 	bl->mask = reg.ring_entries - 1;
+
 	io_buffer_add_list(ctx, bl, reg.bgid);
 	return 0;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/5] io_uring/kbuf: add buffer_list->is_mapped member
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
  2023-03-14 17:16 ` [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements Jens Axboe
  2023-03-14 17:16 ` [PATCH 2/5] io_uring/kbuf: move pinning of provided buffer ring into helper Jens Axboe
@ 2023-03-14 17:16 ` Jens Axboe
  2023-03-14 17:16 ` [PATCH 4/5] io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to 'flags' Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller, Jens Axboe

Rather than rely on checking buffer_list->buf_pages or ->buf_nr_pages,
add a separate member that tracks if this is a ring mapped provided
buffer list or not.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/kbuf.c | 14 ++++++++------
 io_uring/kbuf.h |  3 +++
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 3adc08f90e41..db5f189267b7 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -179,7 +179,7 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 
 	bl = io_buffer_get_list(ctx, req->buf_index);
 	if (likely(bl)) {
-		if (bl->buf_nr_pages)
+		if (bl->is_mapped)
 			ret = io_ring_buffer_select(req, len, bl, issue_flags);
 		else
 			ret = io_provided_buffer_select(req, len, bl);
@@ -214,7 +214,7 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 	if (!nbufs)
 		return 0;
 
-	if (bl->buf_nr_pages) {
+	if (bl->is_mapped && bl->buf_nr_pages) {
 		int j;
 
 		i = bl->buf_ring->tail - bl->head;
@@ -225,6 +225,7 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 		bl->buf_nr_pages = 0;
 		/* make sure it's seen as empty */
 		INIT_LIST_HEAD(&bl->buf_list);
+		bl->is_mapped = 0;
 		return i;
 	}
 
@@ -303,7 +304,7 @@ int io_remove_buffers(struct io_kiocb *req, unsigned int issue_flags)
 	if (bl) {
 		ret = -EINVAL;
 		/* can't use provide/remove buffers command on mapped buffers */
-		if (!bl->buf_nr_pages)
+		if (!bl->is_mapped)
 			ret = __io_remove_buffers(ctx, bl, p->nbufs);
 	}
 	io_ring_submit_unlock(ctx, issue_flags);
@@ -448,7 +449,7 @@ int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags)
 		}
 	}
 	/* can't add buffers via this command for a mapped buffer ring */
-	if (bl->buf_nr_pages) {
+	if (bl->is_mapped) {
 		ret = -EINVAL;
 		goto err;
 	}
@@ -480,6 +481,7 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
 	bl->buf_ring = br;
+	bl->is_mapped = 1;
 	return 0;
 }
 
@@ -514,7 +516,7 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	bl = io_buffer_get_list(ctx, reg.bgid);
 	if (bl) {
 		/* if mapped buffer ring OR classic exists, don't allow */
-		if (bl->buf_nr_pages || !list_empty(&bl->buf_list))
+		if (bl->is_mapped || !list_empty(&bl->buf_list))
 			return -EEXIST;
 	} else {
 		free_bl = bl = kzalloc(sizeof(*bl), GFP_KERNEL);
@@ -548,7 +550,7 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	bl = io_buffer_get_list(ctx, reg.bgid);
 	if (!bl)
 		return -ENOENT;
-	if (!bl->buf_nr_pages)
+	if (!bl->is_mapped)
 		return -EINVAL;
 
 	__io_remove_buffers(ctx, bl, -1U);
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index c23e15d7d3ca..61b9c7dade9d 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -23,6 +23,9 @@ struct io_buffer_list {
 	__u16 nr_entries;
 	__u16 head;
 	__u16 mask;
+
+	/* ring mapped provided buffers */
+	__u8 is_mapped;
 };
 
 struct io_buffer {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/5] io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to 'flags'
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
                   ` (2 preceding siblings ...)
  2023-03-14 17:16 ` [PATCH 3/5] io_uring/kbuf: add buffer_list->is_mapped member Jens Axboe
@ 2023-03-14 17:16 ` Jens Axboe
  2023-03-14 17:16 ` [PATCH 5/5] io_uring: add support for user mapped provided buffer ring Jens Axboe
  2023-03-15 20:03 ` [PATCHSET 0/5] User mapped provided buffer rings Helge Deller
  5 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller, Jens Axboe

In preparation for allowing flags to be set for registration, rename
the padding and use it for that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h | 2 +-
 io_uring/kbuf.c               | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..c3f3ea997f3a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -640,7 +640,7 @@ struct io_uring_buf_reg {
 	__u64	ring_addr;
 	__u32	ring_entries;
 	__u16	bgid;
-	__u16	pad;
+	__u16	flags;
 	__u64	resv[3];
 };
 
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index db5f189267b7..4b2f4a0ee962 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -494,7 +494,9 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	if (copy_from_user(&reg, arg, sizeof(reg)))
 		return -EFAULT;
 
-	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
+	if (reg.resv[0] || reg.resv[1] || reg.resv[2])
+		return -EINVAL;
+	if (reg.flags)
 		return -EINVAL;
 	if (!reg.ring_addr)
 		return -EFAULT;
@@ -544,7 +546,9 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 
 	if (copy_from_user(&reg, arg, sizeof(reg)))
 		return -EFAULT;
-	if (reg.pad || reg.resv[0] || reg.resv[1] || reg.resv[2])
+	if (reg.resv[0] || reg.resv[1] || reg.resv[2])
+		return -EINVAL;
+	if (reg.flags)
 		return -EINVAL;
 
 	bl = io_buffer_get_list(ctx, reg.bgid);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/5] io_uring: add support for user mapped provided buffer ring
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
                   ` (3 preceding siblings ...)
  2023-03-14 17:16 ` [PATCH 4/5] io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to 'flags' Jens Axboe
@ 2023-03-14 17:16 ` Jens Axboe
  2023-03-16 18:07   ` Ammar Faizi
  2023-03-15 20:03 ` [PATCHSET 0/5] User mapped provided buffer rings Helge Deller
  5 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-14 17:16 UTC (permalink / raw)
  To: io-uring; +Cc: deller, Jens Axboe

The ring mapped provided buffer rings rely on the application allocating
the memory for the ring, and then the kernel will map it. This generally
works fine, but runs into issues on some architectures where we need
to be able to ensure that the kernel and application virtual address for
the ring play nicely together. This at least impacts architectures that
set SHM_COLOUR, but potentially also anyone setting SHMLBA.

To use this variant of ring provided buffers, the application need not
allocate any memory for the ring. Instead the kernel will do so, and
the application must subsequently call mmap(2) on the ring with the
offset set to:

	IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)

to get a virtual address for the buffer ring. Normally the application
would allocate a suitably sized and correctly aligned piece of memory
and simply pass that in via io_uring_buf_reg.ring_addr, and the kernel
would map it.

Outside of the setup differences, the kernel-allocated + user-mapped
provided buffer ring works exactly the same.
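
For illustration, a rough userspace sketch of that flow could look like
the below. The helper name is made up and error handling is minimal; it
assumes an io_uring instance has already been created elsewhere and that
the uapi header carries the definitions added in this series:

#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* hypothetical helper: register a kernel-allocated ring and mmap() it */
static struct io_uring_buf_ring *setup_mmap_buf_ring(int ring_fd,
						     unsigned int entries,
						     unsigned int bgid)
{
	struct io_uring_buf_reg reg = { 0 };
	size_t ring_size = entries * sizeof(struct io_uring_buf);
	void *ptr;

	reg.ring_entries = entries;	/* must be a power of 2 */
	reg.bgid = bgid;
	reg.flags = IOU_PBUF_RING_MMAP;	/* kernel allocates the memory */
	/* reg.ring_addr is left at 0 for this variant */

	if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_PBUF_RING,
		    &reg, 1) < 0)
		return NULL;

	ptr = mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, ring_fd,
		   IORING_OFF_PBUF_RING | ((uint64_t)bgid << IORING_OFF_PBUF_SHIFT));
	if (ptr == MAP_FAILED)
		return NULL;

	return ptr;
}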

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h | 17 ++++++
 io_uring/io_uring.c           | 13 ++++-
 io_uring/kbuf.c               | 99 +++++++++++++++++++++++++++--------
 io_uring/kbuf.h               |  4 ++
 4 files changed, 109 insertions(+), 24 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index c3f3ea997f3a..1d59c816a5b8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -389,6 +389,9 @@ enum {
 #define IORING_OFF_SQ_RING		0ULL
 #define IORING_OFF_CQ_RING		0x8000000ULL
 #define IORING_OFF_SQES			0x10000000ULL
+#define IORING_OFF_PBUF_RING		0x80000000ULL
+#define IORING_OFF_PBUF_SHIFT		16
+#define IORING_OFF_MMAP_MASK		0xf8000000ULL
 
 /*
  * Filled with the offset for mmap(2)
@@ -635,6 +638,20 @@ struct io_uring_buf_ring {
 	};
 };
 
+/*
+ * Flags for IORING_REGISTER_PBUF_RING.
+ *
+ * IOU_PBUF_RING_MMAP:	If set, kernel will allocate the memory for the ring.
+ *			The application must not set a ring_addr in struct
+ *			io_uring_buf_reg, instead it must subsequently call
+ *			mmap(2) with the offset set as:
+ *			IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT)
+ *			to get a virtual mapping for the ring.
+ */
+enum {
+	IOU_PBUF_RING_MMAP	= 1,
+};
+
 /* argument for IORING_(UN)REGISTER_PBUF_RING */
 struct io_uring_buf_reg {
 	__u64	ring_addr;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3adecebbac71..caebe9c82728 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3283,7 +3283,7 @@ static void *io_uring_validate_mmap_request(struct file *file,
 	struct page *page;
 	void *ptr;
 
-	switch (offset) {
+	switch (offset & IORING_OFF_MMAP_MASK) {
 	case IORING_OFF_SQ_RING:
 	case IORING_OFF_CQ_RING:
 		ptr = ctx->rings;
@@ -3291,6 +3291,17 @@ static void *io_uring_validate_mmap_request(struct file *file,
 	case IORING_OFF_SQES:
 		ptr = ctx->sq_sqes;
 		break;
+	case IORING_OFF_PBUF_RING: {
+		unsigned int bgid;
+
+		bgid = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT;
+		mutex_lock(&ctx->uring_lock);
+		ptr = io_pbuf_get_address(ctx, bgid);
+		mutex_unlock(&ctx->uring_lock);
+		if (!ptr)
+			return ERR_PTR(-EINVAL);
+		break;
+		}
 	default:
 		return ERR_PTR(-EINVAL);
 	}
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 4b2f4a0ee962..cd1d9dddf58e 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -137,7 +137,8 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 		return NULL;
 
 	head &= bl->mask;
-	if (head < IO_BUFFER_LIST_BUF_PER_PAGE) {
+	/* mmaped buffers are always contig */
+	if (bl->is_mmap || head < IO_BUFFER_LIST_BUF_PER_PAGE) {
 		buf = &br->bufs[head];
 	} else {
 		int off = head & (IO_BUFFER_LIST_BUF_PER_PAGE - 1);
@@ -214,15 +215,27 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 	if (!nbufs)
 		return 0;
 
-	if (bl->is_mapped && bl->buf_nr_pages) {
-		int j;
-
+	if (bl->is_mapped) {
 		i = bl->buf_ring->tail - bl->head;
-		for (j = 0; j < bl->buf_nr_pages; j++)
-			unpin_user_page(bl->buf_pages[j]);
-		kvfree(bl->buf_pages);
-		bl->buf_pages = NULL;
-		bl->buf_nr_pages = 0;
+		if (bl->is_mmap) {
+			if (bl->buf_ring) {
+				struct page *page;
+
+				page = virt_to_head_page(bl->buf_ring);
+				if (put_page_testzero(page))
+					free_compound_page(page);
+				bl->buf_ring = NULL;
+			}
+			bl->is_mmap = 0;
+		} else if (bl->buf_nr_pages) {
+			int j;
+
+			for (j = 0; j < bl->buf_nr_pages; j++)
+				unpin_user_page(bl->buf_pages[j]);
+			kvfree(bl->buf_pages);
+			bl->buf_pages = NULL;
+			bl->buf_nr_pages = 0;
+		}
 		/* make sure it's seen as empty */
 		INIT_LIST_HEAD(&bl->buf_list);
 		bl->is_mapped = 0;
@@ -482,6 +495,25 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 	bl->buf_nr_pages = nr_pages;
 	bl->buf_ring = br;
 	bl->is_mapped = 1;
+	bl->is_mmap = 0;
+	return 0;
+}
+
+static int io_alloc_pbuf_ring(struct io_uring_buf_reg *reg,
+			      struct io_buffer_list *bl)
+{
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP;
+	size_t ring_size;
+	void *ptr;
+
+	ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);
+	ptr = (void *) __get_free_pages(gfp, get_order(ring_size));
+	if (!ptr)
+		return -ENOMEM;
+
+	bl->buf_ring = ptr;
+	bl->is_mapped = 1;
+	bl->is_mmap = 1;
 	return 0;
 }
 
@@ -496,12 +528,18 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 
 	if (reg.resv[0] || reg.resv[1] || reg.resv[2])
 		return -EINVAL;
-	if (reg.flags)
-		return -EINVAL;
-	if (!reg.ring_addr)
-		return -EFAULT;
-	if (reg.ring_addr & ~PAGE_MASK)
+	if (reg.flags & ~IOU_PBUF_RING_MMAP)
 		return -EINVAL;
+	if (!(reg.flags & IOU_PBUF_RING_MMAP)) {
+		if (!reg.ring_addr)
+			return -EFAULT;
+		if (reg.ring_addr & ~PAGE_MASK)
+			return -EINVAL;
+	} else {
+		if (reg.ring_addr)
+			return -EINVAL;
+	}
+
 	if (!is_power_of_2(reg.ring_entries))
 		return -EINVAL;
 
@@ -526,17 +564,21 @@ int io_register_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 			return -ENOMEM;
 	}
 
-	ret = io_pin_pbuf_ring(&reg, bl);
-	if (ret) {
-		kfree(free_bl);
-		return ret;
-	}
+	if (!(reg.flags & IOU_PBUF_RING_MMAP))
+		ret = io_pin_pbuf_ring(&reg, bl);
+	else
+		ret = io_alloc_pbuf_ring(&reg, bl);
 
-	bl->nr_entries = reg.ring_entries;
-	bl->mask = reg.ring_entries - 1;
+	if (!ret) {
+		bl->nr_entries = reg.ring_entries;
+		bl->mask = reg.ring_entries - 1;
 
-	io_buffer_add_list(ctx, bl, reg.bgid);
-	return 0;
+		io_buffer_add_list(ctx, bl, reg.bgid);
+		return 0;
+	}
+
+	kfree(free_bl);
+	return ret;
 }
 
 int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
@@ -564,3 +606,14 @@ int io_unregister_pbuf_ring(struct io_ring_ctx *ctx, void __user *arg)
 	}
 	return 0;
 }
+
+void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid)
+{
+	struct io_buffer_list *bl;
+
+	bl = io_buffer_get_list(ctx, bgid);
+	if (!bl || !bl->is_mmap)
+		return NULL;
+
+	return bl->buf_ring;
+}
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 61b9c7dade9d..d14345ef61fc 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -26,6 +26,8 @@ struct io_buffer_list {
 
 	/* ring mapped provided buffers */
 	__u8 is_mapped;
+	/* ring mapped provided buffers, but mmap'ed by application */
+	__u8 is_mmap;
 };
 
 struct io_buffer {
@@ -53,6 +55,8 @@ unsigned int __io_put_kbuf(struct io_kiocb *req, unsigned issue_flags);
 
 void io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
 
+void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid);
+
 static inline void io_kbuf_recycle_ring(struct io_kiocb *req)
 {
 	/*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
                   ` (4 preceding siblings ...)
  2023-03-14 17:16 ` [PATCH 5/5] io_uring: add support for user mapped provided buffer ring Jens Axboe
@ 2023-03-15 20:03 ` Helge Deller
  2023-03-15 20:07   ` Helge Deller
  2023-03-15 20:11   ` Jens Axboe
  5 siblings, 2 replies; 36+ messages in thread
From: Helge Deller @ 2023-03-15 20:03 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-parisc

Hi Jens,

Thanks for doing those fixes!

On 3/14/23 18:16, Jens Axboe wrote:
> One issue that became apparent when running io_uring code on parisc is
> that for data shared between the application and the kernel, we must
> ensure that it's placed correctly to avoid aliasing issues that render
> it useless.
>
> The first patch in this series is from Helge, and ensures that the
> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
> there.
>
> Patches 2..4 are prep patches for patch 5, which adds a variant of
> ring mapped provided buffers that have the kernel allocate the memory
> for them and the application mmap() it. This brings these mapped
> buffers in line with how the SQ/CQ rings are managed too.
>
> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
> of which there is only parisc, or if SHMLBA setting archs (of which
> there are others) are impact to any degree as well...

It would be interesting to find out. I'd assume that other arches,
e.g. sparc, might have similar issues.
Have you tested your patches on other arches as well?

Helge

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 20:03 ` [PATCHSET 0/5] User mapped provided buffer rings Helge Deller
@ 2023-03-15 20:07   ` Helge Deller
  2023-03-15 20:38     ` Jens Axboe
  2023-03-15 20:11   ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-03-15 20:07 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-parisc

On 3/15/23 21:03, Helge Deller wrote:
> Hi Jens,
>
> Thanks for doing those fixes!
>
> On 3/14/23 18:16, Jens Axboe wrote:
>> One issue that became apparent when running io_uring code on parisc is
>> that for data shared between the application and the kernel, we must
>> ensure that it's placed correctly to avoid aliasing issues that render
>> it useless.
>>
>> The first patch in this series is from Helge, and ensures that the
>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>> there.
>>
>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>> ring mapped provided buffers that have the kernel allocate the memory
>> for them and the application mmap() it. This brings these mapped
>> buffers in line with how the SQ/CQ rings are managed too.
>>
>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>> of which there is only parisc, or if SHMLBA setting archs (of which
>> there are others) are impact to any degree as well...
>
> It would be interesting to find out. I'd assume that other arches,
> e.g. sparc, might have similiar issues.
> Have you tested your patches on other arches as well?

By the way, I've now tested this series on current git head on an
older parisc box (with PA8700 / PCX-W2 CPU).

Results of liburing testsuite:
Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>

Helge

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 20:03 ` [PATCHSET 0/5] User mapped provided buffer rings Helge Deller
  2023-03-15 20:07   ` Helge Deller
@ 2023-03-15 20:11   ` Jens Axboe
  1 sibling, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-15 20:11 UTC (permalink / raw)
  To: Helge Deller, io-uring, linux-parisc

On 3/15/23 2:03 PM, Helge Deller wrote:
> Hi Jens,
> 
> Thanks for doing those fixes!
> 
> On 3/14/23 18:16, Jens Axboe wrote:
>> One issue that became apparent when running io_uring code on parisc is
>> that for data shared between the application and the kernel, we must
>> ensure that it's placed correctly to avoid aliasing issues that render
>> it useless.
>>
>> The first patch in this series is from Helge, and ensures that the
>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>> there.
>>
>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>> ring mapped provided buffers that have the kernel allocate the memory
>> for them and the application mmap() it. This brings these mapped
>> buffers in line with how the SQ/CQ rings are managed too.
>>
>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>> of which there is only parisc, or if SHMLBA setting archs (of which
>> there are others) are impact to any degree as well...
> 
> It would be interesting to find out. I'd assume that other arches,
> e.g. sparc, might have similiar issues.
> Have you tested your patches on other arches as well?

I don't have any sparc boxes, unfortunately.. But yes, would be
interesting to test on sparc for sure.

I do all my testing on aarch64 and x86-64, and I know that powerpc/s390
has been tested too. But in terms of coverage and regular testing, it's
just the former two.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 20:07   ` Helge Deller
@ 2023-03-15 20:38     ` Jens Axboe
  2023-03-15 21:04       ` John David Anglin
  2023-03-15 21:18       ` Jens Axboe
  0 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-15 20:38 UTC (permalink / raw)
  To: Helge Deller, io-uring, linux-parisc

On 3/15/23 2:07 PM, Helge Deller wrote:
> On 3/15/23 21:03, Helge Deller wrote:
>> Hi Jens,
>>
>> Thanks for doing those fixes!
>>
>> On 3/14/23 18:16, Jens Axboe wrote:
>>> One issue that became apparent when running io_uring code on parisc is
>>> that for data shared between the application and the kernel, we must
>>> ensure that it's placed correctly to avoid aliasing issues that render
>>> it useless.
>>>
>>> The first patch in this series is from Helge, and ensures that the
>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>> there.
>>>
>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>> ring mapped provided buffers that have the kernel allocate the memory
>>> for them and the application mmap() it. This brings these mapped
>>> buffers in line with how the SQ/CQ rings are managed too.
>>>
>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>> there are others) are impact to any degree as well...
>>
>> It would be interesting to find out. I'd assume that other arches,
>> e.g. sparc, might have similiar issues.
>> Have you tested your patches on other arches as well?
> 
> By the way, I've now tested this series on current git head on an
> older parisc box (with PA8700 / PCX-W2 CPU).
> 
> Results of liburing testsuite:
> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>

send-zerocopy.t takes about ~20 seconds for me on modern hardware, so
that one likely just needs a longer timeout to work. Running it here on
my PA8900:

axboe@c8000 ~/g/liburing (master)> time test/send-zerocopy.t 

________________________________________________________
Executed in  115.08 secs    fish           external
   usr time   63.70 secs    1.08 millis   63.70 secs
   sys time   57.25 secs    4.26 millis   57.24 secs

which on that box is almost twice as long as the normal timeout for
the test script.

For file-verify.t, that one should work with the current tree. The issue
there is the use of registered buffers, and I added a parisc hack for
that. Maybe it's too specific to the PA8900 (the 128 byte stride). If
your tree does have:

commit 4c4fd1843bf284c0063c3a0f8822cb2d352b20c0 (origin/master, origin/HEAD, master)
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Mar 15 11:34:54 2023 -0600

    test/file-verify: add dcache sync for parisc

then please experiment with that. 64 might be the correct value here and
I just got lucky with my testing...
be interesting to see 

For the remainder, they are all related to the buffer ring, which is
what is enabled by this series. But the tests don't use that yet, so
they will fail just like they do without the patch. In the
ring-buf-alloc branch of liburing there's the start of adding helpers to
setup the buffer rings, and then we can switch them to the mmap()
approach without much trouble. It's just not done yet, I will add a
patch in there to do that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 20:38     ` Jens Axboe
@ 2023-03-15 21:04       ` John David Anglin
  2023-03-15 21:08         ` Jens Axboe
  2023-03-15 21:18       ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: John David Anglin @ 2023-03-15 21:04 UTC (permalink / raw)
  To: Jens Axboe, Helge Deller, io-uring, linux-parisc

On 2023-03-15 4:38 p.m., Jens Axboe wrote:
> For file-verify.t, that one should work with the current tree. The issue
> there is the use of registered buffers, and I added a parisc hack for
> that. Maybe it's too specific to the PA8900 (the 128 byte stride). If
> your tree does have:
The 128 byte stride is only used on PA8800 and PA8900 processors. Other PA 2.0 processors
use a 64 byte stride.  PA 1.1 processors need a 32 byte stride.

The following gcc defines are available: _PA_RISC2_0, _PA_RISC1_1 and _PA_RISC1_0.

/proc/cpuinfo provides the CPU type but I'm not aware of any easy way to access the stride value
from userspace.  It's available from the PDC_CACHE call and it's used in the kernel.
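
A build-time pick based on those defines might look roughly like the
following (the macro name is made up; a smaller stride only costs extra
flush operations, so PA 2.0 steps by 64 since PA8800/PA8900 can't be
told apart from other PA 2.0 CPUs at compile time):

#if defined(_PA_RISC2_0)
/* PA8800/PA8900 have 128 byte lines, other PA 2.0 have 64 - step by 64 */
#define TEST_DCACHE_STRIDE	64
#else
/* PA 1.1 (and PA 1.0) need a 32 byte stride */
#define TEST_DCACHE_STRIDE	32
#endif
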
>
> commit 4c4fd1843bf284c0063c3a0f8822cb2d352b20c0 (origin/master, origin/HEAD, master)
> Author: Jens Axboe<axboe@kernel.dk>
> Date:   Wed Mar 15 11:34:54 2023 -0600
>
>      test/file-verify: add dcache sync for parisc
>
> then please experiment with that. 64 might be the correct value here and
> I just got lucky with my testing...
> be interesting to see

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 21:04       ` John David Anglin
@ 2023-03-15 21:08         ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-15 21:08 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/15/23 3:04 PM, John David Anglin wrote:
> On 2023-03-15 4:38 p.m., Jens Axboe wrote:
>> For file-verify.t, that one should work with the current tree. The issue
>> there is the use of registered buffers, and I added a parisc hack for
>> that. Maybe it's too specific to the PA8900 (the 128 byte stride). If
>> your tree does have:
> The 128 byte stride is only used on PA8800 and PA8900 processors. Other PA 2.0 processors
> use a 64 byte stride.  PA 1.1 processors need a 32 byte stride.
> 
> The following gcc defines are available: _PA_RISC2_0, _PA_RISC1_1 and _PA_RISC1_0.

Ah perfect!

> /proc/cpuinfo provides the CPU type but I'm not aware of any easy way to access the stride value
> from userspace.  It's available from the PDC_CACHE call and it's used in the kernel.

model		: 9000/785/C8000 - Crestone Peak Mako+ Slow [128]

I guess that's why it worked for me. OK, will ponder how to define that,
I think just going lowest common denominator is enough for now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 20:38     ` Jens Axboe
  2023-03-15 21:04       ` John David Anglin
@ 2023-03-15 21:18       ` Jens Axboe
  2023-03-16 10:18         ` Helge Deller
  2023-03-16 19:08         ` John David Anglin
  1 sibling, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-15 21:18 UTC (permalink / raw)
  To: Helge Deller, io-uring, linux-parisc

On 3/15/23 2:38 PM, Jens Axboe wrote:
> On 3/15/23 2:07 PM, Helge Deller wrote:
>> On 3/15/23 21:03, Helge Deller wrote:
>>> Hi Jens,
>>>
>>> Thanks for doing those fixes!
>>>
>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>> One issue that became apparent when running io_uring code on parisc is
>>>> that for data shared between the application and the kernel, we must
>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>> it useless.
>>>>
>>>> The first patch in this series is from Helge, and ensures that the
>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>> there.
>>>>
>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>> for them and the application mmap() it. This brings these mapped
>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>
>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>> there are others) are impact to any degree as well...
>>>
>>> It would be interesting to find out. I'd assume that other arches,
>>> e.g. sparc, might have similiar issues.
>>> Have you tested your patches on other arches as well?
>>
>> By the way, I've now tested this series on current git head on an
>> older parisc box (with PA8700 / PCX-W2 CPU).
>>
>> Results of liburing testsuite:
>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>

If you update your liburing git copy, switch to the ring-buf-alloc branch,
then all of the above should work:

axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/buf-ring.t
axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/send_recvmsg.t 
axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/ringbuf-read.t 
axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/poll-race-mshot.t 
axboe@c8000 ~/g/liburing (ring-buf-alloc)> git describe
liburing-2.3-245-g8534193

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 21:18       ` Jens Axboe
@ 2023-03-16 10:18         ` Helge Deller
  2023-03-16 17:00           ` Jens Axboe
  2023-03-16 19:08         ` John David Anglin
  1 sibling, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-03-16 10:18 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-parisc

On 3/15/23 22:18, Jens Axboe wrote:
> On 3/15/23 2:38 PM, Jens Axboe wrote:
>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>> On 3/15/23 21:03, Helge Deller wrote:
>>>> Hi Jens,
>>>>
>>>> Thanks for doing those fixes!
>>>>
>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>> that for data shared between the application and the kernel, we must
>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>> it useless.
>>>>>
>>>>> The first patch in this series is from Helge, and ensures that the
>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>> there.
>>>>>
>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>> for them and the application mmap() it. This brings these mapped
>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>
>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>> there are others) are impact to any degree as well...
>>>>
>>>> It would be interesting to find out. I'd assume that other arches,
>>>> e.g. sparc, might have similiar issues.
>>>> Have you tested your patches on other arches as well?
>>>
>>> By the way, I've now tested this series on current git head on an
>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>
>>> Results of liburing testsuite:
>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>
> If you update your liburing git copy, switch to the ring-buf-alloc branch,
> then all of the above should work:
>
> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/buf-ring.t
> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/send_recvmsg.t
> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/ringbuf-read.t
> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/poll-race-mshot.t
> axboe@c8000 ~/g/liburing (ring-buf-alloc)> git describe
> liburing-2.3-245-g8534193

Yes, verified. All tests in that branch pass now.

Thanks!
Helge

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-16 10:18         ` Helge Deller
@ 2023-03-16 17:00           ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-16 17:00 UTC (permalink / raw)
  To: Helge Deller, io-uring, linux-parisc

On 3/16/23 4:18 AM, Helge Deller wrote:
> On 3/15/23 22:18, Jens Axboe wrote:
>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>> Hi Jens,
>>>>>
>>>>> Thanks for doing those fixes!
>>>>>
>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>> that for data shared between the application and the kernel, we must
>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>> it useless.
>>>>>>
>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>> there.
>>>>>>
>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>
>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>> there are others) are impact to any degree as well...
>>>>>
>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>> e.g. sparc, might have similiar issues.
>>>>> Have you tested your patches on other arches as well?
>>>>
>>>> By the way, I've now tested this series on current git head on an
>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>
>>>> Results of liburing testsuite:
>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>>
>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>> then all of the above should work:
>>
>> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/buf-ring.t
>> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/send_recvmsg.t
>> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/ringbuf-read.t
>> axboe@c8000 ~/g/liburing (ring-buf-alloc)> test/poll-race-mshot.t
>> axboe@c8000 ~/g/liburing (ring-buf-alloc)> git describe
>> liburing-2.3-245-g8534193
> 
> Yes, verified. All tests in that branch pass now.

Nice, thanks for re-testing!

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/5] io_uring: add support for user mapped provided buffer ring
  2023-03-14 17:16 ` [PATCH 5/5] io_uring: add support for user mapped provided buffer ring Jens Axboe
@ 2023-03-16 18:07   ` Ammar Faizi
  2023-03-16 18:42     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Ammar Faizi @ 2023-03-16 18:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Helge Deller, io-uring Mailing List, Linux Parisc Mailing List

I tried to verify the for-next build report. And I think this doesn't
look right.

On Tue, Mar 14, 2023 at 11:16:42AM -0600, Jens Axboe wrote:
> @@ -214,15 +215,27 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
>  	if (!nbufs)
>  		return 0;
>  
> -	if (bl->is_mapped && bl->buf_nr_pages) {
> -		int j;
> -
> +	if (bl->is_mapped) {
>  		i = bl->buf_ring->tail - bl->head;
                    ^^^^^^^^^^^^^^^^^^

Dereference bl->buf_ring. It implies bl->buf_ring is not NULL.

> -		for (j = 0; j < bl->buf_nr_pages; j++)
> -			unpin_user_page(bl->buf_pages[j]);
> -		kvfree(bl->buf_pages);
> -		bl->buf_pages = NULL;
> -		bl->buf_nr_pages = 0;
> +		if (bl->is_mmap) {
> +			if (bl->buf_ring) {
                        ^^^^^^^^^^^^^^^^^

A NULL check against bl->buf_ring here. If it was possible to be NULL,
wouldn't the above dereference BUG()?

-- 
Ammar Faizi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/5] io_uring: add support for user mapped provided buffer ring
  2023-03-16 18:07   ` Ammar Faizi
@ 2023-03-16 18:42     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-16 18:42 UTC (permalink / raw)
  To: Ammar Faizi
  Cc: Helge Deller, io-uring Mailing List, Linux Parisc Mailing List

On 3/16/23 12:07 PM, Ammar Faizi wrote:
> I tried to verify the for-next build report. And I think this doesn't
> look right.
> 
> On Tue, Mar 14, 2023 at 11:16:42AM -0600, Jens Axboe wrote:
>> @@ -214,15 +215,27 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
>>  	if (!nbufs)
>>  		return 0;
>>  
>> -	if (bl->is_mapped && bl->buf_nr_pages) {
>> -		int j;
>> -
>> +	if (bl->is_mapped) {
>>  		i = bl->buf_ring->tail - bl->head;
>                     ^^^^^^^^^^^^^^^^^^
> 
> Dereference bl->buf_ring. It implies bl->buf_ring is not NULL.
> 
>> -		for (j = 0; j < bl->buf_nr_pages; j++)
>> -			unpin_user_page(bl->buf_pages[j]);
>> -		kvfree(bl->buf_pages);
>> -		bl->buf_pages = NULL;
>> -		bl->buf_nr_pages = 0;
>> +		if (bl->is_mmap) {
>> +			if (bl->buf_ring) {
>                         ^^^^^^^^^^^^^^^^^
> 
> A NULL check against bl->buf_ring here. If it was possible to be NULL,
> wouldn't the above dereference BUG()?

I don't think it's possible and we should probably just remove that
latter check. If the buffer group is visible, either method will have
a valid ->buf_ring IFF is_mmap/is_mapped is set.
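
IOW, the is_mmap side of __io_remove_buffers() could shrink to something
like this (untested sketch of the idea, not a proper patch):

	if (bl->is_mmap) {
		struct page *page;

		/* buf_ring is always valid here when is_mmap is set */
		page = virt_to_head_page(bl->buf_ring);
		if (put_page_testzero(page))
			free_compound_page(page);
		bl->buf_ring = NULL;
		bl->is_mmap = 0;
	}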

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-15 21:18       ` Jens Axboe
  2023-03-16 10:18         ` Helge Deller
@ 2023-03-16 19:08         ` John David Anglin
  2023-03-16 19:46           ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: John David Anglin @ 2023-03-16 19:08 UTC (permalink / raw)
  To: Jens Axboe, Helge Deller, io-uring, linux-parisc

On 2023-03-15 5:18 p.m., Jens Axboe wrote:
> On 3/15/23 2:38 PM, Jens Axboe wrote:
>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>> On 3/15/23 21:03, Helge Deller wrote:
>>>> Hi Jens,
>>>>
>>>> Thanks for doing those fixes!
>>>>
>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>> that for data shared between the application and the kernel, we must
>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>> it useless.
>>>>>
>>>>> The first patch in this series is from Helge, and ensures that the
>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>> there.
>>>>>
>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>> for them and the application mmap() it. This brings these mapped
>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>
>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>> there are others) are impact to any degree as well...
>>>> It would be interesting to find out. I'd assume that other arches,
>>>> e.g. sparc, might have similiar issues.
>>>> Have you tested your patches on other arches as well?
>>> By the way, I've now tested this series on current git head on an
>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>
>>> Results of liburing testsuite:
>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
> If you update your liburing git copy, switch to the ring-buf-alloc branch,
> then all of the above should work:
With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
Running test poll-race-mshot.t Bad cqe res -233
Bad cqe res -233
Bad cqe res -233

There is a total lockup with no messages of any kind.

I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
are equivalently mapped.

Dave

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-16 19:08         ` John David Anglin
@ 2023-03-16 19:46           ` Jens Axboe
  2023-03-17  2:09             ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-16 19:46 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/16/23 1:08 PM, John David Anglin wrote:
> On 2023-03-15 5:18 p.m., Jens Axboe wrote:
>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>> Hi Jens,
>>>>>
>>>>> Thanks for doing those fixes!
>>>>>
>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>> that for data shared between the application and the kernel, we must
>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>> it useless.
>>>>>>
>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>> there.
>>>>>>
>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>
>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>> there are others) are impact to any degree as well...
>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>> e.g. sparc, might have similiar issues.
>>>>> Have you tested your patches on other arches as well?
>>>> By the way, I've now tested this series on current git head on an
>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>
>>>> Results of liburing testsuite:
>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>> then all of the above should work:
> With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
> Running test poll-race-mshot.t Bad cqe res -233
> Bad cqe res -233
> Bad cqe res -233
> 
> There is a total lockup with no messages of any kind.
> 
> I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
> to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
> are equivalently mapped.

We can do that, you'd just want to add that check in io_pin_pbuf_ring()
when the pages have been mapped AND we're on an arch that has those
kinds of requirements. Maybe something like the below, totally
untested...

I am puzzled where the crash is coming from, though. It should just hit
the -ENOBUFS case as it can't find a buffer, and that'd terminate that
request. Which does seem to be what is happening above, that is really
no different than an attempt to read/receive from a buffer group that
has no buffers available. So a bit puzzling on what makes your kernel
crash after that has happened, as we do have generic test cases that
exercise that explicitly.


diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index cd1d9dddf58e..73f290aca7f1 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 		return PTR_ERR(pages);
 
 	br = page_address(pages[0]);
+#ifdef SHM_COLOUR
+	if ((reg->ring_addr & (unsigned long) br) & SHM_COLOUR) {
+		int i;
+
+		for (i = 0; i < nr_pages; i++)
+			unpin_user_page(pages[i]);
+		return -EINVAL;
+	}
+#endif
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
 	bl->buf_ring = br;

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-16 19:46           ` Jens Axboe
@ 2023-03-17  2:09             ` Jens Axboe
  2023-03-17  2:17               ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-17  2:09 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/16/23 1:46 PM, Jens Axboe wrote:
> On 3/16/23 1:08 PM, John David Anglin wrote:
>> On 2023-03-15 5:18 p.m., Jens Axboe wrote:
>>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> Thanks for doing those fixes!
>>>>>>
>>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>>> that for data shared between the application and the kernel, we must
>>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>>> it useless.
>>>>>>>
>>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>>> there.
>>>>>>>
>>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>>
>>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>>> there are others) are impact to any degree as well...
>>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>>> e.g. sparc, might have similiar issues.
>>>>>> Have you tested your patches on other arches as well?
>>>>> By the way, I've now tested this series on current git head on an
>>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>>
>>>>> Results of liburing testsuite:
>>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>>> then all of the above should work:
>> With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
>> Running test poll-race-mshot.t Bad cqe res -233
>> Bad cqe res -233
>> Bad cqe res -233
>>
>> There is a total lockup with no messages of any kind.
>>
>> I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
>> to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
>> are equivalently mapped.
> 
> We can do that, you'd just want to add that check in io_pin_pbuf_ring()
> when the pages have been mapped AND we're on an arch that has those
> kinds of requirements. Maybe something like the below, totally
> untested...
> 
> I am puzzled where the crash is coming from, though. It should just hit
> the -ENOBUFS case as it can't find a buffer, and that'd terminate that
> request. Which does seem to be what is happening above, that is really
> no different than an attempt to read/receive from a buffer group that
> has no buffers available. So a bit puzzling on what makes your kernel
> crash after that has happened, as we do have generic test cases that
> exercise that explicitly.
> 
> 
> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> index cd1d9dddf58e..73f290aca7f1 100644
> --- a/io_uring/kbuf.c
> +++ b/io_uring/kbuf.c
> @@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
>  		return PTR_ERR(pages);
>  
>  	br = page_address(pages[0]);
> +#ifdef SHM_COLOUR
> +	if ((reg->ring_addr & (unsigned long) br) & SHM_COLOUR) {

& (SHM_COLOUR - 1)) {

of course...

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-17  2:09             ` Jens Axboe
@ 2023-03-17  2:17               ` Jens Axboe
  2023-03-17 15:36                 ` John David Anglin
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-17  2:17 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/16/23 8:09 PM, Jens Axboe wrote:
> On 3/16/23 1:46 PM, Jens Axboe wrote:
>> On 3/16/23 1:08 PM, John David Anglin wrote:
>>> On 2023-03-15 5:18 p.m., Jens Axboe wrote:
>>>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> Thanks for doing those fixes!
>>>>>>>
>>>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>>>> that for data shared between the application and the kernel, we must
>>>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>>>> it useless.
>>>>>>>>
>>>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>>>> there.
>>>>>>>>
>>>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>>>
>>>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>>>> there are others) are impact to any degree as well...
>>>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>>>> e.g. sparc, might have similar issues.
>>>>>>> Have you tested your patches on other arches as well?
>>>>>> By the way, I've now tested this series on current git head on an
>>>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>>>
>>>>>> Results of liburing testsuite:
>>>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>>>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>>>> then all of the above should work:
>>> With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
>>> Running test poll-race-mshot.t Bad cqe res -233
>>> Bad cqe res -233
>>> Bad cqe res -233
>>>
>>> There is a total lockup with no messages of any kind.
>>>
>>> I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
>>> to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
>>> are equivalently mapped.
>>
>> We can do that, you'd just want to add that check in io_pin_pbuf_ring()
>> when the pages have been mapped AND we're on an arch that has those
>> kinds of requirements. Maybe something like the below, totally
>> untested...
>>
>> I am puzzled where the crash is coming from, though. It should just hit
>> the -ENOBUFS case as it can't find a buffer, and that'd terminate that
>> request. Which does seem to be what is happening above, that is really
>> no different than an attempt to read/receive from a buffer group that
>> has no buffers available. So a bit puzzling on what makes your kernel
>> crash after that has happened, as we do have generic test cases that
>> exercise that explicitly.
>>
>>
>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>> index cd1d9dddf58e..73f290aca7f1 100644
>> --- a/io_uring/kbuf.c
>> +++ b/io_uring/kbuf.c
>> @@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
>>  		return PTR_ERR(pages);
>>  
>>  	br = page_address(pages[0]);
>> +#ifdef SHM_COLOUR
>> +	if ((reg->ring_addr & (unsigned long) br) & SHM_COLOUR) {
> 
> & (SHM_COLOUR - 1)) {
> 
> of course...

Full version, I think this should do the right thing. If the kernel and
app side isn't aligned on the same SHM_COLOUR boundary, we'll return
-EINVAL rather than setup the ring.

For the ring-buf-alloc branch, this is handled automatically. But we
should, as you mentioned, ensure that the kernel doesn't allow setting
something up that will not work.

Note that this is still NOT related to your hang, I honestly have no
idea what that could be. Unfortunately parisc doesn't have a lot of
debugging aids for this... Could even be a generic kernel issue. I
looked up your rp3440, and it sounds like we have basically the same
setup. I'm running a dual socket PA8900 at 1GHz.


diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index cd1d9dddf58e..7c6544456f90 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 		return PTR_ERR(pages);
 
 	br = page_address(pages[0]);
+#ifdef SHM_COLOUR
+	if ((reg->ring_addr | (unsigned long) br) & (SHM_COLOUR - 1)) {
+		int i;
+
+		for (i = 0; i < nr_pages; i++)
+			unpin_user_page(pages[i]);
+		return -EINVAL;
+	}
+#endif
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
 	bl->buf_ring = br;
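
To spell out what the new check does (illustration only, not part of the patch, and the addresses below are made up): the OR only leaves the low SHM_COLOUR-1 bits clear when both the userspace ring address and the kernel's page_address() mapping are SHM_COLOUR aligned (4 MB on parisc), so any mismatch now fails registration with the -EINVAL above. A tiny standalone sketch:

/* Illustration only, not part of the patch above. */
#include <stdio.h>

#define SHM_COLOUR 0x00400000UL		/* 4 MB on parisc */

static int same_colour_ok(unsigned long user_addr, unsigned long kern_addr)
{
	/* Zero only when neither address has bits set below SHM_COLOUR,
	 * i.e. both are 4 MB aligned and thus share a cache colour. */
	return ((user_addr | kern_addr) & (SHM_COLOUR - 1)) == 0;
}

int main(void)
{
	printf("%d\n", same_colour_ok(0x40000000UL, 0x10800000UL)); /* 1: both aligned */
	printf("%d\n", same_colour_ok(0x40010000UL, 0x10800000UL)); /* 0: user side off by 64 kB -> -EINVAL */
	return 0;
}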

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-17  2:17               ` Jens Axboe
@ 2023-03-17 15:36                 ` John David Anglin
  2023-03-17 15:57                   ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: John David Anglin @ 2023-03-17 15:36 UTC (permalink / raw)
  To: Jens Axboe, Helge Deller, io-uring, linux-parisc

On 2023-03-16 10:17 p.m., Jens Axboe wrote:
> On 3/16/23 8:09 PM, Jens Axboe wrote:
>> On 3/16/23 1:46 PM, Jens Axboe wrote:
>>> On 3/16/23 1:08 PM, John David Anglin wrote:
>>>> On 2023-03-15 5:18 p.m., Jens Axboe wrote:
>>>>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>>>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>>>>> Hi Jens,
>>>>>>>>
>>>>>>>> Thanks for doing those fixes!
>>>>>>>>
>>>>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>>>>> that for data shared between the application and the kernel, we must
>>>>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>>>>> it useless.
>>>>>>>>>
>>>>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>>>>> there.
>>>>>>>>>
>>>>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>>>>
>>>>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>>>>> there are others) are impact to any degree as well...
>>>>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>>>>> e.g. sparc, might have similar issues.
>>>>>>>> Have you tested your patches on other arches as well?
>>>>>>> By the way, I've now tested this series on current git head on an
>>>>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>>>>
>>>>>>> Results of liburing testsuite:
>>>>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>>>>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>>>>> then all of the above should work:
>>>> With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
>>>> Running test poll-race-mshot.t Bad cqe res -233
>>>> Bad cqe res -233
>>>> Bad cqe res -233
>>>>
>>>> There is a total lockup with no messages of any kind.
>>>>
>>>> I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
>>>> to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
>>>> are equivalently mapped.
>>> We can do that, you'd just want to add that check in io_pin_pbuf_ring()
>>> when the pages have been mapped AND we're on an arch that has those
>>> kinds of requirements. Maybe something like the below, totally
>>> untested...
>>>
>>> I am puzzled where the crash is coming from, though. It should just hit
>>> the -ENOBUFS case as it can't find a buffer, and that'd terminate that
>>> request. Which does seem to be what is happening above, that is really
>>> no different than an attempt to read/receive from a buffer group that
>>> has no buffers available. So a bit puzzling on what makes your kernel
>>> crash after that has happened, as we do have generic test cases that
>>> exercise that explicitly.
>>>
>>>
>>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>>> index cd1d9dddf58e..73f290aca7f1 100644
>>> --- a/io_uring/kbuf.c
>>> +++ b/io_uring/kbuf.c
>>> @@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
>>>   		return PTR_ERR(pages);
>>>   
>>>   	br = page_address(pages[0]);
>>> +#ifdef SHM_COLOUR
>>> +	if ((reg->ring_addr & (unsigned long) br) & SHM_COLOUR) {
>> & (SHM_COLOUR - 1)) {
>>
>> of course...
> Full version, I think this should do the right thing. If the kernel and
> app side isn't aligned on the same SHM_COLOUR boundary, we'll return
> -EINVAL rather than setup the ring.
>
> For the ring-buf-alloc branch, this is handled automatically. But we
> should, as you mentioned, ensure that the kernel doesn't allow setting
> something up that will not work.
>
> Note that this is still NOT related to your hang, I honestly have no
> idea what that could be. Unfortunately parisc doesn't have a lot of
> debugging aids for this... Could even be a generic kernel issue. I
> looked up your rp3440, and it sounds like we have basically the same
> setup. I'm running a dual socket PA8900 at 1GHz.
With this change, test/poll-race-mshot.t no longer crashes my rp3440.

Results on master are:
Tests timed out (2): <a4c0b3decb33.t> <send-zerocopy.t>
Tests failed (1): <fd-pass.t>

Running test buf-ring.t 0 sec [0]
Running test poll-race-mshot.t Skipped

Results on ring-buf-alloc are:
Tests timed out (2): <a4c0b3decb33.t> <send-zerocopy.t>
Tests failed (2): <buf-ring.t> <fd-pass.t>

Running test buf-ring.t register buf ring failed -22
test_full_page_reg failed
Test buf-ring.t failed with ret 1
Running test poll-race-mshot.t 4 sec

Without the change, the test/poll-race-mshot.t test causes HPMCs (High Priority Machine Checks) on my rp3440 (two processors).
The front status LED turns red and the event is logged in the hardware system log.  I looked at where
the HPMC occurred, but the locations were unrelated to io_uring.

I tried running the test under strace.  With output to the console, the test doesn't cause a crash and it more
or less exits normally (I need ^C to kill one process).  With output to a file, the system crashes and the file
is empty on reboot.

The fd-pass.t failure is new.

I don't think buf-ring.t and send_recvmsg.t actually pass on master with the change.  The tests probably
need updating.

The "Bad cqe res -233" messages are gone😁

Aside from additional server-related stuff, the rp3440 is architecturally similar to the c8000.  Both used PA8800
and PA8900 CPUs.

>
>
> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> index cd1d9dddf58e..7c6544456f90 100644
> --- a/io_uring/kbuf.c
> +++ b/io_uring/kbuf.c
> @@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
>   		return PTR_ERR(pages);
>   
>   	br = page_address(pages[0]);
> +#ifdef SHM_COLOUR
> +	if ((reg->ring_addr | (unsigned long) br) & (SHM_COLOUR - 1)) {
> +		int i;
> +
> +		for (i = 0; i < nr_pages; i++)
> +			unpin_user_page(pages[i]);
> +		return -EINVAL;
> +	}
> +#endif
>   	bl->buf_pages = pages;
>   	bl->buf_nr_pages = nr_pages;
>   	bl->buf_ring = br;
>

Dave

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-17 15:36                 ` John David Anglin
@ 2023-03-17 15:57                   ` Jens Axboe
  2023-03-17 16:15                     ` John David Anglin
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2023-03-17 15:57 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/17/23 9:36 AM, John David Anglin wrote:
> On 2023-03-16 10:17 p.m., Jens Axboe wrote:
>> On 3/16/23 8:09 PM, Jens Axboe wrote:
>>> On 3/16/23 1:46 PM, Jens Axboe wrote:
>>>> On 3/16/23 1:08 PM, John David Anglin wrote:
>>>>> On 2023-03-15 5:18 p.m., Jens Axboe wrote:
>>>>>> On 3/15/23 2:38 PM, Jens Axboe wrote:
>>>>>>> On 3/15/23 2:07 PM, Helge Deller wrote:
>>>>>>>> On 3/15/23 21:03, Helge Deller wrote:
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>> Thanks for doing those fixes!
>>>>>>>>>
>>>>>>>>> On 3/14/23 18:16, Jens Axboe wrote:
>>>>>>>>>> One issue that became apparent when running io_uring code on parisc is
>>>>>>>>>> that for data shared between the application and the kernel, we must
>>>>>>>>>> ensure that it's placed correctly to avoid aliasing issues that render
>>>>>>>>>> it useless.
>>>>>>>>>>
>>>>>>>>>> The first patch in this series is from Helge, and ensures that the
>>>>>>>>>> SQ/CQ rings are mapped appropriately. This makes io_uring actually work
>>>>>>>>>> there.
>>>>>>>>>>
>>>>>>>>>> Patches 2..4 are prep patches for patch 5, which adds a variant of
>>>>>>>>>> ring mapped provided buffers that have the kernel allocate the memory
>>>>>>>>>> for them and the application mmap() it. This brings these mapped
>>>>>>>>>> buffers in line with how the SQ/CQ rings are managed too.
>>>>>>>>>>
>>>>>>>>>> I'm not fully sure if this ONLY impacts archs that set SHM_COLOUR,
>>>>>>>>>> of which there is only parisc, or if SHMLBA setting archs (of which
>>>>>>>>>> there are others) are impact to any degree as well...
>>>>>>>>> It would be interesting to find out. I'd assume that other arches,
>>>>>>>>> e.g. sparc, might have similar issues.
>>>>>>>>> Have you tested your patches on other arches as well?
>>>>>>>> By the way, I've now tested this series on current git head on an
>>>>>>>> older parisc box (with PA8700 / PCX-W2 CPU).
>>>>>>>>
>>>>>>>> Results of liburing testsuite:
>>>>>>>> Tests timed out (1): <send-zerocopy.t> - (may not be a failure)
>>>>>>>> Tests failed (5): <buf-ring.t> <file-verify.t> <poll-race-mshot.t> <ringbuf-read.t> <send_recvmsg.t>
>>>>>> If you update your liburing git copy, switch to the ring-buf-alloc branch,
>>>>>> then all of the above should work:
>>>>> With master liburing branch, test/poll-race-mshot.t still crashes my rp3440:
>>>>> Running test poll-race-mshot.t Bad cqe res -233
>>>>> Bad cqe res -233
>>>>> Bad cqe res -233
>>>>>
>>>>> There is a total lockup with no messages of any kind.
>>>>>
>>>>> I think the io_uring code needs to reject user supplied ring buffers that are not equivalently mapped
>>>>> to the corresponding kernel pages.  Don't know if it would be possible to reallocate kernel pages so they
>>>>> are equivalently mapped.
>>>> We can do that, you'd just want to add that check in io_pin_pbuf_ring()
>>>> when the pages have been mapped AND we're on an arch that has those
>>>> kinds of requirements. Maybe something like the below, totally
>>>> untested...
>>>>
>>>> I am puzzled where the crash is coming from, though. It should just hit
>>>> the -ENOBUFS case as it can't find a buffer, and that'd terminate that
>>>> request. Which does seem to be what is happening above, that is really
>>>> no different than an attempt to read/receive from a buffer group that
>>>> has no buffers available. So a bit puzzling on what makes your kernel
>>>> crash after that has happened, as we do have generic test cases that
>>>> exercise that explicitly.
>>>>
>>>>
>>>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>>>> index cd1d9dddf58e..73f290aca7f1 100644
>>>> --- a/io_uring/kbuf.c
>>>> +++ b/io_uring/kbuf.c
>>>> @@ -491,6 +491,15 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
>>>>           return PTR_ERR(pages);
>>>>         br = page_address(pages[0]);
>>>> +#ifdef SHM_COLOUR
>>>> +    if ((reg->ring_addr & (unsigned long) br) & SHM_COLOUR) {
>>> & (SHM_COLOUR - 1)) {
>>>
>>> of course...
>> Full version, I think this should do the right thing. If the kernel and
>> app side isn't aligned on the same SHM_COLOUR boundary, we'll return
>> -EINVAL rather than setup the ring.
>>
>> For the ring-buf-alloc branch, this is handled automatically. But we
>> should, as you mentioned, ensure that the kernel doesn't allow setting
>> something up that will not work.
>>
>> Note that this is still NOT related to your hang, I honestly have no
>> idea what that could be. Unfortunately parisc doesn't have a lot of
>> debugging aids for this... Could even be a generic kernel issue. I
>> looked up your rp3440, and it sounds like we have basically the same
>> setup. I'm running a dual socket PA8900 at 1GHz.
> With this change, test/poll-race-mshot.t no longer crashes my rp3440.
> 
> Results on master are:
> Tests timed out (2): <a4c0b3decb33.t> <send-zerocopy.t>

Those just take too long on your system; they would work with a bigger timeout.

> Tests failed (1): <fd-pass.t>

This one is failing because your kernel is missing a patch that'll go
upstream today; the test checks for that fix and hence fails.

> Running test buf-ring.t 0 sec [0]
> Running test poll-race-mshot.t Skipped
> 
> Results on ring-buf-alloc are:
> Tests timed out (2): <a4c0b3decb33.t> <send-zerocopy.t>
> Tests failed (2): <buf-ring.t> <fd-pass.t>
> 
> Running test buf-ring.t register buf ring failed -22
> test_full_page_reg failed
> Test buf-ring.t failed with ret 1

The buf-ring failure with the patch from my previous message is because
it manually tries to set up a ring with an address that won't work. The
test case itself never uses the ring, it's just a basic
register/unregister test. So it would just need updating to pass on hppa
if that patch goes in; there's nothing inherently wrong here.

> Running test poll-race-mshot.t 4 sec
> 
> Without the change, the test/poll-race-mshot.t test causes HPMCs on my rp3440 (two processors).
> The front status LED turns red and the event is logged in the hardware system log.  I looked at where
> the HPMC occurred but the locations were unrelated to io_uring.
> 
> I tried running the test under strace.  With output to console, the test doesn't cause a crash and it more
> or less exits normally (need ^C to kill one process).  With output to file, system crashes and file is empty
> on reboot.
> 
> fd-pass.t fail is new.
> 
> I don't think buf-ring.t and send_recvmsg.t actually pass on master with change.  Tests probably need
> updating.
> 
> The "Bad cqe res -233" messages are gone?

Those happened because we filled the buffers on the user side, but the
kernel side didn't see them due to the aliasing issue. Which means that
ring provided buffers now work.
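
To make that concrete, here's a rough sketch of the userspace side using liburing's buffer ring helpers; the queue depth, group id, buffer size and missing error handling are all just illustrative, it's not lifted from the failing test. Userspace publishes buffers by bumping the ring tail through its own mapping; before these fixes the kernel on parisc could read a stale tail through its own mapping of the same pages, find no buffers, and fail the request with -ENOBUFS (the -233 in parisc's errno numbering):

/* Illustrative only: set up a buffer ring and publish one buffer.
 * Error handling trimmed. Build: cc sketch.c -luring */
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_buf_ring *br;
	struct io_uring_buf_reg reg = { };
	void *buf;

	io_uring_queue_init(8, &ring, 0);

	/* Userspace allocates the ring and hands its address to the kernel;
	 * this is exactly the mapping pair the aliasing fix is about. */
	posix_memalign((void **)&br, 4096, 8 * sizeof(struct io_uring_buf));
	reg.ring_addr = (unsigned long) br;
	reg.ring_entries = 8;
	reg.bgid = 1;
	io_uring_register_buf_ring(&ring, &reg, 0);

	io_uring_buf_ring_init(br);
	buf = malloc(4096);
	io_uring_buf_ring_add(br, buf, 4096, 0, io_uring_buf_ring_mask(8), 0);
	/* The tail store goes through the user mapping; the kernel reads its
	 * own mapping of the same pages, which is where a non-equivalent
	 * alias turned into a stale (empty-looking) ring. */
	io_uring_buf_ring_advance(br, 1);

	io_uring_queue_exit(&ring);
	return 0;
}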

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-17 15:57                   ` Jens Axboe
@ 2023-03-17 16:15                     ` John David Anglin
  2023-03-17 16:37                       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: John David Anglin @ 2023-03-17 16:15 UTC (permalink / raw)
  To: Jens Axboe, Helge Deller, io-uring, linux-parisc

On 2023-03-17 11:57 a.m., Jens Axboe wrote:
>> Running test buf-ring.t register buf ring failed -22
>> test_full_page_reg failed
>> Test buf-ring.t failed with ret 1
> The buf-ring failure with the patch from my previous message is because
> it manually tries to set up a ring with an address that won't work. The
> test case itself never uses the ring, it's just a basic
> register/unregister test. So would just need updating if that patch goes
> in to pass on hppa, there's nothing inherently wrong here.
>
I would suggest it.  From page F-7 of the PA-RISC 2.0 Architecture:

    All other uses of non-equivalent aliasing (including simultaneously enabling multiple non-equivalently
    aliased translations where one or more allow for write access) are prohibited, and can cause machine
    checks or silent data corruption, including data corruption of unrelated memory on unrelated pages.

Dave

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCHSET 0/5] User mapped provided buffer rings
  2023-03-17 16:15                     ` John David Anglin
@ 2023-03-17 16:37                       ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2023-03-17 16:37 UTC (permalink / raw)
  To: John David Anglin, Helge Deller, io-uring, linux-parisc

On 3/17/23 10:15 AM, John David Anglin wrote:
> On 2023-03-17 11:57 a.m., Jens Axboe wrote:
>>> Running test buf-ring.t register buf ring failed -22
>>> test_full_page_reg failed
>>> Test buf-ring.t failed with ret 1
>> The buf-ring failure with the patch from my previous message is because
>> it manually tries to set up a ring with an address that won't work. The
>> test case itself never uses the ring, it's just a basic
>> register/unregister test. So would just need updating if that patch goes
>> in to pass on hppa, there's nothing inherently wrong here.
>>
> I would suggest it.  From page F-7 of the PA-RISC 2.0 Architecture:
> 
>    All other uses of non-equivalent aliasing (including simultaneously
>    enabling multiple non-equivalently aliased translations where one
>    or more allow for write access) are prohibited, and can cause
>    machine checks or silent data corruption, including data corruption
>    of unrelated memory on unrelated pages.

I did add a patch to skip that sub-test on hppa, as there's just no way
to make that one work as it relies on manually aligning memory to
trigger an issue in an older kernel. So the test should pass now in the
liburing master branch.
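
Roughly, the skip amounts to something like the below at the start of that sub-test. Sketch only, the actual liburing change may look different in detail: T_EXIT_SKIP is the test harness' skip code, __hppa__ is the compiler's parisc predefine, and whether the sub-test returns the skip code directly or the caller filters it is glossed over here.

#if defined(__hppa__)
	/* A manually placed ring can't satisfy the SHM_COLOUR check from
	 * earlier in this thread, so this sub-test can never pass on parisc. */
	return T_EXIT_SKIP;
#endif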

I'll send out the alignment check patch and we can queue that up for
6.4.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-03-14 17:16 ` [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements Jens Axboe
@ 2023-07-12  4:43   ` matoro
  2023-07-12 16:24     ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: matoro @ 2023-07-12  4:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, deller, Linux Ia64, glaubitz, Sam James

On 2023-03-14 13:16, Jens Axboe wrote:
> From: Helge Deller <deller@gmx.de>
> 
> Some architectures have memory cache aliasing requirements (e.g. 
> parisc)
> if memory is shared between userspace and kernel. This patch fixes the
> kernel to return an aliased address when asked by userspace via mmap().
> 
> Signed-off-by: Helge Deller <deller@gmx.de>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 722624b6d0dc..3adecebbac71 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -72,6 +72,7 @@
>  #include <linux/io_uring.h>
>  #include <linux/audit.h>
>  #include <linux/security.h>
> +#include <asm/shmparam.h>
> 
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file 
> *file, struct vm_area_struct *vma)
>  	return remap_pfn_range(vma, vma->vm_start, pfn, sz, 
> vma->vm_page_prot);
>  }
> 
> +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
> +			unsigned long addr, unsigned long len,
> +			unsigned long pgoff, unsigned long flags)
> +{
> +	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> +	struct vm_unmapped_area_info info;
> +	void *ptr;
> +
> +	/*
> +	 * Do not allow to map to user-provided address to avoid breaking the
> +	 * aliasing rules. Userspace is not able to guess the offset address 
> of
> +	 * kernel kmalloc()ed memory area.
> +	 */
> +	if (addr)
> +		return -EINVAL;
> +
> +	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
> +	if (IS_ERR(ptr))
> +		return -ENOMEM;
> +
> +	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> +	info.length = len;
> +	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> +	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
> +#ifdef SHM_COLOUR
> +	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
> +#else
> +	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> +#endif
> +	info.align_offset = (unsigned long) ptr;
> +
> +	/*
> +	 * A failed mmap() very likely causes application failure,
> +	 * so fall back to the bottom-up function here. This scenario
> +	 * can happen with large stack limits and large mmap()
> +	 * allocations.
> +	 */
> +	addr = vm_unmapped_area(&info);
> +	if (offset_in_page(addr)) {
> +		info.flags = 0;
> +		info.low_limit = TASK_UNMAPPED_BASE;
> +		info.high_limit = mmap_end;
> +		addr = vm_unmapped_area(&info);
> +	}
> +
> +	return addr;
> +}
> +
>  #else /* !CONFIG_MMU */
> 
>  static int io_uring_mmap(struct file *file, struct vm_area_struct 
> *vma)
> @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops 
> = {
>  #ifndef CONFIG_MMU
>  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
>  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
> +#else
> +	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
>  #endif
>  	.poll		= io_uring_poll,
>  #ifdef CONFIG_PROC_FS

Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to
this patch in 6.4.  Unfortunately this breaks userspace programs using
io_uring; the easiest one to test is cmake with an io_uring-enabled
libuv (i.e., libuv >= 1.45.0), which will hang.

I am aware that ia64 is in a vulnerable place right now, which is why I am
keeping this spread limited.  Since this clearly involves
architecture-specific changes for parisc, is there any chance of looking
at what is required to do the same for ia64?  I looked at
0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of
SHMLBA from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA
-> SHM_COLOUR change, but it made no difference.

If hardware is necessary for testing, I can provide it, including remote 
BMC access for restarts/kernel debugging.  Any takers?

$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [eceb0b18ae34b399856a2dd1eee8c18b2341e6f0] Linux 6.3.12
git bisect good eceb0b18ae34b399856a2dd1eee8c18b2341e6f0
# status: waiting for bad commit, 1 good commit known
# bad: [59377679473491963a599bfd51cc9877492312ee] Linux 6.4.1
git bisect bad 59377679473491963a599bfd51cc9877492312ee
# good: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
git bisect good 457391b0380335d5e9a5babdec90ac53928b23b4
# bad: [cb6fe2ceb667eb78f252d473b03deb23999ab1cf] Merge tag 
'devicetree-for-6.4-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect bad cb6fe2ceb667eb78f252d473b03deb23999ab1cf
# good: [f5468bec213ec2ad3f2724e3f1714b3bc7bf1515] Merge tag 
'regmap-v6.4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
git bisect good f5468bec213ec2ad3f2724e3f1714b3bc7bf1515
# good: [207296f1a03bfead0110ffc4f192f242100ce4ff] netfilter: nf_tables: 
allow to create netdev chain without device
git bisect good 207296f1a03bfead0110ffc4f192f242100ce4ff
# good: [85d7ab2463822a4ab096c0b7b59feec962552572] Merge tag 
'for-6.4-tag' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good 85d7ab2463822a4ab096c0b7b59feec962552572
# bad: [b68ee1c6131c540a62ecd443be89c406401df091] Merge tag 'scsi-misc' 
of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad b68ee1c6131c540a62ecd443be89c406401df091
# bad: [48dc810012a6b4f4ba94073d6b7edb4f76edeb72] Merge tag 
'for-6.4/dm-changes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
git bisect bad 48dc810012a6b4f4ba94073d6b7edb4f76edeb72
# bad: [5b9a7bb72fddbc5247f56ede55d485fab7abdf92] Merge tag 
'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux
git bisect bad 5b9a7bb72fddbc5247f56ede55d485fab7abdf92
# good: [5c7ecada25d2086aee607ff7deb69e77faa4aa92] Merge tag 
'f2fs-for-6.4-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
git bisect good 5c7ecada25d2086aee607ff7deb69e77faa4aa92
# bad: [6e7248adf8f7adb5e36ec1e91efcc85a83bf8aeb] io_uring: refactor 
io_cqring_wake()
git bisect bad 6e7248adf8f7adb5e36ec1e91efcc85a83bf8aeb
# bad: [2ad57931db641f3de627023afb8147a8ec0b41dc] io_uring: rename 
trace_io_uring_submit_sqe() tracepoint
git bisect bad 2ad57931db641f3de627023afb8147a8ec0b41dc
# bad: [efba1a9e653e107577a48157b5424878c46f2285] io_uring: Move from 
hlist to io_wq_work_node
git bisect bad efba1a9e653e107577a48157b5424878c46f2285
# bad: [ba56b63242d12df088ed9a701cad320e6b306dfe] io_uring/kbuf: move 
pinning of provided buffer ring into helper
git bisect bad ba56b63242d12df088ed9a701cad320e6b306dfe
# good: [d4755e15386c38e4ae532ace5acc29fbfaee42e7] io_uring: avoid 
hashing O_DIRECT writes if the filesystem doesn't need it
git bisect good d4755e15386c38e4ae532ace5acc29fbfaee42e7
# bad: [d808459b2e31bd5123a14258a7a529995db974c8] io_uring: Adjust 
mapping wrt architecture aliasing requirements
git bisect bad d808459b2e31bd5123a14258a7a529995db974c8
# first bad commit: [d808459b2e31bd5123a14258a7a529995db974c8] io_uring: 
Adjust mapping wrt architecture aliasing requirements

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-12  4:43   ` matoro
@ 2023-07-12 16:24     ` Helge Deller
  2023-07-12 17:28       ` matoro
  0 siblings, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-07-12 16:24 UTC (permalink / raw)
  To: matoro; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

Hi Matoro,

* matoro <matoro_mailinglist_kernel@matoro.tk>:
> On 2023-03-14 13:16, Jens Axboe wrote:
> > From: Helge Deller <deller@gmx.de>
> >
> > Some architectures have memory cache aliasing requirements (e.g. parisc)
> > if memory is shared between userspace and kernel. This patch fixes the
> > kernel to return an aliased address when asked by userspace via mmap().
> >
> > Signed-off-by: Helge Deller <deller@gmx.de>
> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > ---
> >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> >
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 722624b6d0dc..3adecebbac71 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -72,6 +72,7 @@
> >  #include <linux/io_uring.h>
> >  #include <linux/audit.h>
> >  #include <linux/security.h>
> > +#include <asm/shmparam.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
> > *file, struct vm_area_struct *vma)
> >  	return remap_pfn_range(vma, vma->vm_start, pfn, sz,
> > vma->vm_page_prot);
> >  }
> >
> > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
> > +			unsigned long addr, unsigned long len,
> > +			unsigned long pgoff, unsigned long flags)
> > +{
> > +	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> > +	struct vm_unmapped_area_info info;
> > +	void *ptr;
> > +
> > +	/*
> > +	 * Do not allow to map to user-provided address to avoid breaking the
> > +	 * aliasing rules. Userspace is not able to guess the offset address
> > of
> > +	 * kernel kmalloc()ed memory area.
> > +	 */
> > +	if (addr)
> > +		return -EINVAL;
> > +
> > +	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
> > +	if (IS_ERR(ptr))
> > +		return -ENOMEM;
> > +
> > +	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> > +	info.length = len;
> > +	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> > +	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
> > +#ifdef SHM_COLOUR
> > +	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
> > +#else
> > +	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> > +#endif
> > +	info.align_offset = (unsigned long) ptr;
> > +
> > +	/*
> > +	 * A failed mmap() very likely causes application failure,
> > +	 * so fall back to the bottom-up function here. This scenario
> > +	 * can happen with large stack limits and large mmap()
> > +	 * allocations.
> > +	 */
> > +	addr = vm_unmapped_area(&info);
> > +	if (offset_in_page(addr)) {
> > +		info.flags = 0;
> > +		info.low_limit = TASK_UNMAPPED_BASE;
> > +		info.high_limit = mmap_end;
> > +		addr = vm_unmapped_area(&info);
> > +	}
> > +
> > +	return addr;
> > +}
> > +
> >  #else /* !CONFIG_MMU */
> >
> >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
> > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
> > = {
> >  #ifndef CONFIG_MMU
> >  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
> >  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
> > +#else
> > +	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
> >  #endif
> >  	.poll		= io_uring_poll,
> >  #ifdef CONFIG_PROC_FS
>
> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to this
> patch in 6.4.  Unfortunately this breaks userspace programs using io_uring,
> the easiest one to test is cmake with an io_uring enabled libuv (i.e., libuv
> >= 1.45.0) which will hang.
>
> I am aware that ia64 is in a vulnerable place right now, which is why I am
> keeping this spread limited.  Since this clearly involves
> architecture-specific changes for parisc,

it isn't so much architecture-specific... (just one ifdef)

> is there any chance of looking at
> what is required to do the same for ia64?  I looked at
> 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of SHMLBA
> from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
> SHM_COLOUR change, but it made no difference.
>
> If hardware is necessary for testing, I can provide it, including remote BMC
> access for restarts/kernel debugging.  Any takers?

I won't have time to test myself, but maybe you could test?

Basically we should try to find out why io_uring_mmu_get_unmapped_area()
doesn't return valid addresses, while arch_get_unmapped_area()
[in arch/ia64/kernel/sys_ia64.c] does.

You could apply this patch first:
It introduces a memory leak (as it requests memory twice), but maybe we
get an idea?
The ia64 arch_get_unmapped_area() searches for memory from the bottom
(flags=0), while the io_uring function tries top-down first. Maybe that's
the problem. And I don't understand the offset_in_page() check right
now.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..93b1964d2bbb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3431,13 +3431,17 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
 	 * can happen with large stack limits and large mmap()
 	 * allocations.
 	 */
+/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c */
 	addr = vm_unmapped_area(&info);
-	if (offset_in_page(addr)) {
+printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
+	addr = NULL;
+	if (!addr) {
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = mmap_end;
 		addr = vm_unmapped_area(&info);
 	}
+printk("io_uring_mmu_get_unmapped_area() returns address %px\n", addr);

 	return addr;
 }


Another option is to disable the call to io_uring_nommu_get_unmapped_area())
with the next patch. Maybe you could add printks() to ia64's arch_get_unmapped_area()
and check what it returns there?

@@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops = {
 #ifndef CONFIG_MMU
 	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
 	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
+#elif 0    /* IS_ENABLED(CONFIG_IA64) */
+	.get_unmapped_area = NULL,
 #else
 	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
 #endif

Helge

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-12 16:24     ` Helge Deller
@ 2023-07-12 17:28       ` matoro
  2023-07-12 19:05         ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: matoro @ 2023-07-12 17:28 UTC (permalink / raw)
  To: Helge Deller; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 2023-07-12 12:24, Helge Deller wrote:
> Hi Matoro,
> 
> * matoro <matoro_mailinglist_kernel@matoro.tk>:
>> On 2023-03-14 13:16, Jens Axboe wrote:
>> > From: Helge Deller <deller@gmx.de>
>> >
>> > Some architectures have memory cache aliasing requirements (e.g. parisc)
>> > if memory is shared between userspace and kernel. This patch fixes the
>> > kernel to return an aliased address when asked by userspace via mmap().
>> >
>> > Signed-off-by: Helge Deller <deller@gmx.de>
>> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> > ---
>> >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 51 insertions(+)
>> >
>> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> > index 722624b6d0dc..3adecebbac71 100644
>> > --- a/io_uring/io_uring.c
>> > +++ b/io_uring/io_uring.c
>> > @@ -72,6 +72,7 @@
>> >  #include <linux/io_uring.h>
>> >  #include <linux/audit.h>
>> >  #include <linux/security.h>
>> > +#include <asm/shmparam.h>
>> >
>> >  #define CREATE_TRACE_POINTS
>> >  #include <trace/events/io_uring.h>
>> > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
>> > *file, struct vm_area_struct *vma)
>> >  	return remap_pfn_range(vma, vma->vm_start, pfn, sz,
>> > vma->vm_page_prot);
>> >  }
>> >
>> > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>> > +			unsigned long addr, unsigned long len,
>> > +			unsigned long pgoff, unsigned long flags)
>> > +{
>> > +	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>> > +	struct vm_unmapped_area_info info;
>> > +	void *ptr;
>> > +
>> > +	/*
>> > +	 * Do not allow to map to user-provided address to avoid breaking the
>> > +	 * aliasing rules. Userspace is not able to guess the offset address
>> > of
>> > +	 * kernel kmalloc()ed memory area.
>> > +	 */
>> > +	if (addr)
>> > +		return -EINVAL;
>> > +
>> > +	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>> > +	if (IS_ERR(ptr))
>> > +		return -ENOMEM;
>> > +
>> > +	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>> > +	info.length = len;
>> > +	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>> > +	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>> > +#ifdef SHM_COLOUR
>> > +	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>> > +#else
>> > +	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>> > +#endif
>> > +	info.align_offset = (unsigned long) ptr;
>> > +
>> > +	/*
>> > +	 * A failed mmap() very likely causes application failure,
>> > +	 * so fall back to the bottom-up function here. This scenario
>> > +	 * can happen with large stack limits and large mmap()
>> > +	 * allocations.
>> > +	 */
>> > +	addr = vm_unmapped_area(&info);
>> > +	if (offset_in_page(addr)) {
>> > +		info.flags = 0;
>> > +		info.low_limit = TASK_UNMAPPED_BASE;
>> > +		info.high_limit = mmap_end;
>> > +		addr = vm_unmapped_area(&info);
>> > +	}
>> > +
>> > +	return addr;
>> > +}
>> > +
>> >  #else /* !CONFIG_MMU */
>> >
>> >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>> > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
>> > = {
>> >  #ifndef CONFIG_MMU
>> >  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
>> >  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
>> > +#else
>> > +	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
>> >  #endif
>> >  	.poll		= io_uring_poll,
>> >  #ifdef CONFIG_PROC_FS
>> 
>> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to 
>> this
>> patch in 6.4.  Unfortunately this breaks userspace programs using 
>> io_uring,
>> the easiest one to test is cmake with an io_uring enabled libuv (i.e., 
>> libuv
>> >= 1.45.0) which will hang.
>> 
>> I am aware that ia64 is in a vulnerable place right now, which is why I 
>> am
>> keeping this spread limited.  Since this clearly involves
>> architecture-specific changes for parisc,
> 
> it isn't so much architecture-specific... (just one ifdef)
> 
>> is there any chance of looking at
>> what is required to do the same for ia64?  I looked at
>> 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of 
>> SHMLBA
>> from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
>> SHM_COLOUR change, but it made no difference.
>> 
>> If hardware is necessary for testing, I can provide it, including 
>> remote BMC
>> access for restarts/kernel debugging.  Any takers?
> 
> I won't have time to test myself, but maybe you could test?
> 
> Basically we should try to find out why 
> io_uring_mmu_get_unmapped_area()
> doesn't return valid addresses, while arch_get_unmapped_area()
> [in arch/ia64/kernel/sys_ia64.c] does.
> 
> You could apply this patch first:
> It introduces a memory leak (as it requests memory twice), but maybe we
> get an idea?
> The ia64 arch_get_unmapped_area() searches for memory from bottom
> (flags=0), while io_uring function tries top-down first. Maybe that's
> the problem. And I don't understand the offset_in_page() check right
> now.
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 3bca7a79efda..93b1964d2bbb 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3431,13 +3431,17 @@ static unsigned long 
> io_uring_mmu_get_unmapped_area(struct file *filp,
>  	 * can happen with large stack limits and large mmap()
>  	 * allocations.
>  	 */
> +/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c 
> */
>  	addr = vm_unmapped_area(&info);
> -	if (offset_in_page(addr)) {
> +printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
> +	addr = NULL;
> +	if (!addr) {
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
>  		info.high_limit = mmap_end;
>  		addr = vm_unmapped_area(&info);
>  	}
> +printk("io_uring_mmu_get_unmapped_area() returns address %px\n", 
> addr);
> 
>  	return addr;
>  }
> 
> 
> Another option is to disable the call to 
> io_uring_nommu_get_unmapped_area())
> with the next patch. Maybe you could add printks() to ia64's 
> arch_get_unmapped_area()
> and check what it returns there?
> 
> @@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops 
> = {
>  #ifndef CONFIG_MMU
>  	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
>  	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
> +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
> +	.get_unmapped_area = NULL,
>  #else
>  	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
>  #endif
> 
> Helge

Thanks Helge.  Sample output from that first patch:

[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
is: 1ffffffffff40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
address 2000000001e40000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
is: 1ffffffffff20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
address 2000000001f20000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
is: 1ffffffffff30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
address 2000000001f30000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
is: 1ffffffffff90000
[Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
address 2000000001f90000

This pattern seems to be pretty stable; I also tried directly
returning the result of a call to arch_get_unmapped_area() at the end of
the function, and the result looks similar:

[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return 
address 1ffffffffffd0000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
address 2000000001f00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return 
address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
address 1ffffffffff00000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return 
address 1fffffffffe20000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
address 2000000002000000
[Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return 
address 1fffffffffe30000
[Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
address 2000000002100000

Is that enough of a clue to go on?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-12 17:28       ` matoro
@ 2023-07-12 19:05         ` Helge Deller
  2023-07-12 20:30           ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-07-12 19:05 UTC (permalink / raw)
  To: matoro; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 7/12/23 19:28, matoro wrote:
> On 2023-07-12 12:24, Helge Deller wrote:
>> Hi Matoro,
>>
>> * matoro <matoro_mailinglist_kernel@matoro.tk>:
>>> On 2023-03-14 13:16, Jens Axboe wrote:
>>> > From: Helge Deller <deller@gmx.de>
>>> >
>>> > Some architectures have memory cache aliasing requirements (e.g. parisc)
>>> > if memory is shared between userspace and kernel. This patch fixes the
>>> > kernel to return an aliased address when asked by userspace via mmap().
>>> >
>>> > Signed-off-by: Helge Deller <deller@gmx.de>
>>> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> > ---
>>> >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>>> >  1 file changed, 51 insertions(+)
>>> >
>>> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> > index 722624b6d0dc..3adecebbac71 100644
>>> > --- a/io_uring/io_uring.c
>>> > +++ b/io_uring/io_uring.c
>>> > @@ -72,6 +72,7 @@
>>> >  #include <linux/io_uring.h>
>>> >  #include <linux/audit.h>
>>> >  #include <linux/security.h>
>>> > +#include <asm/shmparam.h>
>>> >
>>> >  #define CREATE_TRACE_POINTS
>>> >  #include <trace/events/io_uring.h>
>>> > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
>>> > *file, struct vm_area_struct *vma)
>>> >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
>>> > vma->vm_page_prot);
>>> >  }
>>> >
>>> > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>> > +            unsigned long addr, unsigned long len,
>>> > +            unsigned long pgoff, unsigned long flags)
>>> > +{
>>> > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>>> > +    struct vm_unmapped_area_info info;
>>> > +    void *ptr;
>>> > +
>>> > +    /*
>>> > +     * Do not allow to map to user-provided address to avoid breaking the
>>> > +     * aliasing rules. Userspace is not able to guess the offset address
>>> > of
>>> > +     * kernel kmalloc()ed memory area.
>>> > +     */
>>> > +    if (addr)
>>> > +        return -EINVAL;
>>> > +
>>> > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>>> > +    if (IS_ERR(ptr))
>>> > +        return -ENOMEM;
>>> > +
>>> > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>> > +    info.length = len;
>>> > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>>> > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>>> > +#ifdef SHM_COLOUR
>>> > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>>> > +#else
>>> > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>>> > +#endif
>>> > +    info.align_offset = (unsigned long) ptr;
>>> > +
>>> > +    /*
>>> > +     * A failed mmap() very likely causes application failure,
>>> > +     * so fall back to the bottom-up function here. This scenario
>>> > +     * can happen with large stack limits and large mmap()
>>> > +     * allocations.
>>> > +     */
>>> > +    addr = vm_unmapped_area(&info);
>>> > +    if (offset_in_page(addr)) {
>>> > +        info.flags = 0;
>>> > +        info.low_limit = TASK_UNMAPPED_BASE;
>>> > +        info.high_limit = mmap_end;
>>> > +        addr = vm_unmapped_area(&info);
>>> > +    }
>>> > +
>>> > +    return addr;
>>> > +}
>>> > +
>>> >  #else /* !CONFIG_MMU */
>>> >
>>> >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>>> > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
>>> > = {
>>> >  #ifndef CONFIG_MMU
>>> >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>> >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>> > +#else
>>> > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>> >  #endif
>>> >      .poll        = io_uring_poll,
>>> >  #ifdef CONFIG_PROC_FS
>>>
>>> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to this
>>> patch in 6.4.  Unfortunately this breaks userspace programs using io_uring,
>>> the easiest one to test is cmake with an io_uring enabled libuv (i.e., libuv
>>> >= 1.45.0) which will hang.
>>>
>>> I am aware that ia64 is in a vulnerable place right now, which is why I am
>>> keeping this spread limited.  Since this clearly involves
>>> architecture-specific changes for parisc,
>>
>> it isn't so much architecture-specific... (just one ifdef)
>>
>>> is there any chance of looking at
>>> what is required to do the same for ia64?  I looked at
>>> 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of SHMLBA
>>> from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
>>> SHM_COLOUR change, but it made no difference.
>>>
>>> If hardware is necessary for testing, I can provide it, including remote BMC
>>> access for restarts/kernel debugging.  Any takers?
>>
>> I won't have time to test myself, but maybe you could test?
>>
>> Basically we should try to find out why io_uring_mmu_get_unmapped_area()
>> doesn't return valid addresses, while arch_get_unmapped_area()
>> [in arch/ia64/kernel/sys_ia64.c] does.
>>
>> You could apply this patch first:
>> It introduces a memory leak (as it requests memory twice), but maybe we
>> get an idea?
>> The ia64 arch_get_unmapped_area() searches for memory from bottom
>> (flags=0), while io_uring function tries top-down first. Maybe that's
>> the problem. And I don't understand the offset_in_page() check right
>> now.
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 3bca7a79efda..93b1964d2bbb 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3431,13 +3431,17 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>       * can happen with large stack limits and large mmap()
>>       * allocations.
>>       */
>> +/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c */
>>      addr = vm_unmapped_area(&info);
>> -    if (offset_in_page(addr)) {
>> +printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
>> +    addr = NULL;
>> +    if (!addr) {
>>          info.flags = 0;
>>          info.low_limit = TASK_UNMAPPED_BASE;
>>          info.high_limit = mmap_end;
>>          addr = vm_unmapped_area(&info);
>>      }
>> +printk("io_uring_mmu_get_unmapped_area() returns address %px\n", addr);
>>
>>      return addr;
>>  }
>>
>>
>> Another option is to disable the call to io_uring_nommu_get_unmapped_area())
>> with the next patch. Maybe you could add printks() to ia64's arch_get_unmapped_area()
>> and check what it returns there?
>>
>> @@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops = {
>>  #ifndef CONFIG_MMU
>>      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>> +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
>> +    .get_unmapped_area = NULL,
>>  #else
>>      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>  #endif
>>
>> Helge
>
> Thanks Helge.  Sample output from that first patch:
>
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff40000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001e40000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff20000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f20000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff30000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f30000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff90000
> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f90000
>
> This pattern seems to be pretty stable, I tried instead just directly returning the result of a call to arch_get_unmapped_area() at the end of the function and it seems similar:
>
> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffffd0000
> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000001f00000
> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffff00000
> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 1ffffffffff00000
> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe20000
> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002000000
> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe30000
> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002100000
>
> Is that enough of a clue to go on?

SHMLBA on ia64 is 0x100000:
arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
but the values returned by io_uring_mmu_get_unmapped_area() do not fulfill this.

So, probably ia64's SHMLBA isn't pulled in correctly in io_uring/io_uring.c.
Check value of this line:
	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);

You could also add
#define SHM_COLOUR  0x100000
in front of the
	#ifdef SHM_COLOUR
(define SHM_COLOUR in io_uring/kbuf.c too).
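
For reference, assuming 64 kB pages (which the 0x...0000 addresses above suggest), that align_mask line should work out as in this little userspace sketch; if the colouring were applied correctly, the returned addresses would match align_offset in exactly those bits:

/* Quick userspace sanity check of the expected ia64 values; assumes 64 kB pages. */
#include <stdio.h>

int main(void)
{
	unsigned long page_size  = 0x10000UL;              /* 64 kB PAGE_SIZE */
	unsigned long page_mask  = ~(page_size - 1);       /* PAGE_MASK */
	unsigned long shmlba     = 1024 * 1024;            /* ia64 SHMLBA */
	unsigned long align_mask = page_mask & (shmlba - 1UL);

	/* vm_unmapped_area() is supposed to return an addr with
	 * (addr & align_mask) == (align_offset & align_mask), i.e. the same
	 * 1 MB colour as the kernel pointer passed in align_offset. */
	printf("align_mask = %#lx\n", align_mask);         /* expect 0xf0000 */
	return 0;
}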

Helge

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-12 19:05         ` Helge Deller
@ 2023-07-12 20:30           ` Helge Deller
  2023-07-13  0:35             ` matoro
  0 siblings, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-07-12 20:30 UTC (permalink / raw)
  To: matoro; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 7/12/23 21:05, Helge Deller wrote:
> On 7/12/23 19:28, matoro wrote:
>> On 2023-07-12 12:24, Helge Deller wrote:
>>> Hi Matoro,
>>>
>>> * matoro <matoro_mailinglist_kernel@matoro.tk>:
>>>> On 2023-03-14 13:16, Jens Axboe wrote:
>>>> > From: Helge Deller <deller@gmx.de>
>>>> >
>>>> > Some architectures have memory cache aliasing requirements (e.g. parisc)
>>>> > if memory is shared between userspace and kernel. This patch fixes the
>>>> > kernel to return an aliased address when asked by userspace via mmap().
>>>> >
>>>> > Signed-off-by: Helge Deller <deller@gmx.de>
>>>> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> > ---
>>>> >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>>>> >  1 file changed, 51 insertions(+)
>>>> >
>>>> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>> > index 722624b6d0dc..3adecebbac71 100644
>>>> > --- a/io_uring/io_uring.c
>>>> > +++ b/io_uring/io_uring.c
>>>> > @@ -72,6 +72,7 @@
>>>> >  #include <linux/io_uring.h>
>>>> >  #include <linux/audit.h>
>>>> >  #include <linux/security.h>
>>>> > +#include <asm/shmparam.h>
>>>> >
>>>> >  #define CREATE_TRACE_POINTS
>>>> >  #include <trace/events/io_uring.h>
>>>> > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
>>>> > *file, struct vm_area_struct *vma)
>>>> >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
>>>> > vma->vm_page_prot);
>>>> >  }
>>>> >
>>>> > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>>> > +            unsigned long addr, unsigned long len,
>>>> > +            unsigned long pgoff, unsigned long flags)
>>>> > +{
>>>> > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>>>> > +    struct vm_unmapped_area_info info;
>>>> > +    void *ptr;
>>>> > +
>>>> > +    /*
>>>> > +     * Do not allow to map to user-provided address to avoid breaking the
>>>> > +     * aliasing rules. Userspace is not able to guess the offset address
>>>> > of
>>>> > +     * kernel kmalloc()ed memory area.
>>>> > +     */
>>>> > +    if (addr)
>>>> > +        return -EINVAL;
>>>> > +
>>>> > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>>>> > +    if (IS_ERR(ptr))
>>>> > +        return -ENOMEM;
>>>> > +
>>>> > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>>> > +    info.length = len;
>>>> > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>>>> > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>>>> > +#ifdef SHM_COLOUR
>>>> > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>>>> > +#else
>>>> > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>>>> > +#endif
>>>> > +    info.align_offset = (unsigned long) ptr;
>>>> > +
>>>> > +    /*
>>>> > +     * A failed mmap() very likely causes application failure,
>>>> > +     * so fall back to the bottom-up function here. This scenario
>>>> > +     * can happen with large stack limits and large mmap()
>>>> > +     * allocations.
>>>> > +     */
>>>> > +    addr = vm_unmapped_area(&info);
>>>> > +    if (offset_in_page(addr)) {
>>>> > +        info.flags = 0;
>>>> > +        info.low_limit = TASK_UNMAPPED_BASE;
>>>> > +        info.high_limit = mmap_end;
>>>> > +        addr = vm_unmapped_area(&info);
>>>> > +    }
>>>> > +
>>>> > +    return addr;
>>>> > +}
>>>> > +
>>>> >  #else /* !CONFIG_MMU */
>>>> >
>>>> >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>>>> > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
>>>> > = {
>>>> >  #ifndef CONFIG_MMU
>>>> >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>>> >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>>> > +#else
>>>> > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>>> >  #endif
>>>> >      .poll        = io_uring_poll,
>>>> >  #ifdef CONFIG_PROC_FS
>>>>
>>>> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 to this
>>>> patch in 6.4.  Unfortunately this breaks userspace programs using io_uring,
>>>> the easiest one to test is cmake with an io_uring enabled libuv (i.e., libuv
>>>> >= 1.45.0) which will hang.
>>>>
>>>> I am aware that ia64 is in a vulnerable place right now which I why I am
>>>> keeping this spread limited.  Since this clearly involves
>>>> architecture-specific changes for parisc,
>>>
>>> it isn't so much architecture-specific... (just one ifdef)
>>>
>>>> is there any chance of looking at
>>>> what is required to do the same for ia64?  I looked at
>>>> 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of SHMLBA
>>>> from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
>>>> SHM_COLOUR change, but it made no difference.
>>>>
>>>> If hardware is necessary for testing, I can provide it, including remote BMC
>>>> access for restarts/kernel debugging.  Any takers?
>>>
>>> I won't have time to test myself, but maybe you could test?
>>>
>>> Basically we should try to find out why io_uring_mmu_get_unmapped_area()
>>> doesn't return valid addresses, while arch_get_unmapped_area()
>>> [in arch/ia64/kernel/sys_ia64.c] does.
>>>
>>> You could apply this patch first:
>>> It introduces a memory leak (as it requests memory twice), but maybe we
>>> get an idea?
>>> The ia64 arch_get_unmapped_area() searches for memory from bottom
>>> (flags=0), while io_uring function tries top-down first. Maybe that's
>>> the problem. And I don't understand the offset_in_page() check right
>>> now.
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index 3bca7a79efda..93b1964d2bbb 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -3431,13 +3431,17 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>>       * can happen with large stack limits and large mmap()
>>>       * allocations.
>>>       */
>>> +/* compare to arch_get_unmapped_area() in arch/ia64/kernel/sys_ia64.c */
>>>      addr = vm_unmapped_area(&info);
>>> -    if (offset_in_page(addr)) {
>>> +printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", addr);
>>> +    addr = NULL;
>>> +    if (!addr) {
>>>          info.flags = 0;
>>>          info.low_limit = TASK_UNMAPPED_BASE;
>>>          info.high_limit = mmap_end;
>>>          addr = vm_unmapped_area(&info);
>>>      }
>>> +printk("io_uring_mmu_get_unmapped_area() returns address %px\n", addr);
>>>
>>>      return addr;
>>>  }
>>>
>>>
>>> Another option is to disable the call to io_uring_nommu_get_unmapped_area())
>>> with the next patch. Maybe you could add printks() to ia64's arch_get_unmapped_area()
>>> and check what it returns there?
>>>
>>> @@ -3654,6 +3658,8 @@ static const struct file_operations io_uring_fops = {
>>>  #ifndef CONFIG_MMU
>>>      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>>      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>> +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
>>> +    .get_unmapped_area = NULL,
>>>  #else
>>>      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>>  #endif
>>>
>>> Helge
>>
>> Thanks Helge.  Sample output from that first patch:
>>
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff40000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001e40000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff20000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f20000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff30000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f30000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 is: 1ffffffffff90000
>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns address 2000000001f90000
>>
>> This pattern seems to be pretty stable, I tried instead just directly returning the result of a call to arch_get_unmapped_area() at the end of the function and it seems similar:
>>
>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffffd0000
>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000001f00000
>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1ffffffffff00000
>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 1ffffffffff00000
>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe20000
>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002000000
>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would return address 1fffffffffe30000
>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return address 2000000002100000
>>
>> Is that enough of a clue to go on?
>
> SHMLBA on ia64 is 0x100000:
> arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
> but the values returned by io_uring_mmu_get_unmapped_area() does not fullfill this.
>
> So, probably ia64's SHMLBA isn't pulled in correctly in io_uring/io_uring.c.
> Check value of this line:
>      info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>
> You could also add
> #define SHM_COLOUR  0x100000
> in front of the
>      #ifdef SHM_COLOUR
> (define SHM_COLOUR in io_uring/kbuf.c too).

What is the value of PAGE_SIZE and "ptr" on your machine?
For 4k page size I get:
SHMLBA -1   ->        FFFFF
PAGE_MASK   -> FFFFFFFFF000
so,
info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
You could try to set info.align_mask = 0xfffff;
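
For illustration, a tiny userspace sketch (assuming only the ia64 SHMLBA
value quoted above; nothing here is taken from kernel headers) that prints
the resulting align_mask for both a 4 KiB and a 64 KiB page size:

#include <stdio.h>

int main(void)
{
	unsigned long shmlba = 1024 * 1024;		/* ia64 SHMLBA */
	unsigned long page_sizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		unsigned long page_mask = ~(page_sizes[i] - 1);
		/* same expression as info.align_mask in the patch */
		unsigned long align_mask = page_mask & (shmlba - 1UL);
		printf("PAGE_SIZE %lu -> align_mask 0x%lx\n",
		       page_sizes[i], align_mask);
	}
	return 0;
}

This prints 0xff000 for 4 KiB pages (the value above) and 0xf0000 for
64 KiB pages, i.e. with 64 KiB pages only four bits of colour alignment
remain in the mask.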

Helge

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-12 20:30           ` Helge Deller
@ 2023-07-13  0:35             ` matoro
  2023-07-13  7:27               ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: matoro @ 2023-07-13  0:35 UTC (permalink / raw)
  To: Helge Deller; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 2023-07-12 16:30, Helge Deller wrote:
> On 7/12/23 21:05, Helge Deller wrote:
>> On 7/12/23 19:28, matoro wrote:
>>> On 2023-07-12 12:24, Helge Deller wrote:
>>>> Hi Matoro,
>>>> 
>>>> * matoro <matoro_mailinglist_kernel@matoro.tk>:
>>>>> On 2023-03-14 13:16, Jens Axboe wrote:
>>>>> > From: Helge Deller <deller@gmx.de>
>>>>> >
>>>>> > Some architectures have memory cache aliasing requirements (e.g. parisc)
>>>>> > if memory is shared between userspace and kernel. This patch fixes the
>>>>> > kernel to return an aliased address when asked by userspace via mmap().
>>>>> >
>>>>> > Signed-off-by: Helge Deller <deller@gmx.de>
>>>>> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> > ---
>>>>> >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>>>>> >  1 file changed, 51 insertions(+)
>>>>> >
>>>>> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>> > index 722624b6d0dc..3adecebbac71 100644
>>>>> > --- a/io_uring/io_uring.c
>>>>> > +++ b/io_uring/io_uring.c
>>>>> > @@ -72,6 +72,7 @@
>>>>> >  #include <linux/io_uring.h>
>>>>> >  #include <linux/audit.h>
>>>>> >  #include <linux/security.h>
>>>>> > +#include <asm/shmparam.h>
>>>>> >
>>>>> >  #define CREATE_TRACE_POINTS
>>>>> >  #include <trace/events/io_uring.h>
>>>>> > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
>>>>> > *file, struct vm_area_struct *vma)
>>>>> >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
>>>>> > vma->vm_page_prot);
>>>>> >  }
>>>>> >
>>>>> > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>>>>> > +            unsigned long addr, unsigned long len,
>>>>> > +            unsigned long pgoff, unsigned long flags)
>>>>> > +{
>>>>> > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>>>>> > +    struct vm_unmapped_area_info info;
>>>>> > +    void *ptr;
>>>>> > +
>>>>> > +    /*
>>>>> > +     * Do not allow to map to user-provided address to avoid breaking the
>>>>> > +     * aliasing rules. Userspace is not able to guess the offset address
>>>>> > of
>>>>> > +     * kernel kmalloc()ed memory area.
>>>>> > +     */
>>>>> > +    if (addr)
>>>>> > +        return -EINVAL;
>>>>> > +
>>>>> > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>>>>> > +    if (IS_ERR(ptr))
>>>>> > +        return -ENOMEM;
>>>>> > +
>>>>> > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>>>>> > +    info.length = len;
>>>>> > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>>>>> > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>>>>> > +#ifdef SHM_COLOUR
>>>>> > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>>>>> > +#else
>>>>> > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>>>>> > +#endif
>>>>> > +    info.align_offset = (unsigned long) ptr;
>>>>> > +
>>>>> > +    /*
>>>>> > +     * A failed mmap() very likely causes application failure,
>>>>> > +     * so fall back to the bottom-up function here. This scenario
>>>>> > +     * can happen with large stack limits and large mmap()
>>>>> > +     * allocations.
>>>>> > +     */
>>>>> > +    addr = vm_unmapped_area(&info);
>>>>> > +    if (offset_in_page(addr)) {
>>>>> > +        info.flags = 0;
>>>>> > +        info.low_limit = TASK_UNMAPPED_BASE;
>>>>> > +        info.high_limit = mmap_end;
>>>>> > +        addr = vm_unmapped_area(&info);
>>>>> > +    }
>>>>> > +
>>>>> > +    return addr;
>>>>> > +}
>>>>> > +
>>>>> >  #else /* !CONFIG_MMU */
>>>>> >
>>>>> >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>>>>> > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
>>>>> > = {
>>>>> >  #ifndef CONFIG_MMU
>>>>> >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>>>> >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>>>> > +#else
>>>>> > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>>>> >  #endif
>>>>> >      .poll        = io_uring_poll,
>>>>> >  #ifdef CONFIG_PROC_FS
>>>>> 
>>>>> Hi Jens, Helge - I've bisected a regression with io_uring on ia64 
>>>>> to this
>>>>> patch in 6.4.  Unfortunately this breaks userspace programs using 
>>>>> io_uring,
>>>>> the easiest one to test is cmake with an io_uring enabled libuv 
>>>>> (i.e., libuv
>>>>> >= 1.45.0) which will hang.
>>>>> 
>>>>> I am aware that ia64 is in a vulnerable place right now which I why 
>>>>> I am
>>>>> keeping this spread limited.  Since this clearly involves
>>>>> architecture-specific changes for parisc,
>>>> 
>>>> it isn't so much architecture-specific... (just one ifdef)
>>>> 
>>>>> is there any chance of looking at
>>>>> what is required to do the same for ia64?  I looked at
>>>>> 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc: change value of 
>>>>> SHMLBA
>>>>> from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
>>>>> SHM_COLOUR change, but it made no difference.
>>>>> 
>>>>> If hardware is necessary for testing, I can provide it, including 
>>>>> remote BMC
>>>>> access for restarts/kernel debugging.  Any takers?
>>>> 
>>>> I won't have time to test myself, but maybe you could test?
>>>> 
>>>> Basically we should try to find out why 
>>>> io_uring_mmu_get_unmapped_area()
>>>> doesn't return valid addresses, while arch_get_unmapped_area()
>>>> [in arch/ia64/kernel/sys_ia64.c] does.
>>>> 
>>>> You could apply this patch first:
>>>> It introduces a memory leak (as it requests memory twice), but maybe 
>>>> we
>>>> get an idea?
>>>> The ia64 arch_get_unmapped_area() searches for memory from bottom
>>>> (flags=0), while io_uring function tries top-down first. Maybe 
>>>> that's
>>>> the problem. And I don't understand the offset_in_page() check right
>>>> now.
>>>> 
>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>> index 3bca7a79efda..93b1964d2bbb 100644
>>>> --- a/io_uring/io_uring.c
>>>> +++ b/io_uring/io_uring.c
>>>> @@ -3431,13 +3431,17 @@ static unsigned long 
>>>> io_uring_mmu_get_unmapped_area(struct file *filp,
>>>>       * can happen with large stack limits and large mmap()
>>>>       * allocations.
>>>>       */
>>>> +/* compare to arch_get_unmapped_area() in 
>>>> arch/ia64/kernel/sys_ia64.c */
>>>>      addr = vm_unmapped_area(&info);
>>>> -    if (offset_in_page(addr)) {
>>>> +printk("io_uring_mmu_get_unmapped_area() address 1 is: %px\n", 
>>>> addr);
>>>> +    addr = NULL;
>>>> +    if (!addr) {
>>>>          info.flags = 0;
>>>>          info.low_limit = TASK_UNMAPPED_BASE;
>>>>          info.high_limit = mmap_end;
>>>>          addr = vm_unmapped_area(&info);
>>>>      }
>>>> +printk("io_uring_mmu_get_unmapped_area() returns address %px\n", 
>>>> addr);
>>>> 
>>>>      return addr;
>>>>  }
>>>> 
>>>> 
>>>> Another option is to disable the call to 
>>>> io_uring_nommu_get_unmapped_area())
>>>> with the next patch. Maybe you could add printks() to ia64's 
>>>> arch_get_unmapped_area()
>>>> and check what it returns there?
>>>> 
>>>> @@ -3654,6 +3658,8 @@ static const struct file_operations 
>>>> io_uring_fops = {
>>>>  #ifndef CONFIG_MMU
>>>>      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>>>>      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>>>> +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
>>>> +    .get_unmapped_area = NULL,
>>>>  #else
>>>>      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>>>>  #endif
>>>> 
>>>> Helge
>>> 
>>> Thanks Helge.  Sample output from that first patch:
>>> 
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
>>> is: 1ffffffffff40000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
>>> address 2000000001e40000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
>>> is: 1ffffffffff20000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
>>> address 2000000001f20000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
>>> is: 1ffffffffff30000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
>>> address 2000000001f30000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() address 1 
>>> is: 1ffffffffff90000
>>> [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area() returns 
>>> address 2000000001f90000
>>> 
>>> This pattern seems to be pretty stable, I tried instead just directly 
>>> returning the result of a call to arch_get_unmapped_area() at the end 
>>> of the function and it seems similar:
>>> 
>>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would 
>>> return address 1ffffffffffd0000
>>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
>>> address 2000000001f00000
>>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would 
>>> return address 1ffffffffff00000
>>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
>>> address 1ffffffffff00000
>>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would 
>>> return address 1fffffffffe20000
>>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
>>> address 2000000002000000
>>> [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area() would 
>>> return address 1fffffffffe30000
>>> [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would return 
>>> address 2000000002100000
>>> 
>>> Is that enough of a clue to go on?
>> 
>> SHMLBA on ia64 is 0x100000:
>> arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
>> but the values returned by io_uring_mmu_get_unmapped_area() does not 
>> fullfill this.
>> 
>> So, probably ia64's SHMLBA isn't pulled in correctly in 
>> io_uring/io_uring.c.
>> Check value of this line:
>>      info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>> 
>> You could also add
>> #define SHM_COLOUR  0x100000
>> in front of the
>>      #ifdef SHM_COLOUR
>> (define SHM_COLOUR in io_uring/kbuf.c too).
> 
> What is the value of PAGE_SIZE and "ptr" on your machine?
> For 4k page size I get:
> SHMLBA -1   ->        FFFFF
> PAGE_MASK   -> FFFFFFFFF000
> so,
> info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
> You could try to set nfo.align_mask = 0xfffff;
> 
> Helge

Using 64KiB (65536) PAGE_SIZE here.  64-bit pointers.

Tried both #define SHM_COLOUR 0x100000 and info.align_mask = 0xFFFFF, 
but both of them made the problem change from 100% reproducible to 
intermittent.

After inspecting the output I observed that it hangs only when the first 
allocation returns an address below 0x2000000000000000, and the second 
returns an address above it.  When both addresses are above it, it does 
not hang.  Examples:

When it works:
$ cmake --version
cmake version 3.26.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ dmesg --color=always -T | tail -n 4
[Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would return 
address 1fffffffffe20000
[Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return 
address 2000000002000000
[Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would return 
address 1fffffffffe50000
[Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return 
address 2000000002100000


When it hangs:
$ cmake --version
cmake version 3.26.4

CMake suite maintained and supported by Kitware (kitware.com/cmake).
^C
$ dmesg --color=always -T | tail -n 4
[Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would return 
address 1ffffffffff00000
[Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return 
address 1ffffffffff00000
[Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would return 
address 1fffffffffe60000
[Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return 
address 2000000001f00000

Is io_uring_mmu_get_unmapped_area supposed to always return addresses 
above 0x2000000000000000?  Any reason why it is not doing so sometimes?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-13  0:35             ` matoro
@ 2023-07-13  7:27               ` Helge Deller
  2023-07-13 23:57                 ` matoro
  0 siblings, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-07-13  7:27 UTC (permalink / raw)
  To: matoro; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

* matoro <matoro_mailinglist_kernel@matoro.tk>:
> On 2023-07-12 16:30, Helge Deller wrote:
> > On 7/12/23 21:05, Helge Deller wrote:
> > > On 7/12/23 19:28, matoro wrote:
> > > > On 2023-07-12 12:24, Helge Deller wrote:
> > > > > Hi Matoro,
> > > > >
> > > > > * matoro <matoro_mailinglist_kernel@matoro.tk>:
> > > > > > On 2023-03-14 13:16, Jens Axboe wrote:
> > > > > > > From: Helge Deller <deller@gmx.de>
> > > > > > >
> > > > > > > Some architectures have memory cache aliasing requirements (e.g. parisc)
> > > > > > > if memory is shared between userspace and kernel. This patch fixes the
> > > > > > > kernel to return an aliased address when asked by userspace via mmap().
> > > > > > >
> > > > > > > Signed-off-by: Helge Deller <deller@gmx.de>
> > > > > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > > > > > ---
> > > > > > >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 51 insertions(+)
> > > > > > >
> > > > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > > > > index 722624b6d0dc..3adecebbac71 100644
> > > > > > > --- a/io_uring/io_uring.c
> > > > > > > +++ b/io_uring/io_uring.c
> > > > > > > @@ -72,6 +72,7 @@
> > > > > > >  #include <linux/io_uring.h>
> > > > > > >  #include <linux/audit.h>
> > > > > > >  #include <linux/security.h>
> > > > > > > +#include <asm/shmparam.h>
> > > > > > >
> > > > > > >  #define CREATE_TRACE_POINTS
> > > > > > >  #include <trace/events/io_uring.h>
> > > > > > > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
> > > > > > > *file, struct vm_area_struct *vma)
> > > > > > >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
> > > > > > > vma->vm_page_prot);
> > > > > > >  }
> > > > > > >
> > > > > > > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
> > > > > > > +            unsigned long addr, unsigned long len,
> > > > > > > +            unsigned long pgoff, unsigned long flags)
> > > > > > > +{
> > > > > > > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> > > > > > > +    struct vm_unmapped_area_info info;
> > > > > > > +    void *ptr;
> > > > > > > +
> > > > > > > +    /*
> > > > > > > +     * Do not allow to map to user-provided address to avoid breaking the
> > > > > > > +     * aliasing rules. Userspace is not able to guess the offset address
> > > > > > > of
> > > > > > > +     * kernel kmalloc()ed memory area.
> > > > > > > +     */
> > > > > > > +    if (addr)
> > > > > > > +        return -EINVAL;
> > > > > > > +
> > > > > > > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
> > > > > > > +    if (IS_ERR(ptr))
> > > > > > > +        return -ENOMEM;
> > > > > > > +
> > > > > > > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> > > > > > > +    info.length = len;
> > > > > > > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> > > > > > > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
> > > > > > > +#ifdef SHM_COLOUR
> > > > > > > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
> > > > > > > +#else
> > > > > > > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> > > > > > > +#endif
> > > > > > > +    info.align_offset = (unsigned long) ptr;
> > > > > > > +
> > > > > > > +    /*
> > > > > > > +     * A failed mmap() very likely causes application failure,
> > > > > > > +     * so fall back to the bottom-up function here. This scenario
> > > > > > > +     * can happen with large stack limits and large mmap()
> > > > > > > +     * allocations.
> > > > > > > +     */
> > > > > > > +    addr = vm_unmapped_area(&info);
> > > > > > > +    if (offset_in_page(addr)) {
> > > > > > > +        info.flags = 0;
> > > > > > > +        info.low_limit = TASK_UNMAPPED_BASE;
> > > > > > > +        info.high_limit = mmap_end;
> > > > > > > +        addr = vm_unmapped_area(&info);
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    return addr;
> > > > > > > +}
> > > > > > > +
> > > > > > >  #else /* !CONFIG_MMU */
> > > > > > >
> > > > > > >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
> > > > > > > = {
> > > > > > >  #ifndef CONFIG_MMU
> > > > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
> > > > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
> > > > > > > +#else
> > > > > > > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
> > > > > > >  #endif
> > > > > > >      .poll        = io_uring_poll,
> > > > > > >  #ifdef CONFIG_PROC_FS
> > > > > >
> > > > > > Hi Jens, Helge - I've bisected a regression with
> > > > > > io_uring on ia64 to this
> > > > > > patch in 6.4.  Unfortunately this breaks userspace
> > > > > > programs using io_uring,
> > > > > > the easiest one to test is cmake with an io_uring
> > > > > > enabled libuv (i.e., libuv
> > > > > > >= 1.45.0) which will hang.
> > > > > >
> > > > > > I am aware that ia64 is in a vulnerable place right now
> > > > > > which I why I am
> > > > > > keeping this spread limited.  Since this clearly involves
> > > > > > architecture-specific changes for parisc,
> > > > >
> > > > > it isn't so much architecture-specific... (just one ifdef)
> > > > >
> > > > > > is there any chance of looking at
> > > > > > what is required to do the same for ia64?  I looked at
> > > > > > 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc:
> > > > > > change value of SHMLBA
> > > > > > from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
> > > > > > SHM_COLOUR change, but it made no difference.
> > > > > >
> > > > > > If hardware is necessary for testing, I can provide it,
> > > > > > including remote BMC
> > > > > > access for restarts/kernel debugging.  Any takers?
> > > > >
> > > > > I won't have time to test myself, but maybe you could test?
> > > > >
> > > > > Basically we should try to find out why
> > > > > io_uring_mmu_get_unmapped_area()
> > > > > doesn't return valid addresses, while arch_get_unmapped_area()
> > > > > [in arch/ia64/kernel/sys_ia64.c] does.
> > > > >
> > > > > You could apply this patch first:
> > > > > It introduces a memory leak (as it requests memory twice),
> > > > > but maybe we
> > > > > get an idea?
> > > > > The ia64 arch_get_unmapped_area() searches for memory from bottom
> > > > > (flags=0), while io_uring function tries top-down first.
> > > > > Maybe that's
> > > > > the problem. And I don't understand the offset_in_page() check right
> > > > > now.
> > > > >
> > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > > index 3bca7a79efda..93b1964d2bbb 100644
> > > > > --- a/io_uring/io_uring.c
> > > > > +++ b/io_uring/io_uring.c
> > > > > @@ -3431,13 +3431,17 @@ static unsigned long
> > > > > io_uring_mmu_get_unmapped_area(struct file *filp,
> > > > >       * can happen with large stack limits and large mmap()
> > > > >       * allocations.
> > > > >       */
> > > > > +/* compare to arch_get_unmapped_area() in
> > > > > arch/ia64/kernel/sys_ia64.c */
> > > > >      addr = vm_unmapped_area(&info);
> > > > > -    if (offset_in_page(addr)) {
> > > > > +printk("io_uring_mmu_get_unmapped_area() address 1 is:
> > > > > %px\n", addr);
> > > > > +    addr = NULL;
> > > > > +    if (!addr) {
> > > > >          info.flags = 0;
> > > > >          info.low_limit = TASK_UNMAPPED_BASE;
> > > > >          info.high_limit = mmap_end;
> > > > >          addr = vm_unmapped_area(&info);
> > > > >      }
> > > > > +printk("io_uring_mmu_get_unmapped_area() returns address
> > > > > %px\n", addr);
> > > > >
> > > > >      return addr;
> > > > >  }
> > > > >
> > > > >
> > > > > Another option is to disable the call to
> > > > > io_uring_nommu_get_unmapped_area())
> > > > > with the next patch. Maybe you could add printks() to ia64's
> > > > > arch_get_unmapped_area()
> > > > > and check what it returns there?
> > > > >
> > > > > @@ -3654,6 +3658,8 @@ static const struct file_operations
> > > > > io_uring_fops = {
> > > > >  #ifndef CONFIG_MMU
> > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
> > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
> > > > > +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
> > > > > +    .get_unmapped_area = NULL,
> > > > >  #else
> > > > >      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
> > > > >  #endif
> > > > >
> > > > > Helge
> > > >
> > > > Thanks Helge.  Sample output from that first patch:
> > > >
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > address 1 is: 1ffffffffff40000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > returns address 2000000001e40000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > address 1 is: 1ffffffffff20000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > returns address 2000000001f20000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > address 1 is: 1ffffffffff30000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > returns address 2000000001f30000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > address 1 is: 1ffffffffff90000
> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > returns address 2000000001f90000
> > > >
> > > > This pattern seems to be pretty stable, I tried instead just
> > > > directly returning the result of a call to
> > > > arch_get_unmapped_area() at the end of the function and it seems
> > > > similar:
> > > >
> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > would return address 1ffffffffffd0000
> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > return address 2000000001f00000
> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > would return address 1ffffffffff00000
> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > return address 1ffffffffff00000
> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > would return address 1fffffffffe20000
> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > return address 2000000002000000
> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > would return address 1fffffffffe30000
> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > return address 2000000002100000
> > > >
> > > > Is that enough of a clue to go on?
> > >
> > > SHMLBA on ia64 is 0x100000:
> > > arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
> > > but the values returned by io_uring_mmu_get_unmapped_area() does not
> > > fullfill this.
> > >
> > > So, probably ia64's SHMLBA isn't pulled in correctly in
> > > io_uring/io_uring.c.
> > > Check value of this line:
> > >      info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> > >
> > > You could also add
> > > #define SHM_COLOUR  0x100000
> > > in front of the
> > >      #ifdef SHM_COLOUR
> > > (define SHM_COLOUR in io_uring/kbuf.c too).
> >
> > What is the value of PAGE_SIZE and "ptr" on your machine?
> > For 4k page size I get:
> > SHMLBA -1   ->        FFFFF
> > PAGE_MASK   -> FFFFFFFFF000
> > so,
> > info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
> > You could try to set nfo.align_mask = 0xfffff;
> >
> > Helge
>
> Using 64KiB (65536) PAGE_SIZE here.  64-bit pointers.
>
> Tried both #define SHM_COLOUR 0x100000, as well and info.align_mask =
> 0xFFFFF, but both of them made the problem change from 100% reproducible, to
> intermittent.
>
> After inspecting the ouput I observed that it hangs only when the first
> allocation returns an address below 0x2000000000000000, and the second
> returns an address above it.  When both addresses are above it, it does not
> hang.  Examples:
>
> When it works:
> $ cmake --version
> cmake version 3.26.4
>
> CMake suite maintained and supported by Kitware (kitware.com/cmake).
> $ dmesg --color=always -T | tail -n 4
> [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would return
> address 1fffffffffe20000
> [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return address
> 2000000002000000
> [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would return
> address 1fffffffffe50000
> [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return address
> 2000000002100000
>
>
> When it hangs:
> $ cmake --version
> cmake version 3.26.4
>
> CMake suite maintained and supported by Kitware (kitware.com/cmake).
> ^C
> $ dmesg --color=always -T | tail -n 4
> [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would return
> address 1ffffffffff00000
> [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return address
> 1ffffffffff00000
> [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would return
> address 1fffffffffe60000
> [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return address
> 2000000001f00000
>
> Is io_uring_mmu_get_unmapped_area supported to always return addresses above
> 0x2000000000000000?

Yes, with the patch below.

> Any reason why it is not doing so sometimes?

It depends on the parameters for vm_unmapped_area(). Specifically
info.flags=0.

Try this patch:

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..b259794ab53b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3429,10 +3429,13 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
 	 * A failed mmap() very likely causes application failure,
 	 * so fall back to the bottom-up function here. This scenario
 	 * can happen with large stack limits and large mmap()
-	 * allocations.
+	 * allocations. Use bottom-up on IA64 for correct aliasing.
 	 */
-	addr = vm_unmapped_area(&info);
-	if (offset_in_page(addr)) {
+	if (IS_ENABLED(CONFIG_IA64))
+		addr = NULL;
+	else
+		addr = vm_unmapped_area(&info);
+	if (!addr) {
 		info.flags = 0;
 		info.low_limit = TASK_UNMAPPED_BASE;
 		info.high_limit = mmap_end;
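
Condensed, the mmap-area selection with this change applied looks roughly
like the sketch below (assuming 0 is meant where the diff writes NULL,
since addr is an unsigned long):

	if (IS_ENABLED(CONFIG_IA64))
		addr = 0;				/* skip the top-down search */
	else
		addr = vm_unmapped_area(&info);		/* top-down, as before */

	if (!addr) {
		/* bottom-up pass, like ia64's arch_get_unmapped_area() */
		info.flags = 0;
		info.low_limit = TASK_UNMAPPED_BASE;
		info.high_limit = mmap_end;
		addr = vm_unmapped_area(&info);
	}

Note that this also replaces the offset_in_page() error check with a plain
!addr test, which is one of the questions raised further down in the thread.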

Helge

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-13  7:27               ` Helge Deller
@ 2023-07-13 23:57                 ` matoro
  2023-07-16  6:54                   ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: matoro @ 2023-07-13 23:57 UTC (permalink / raw)
  To: Helge Deller; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 2023-07-13 03:27, Helge Deller wrote:
> * matoro <matoro_mailinglist_kernel@matoro.tk>:
>> On 2023-07-12 16:30, Helge Deller wrote:
>> > On 7/12/23 21:05, Helge Deller wrote:
>> > > On 7/12/23 19:28, matoro wrote:
>> > > > On 2023-07-12 12:24, Helge Deller wrote:
>> > > > > Hi Matoro,
>> > > > >
>> > > > > * matoro <matoro_mailinglist_kernel@matoro.tk>:
>> > > > > > On 2023-03-14 13:16, Jens Axboe wrote:
>> > > > > > > From: Helge Deller <deller@gmx.de>
>> > > > > > >
>> > > > > > > Some architectures have memory cache aliasing requirements (e.g. parisc)
>> > > > > > > if memory is shared between userspace and kernel. This patch fixes the
>> > > > > > > kernel to return an aliased address when asked by userspace via mmap().
>> > > > > > >
>> > > > > > > Signed-off-by: Helge Deller <deller@gmx.de>
>> > > > > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> > > > > > > ---
>> > > > > > >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
>> > > > > > >  1 file changed, 51 insertions(+)
>> > > > > > >
>> > > > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> > > > > > > index 722624b6d0dc..3adecebbac71 100644
>> > > > > > > --- a/io_uring/io_uring.c
>> > > > > > > +++ b/io_uring/io_uring.c
>> > > > > > > @@ -72,6 +72,7 @@
>> > > > > > >  #include <linux/io_uring.h>
>> > > > > > >  #include <linux/audit.h>
>> > > > > > >  #include <linux/security.h>
>> > > > > > > +#include <asm/shmparam.h>
>> > > > > > >
>> > > > > > >  #define CREATE_TRACE_POINTS
>> > > > > > >  #include <trace/events/io_uring.h>
>> > > > > > > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
>> > > > > > > *file, struct vm_area_struct *vma)
>> > > > > > >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
>> > > > > > > vma->vm_page_prot);
>> > > > > > >  }
>> > > > > > >
>> > > > > > > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
>> > > > > > > +            unsigned long addr, unsigned long len,
>> > > > > > > +            unsigned long pgoff, unsigned long flags)
>> > > > > > > +{
>> > > > > > > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
>> > > > > > > +    struct vm_unmapped_area_info info;
>> > > > > > > +    void *ptr;
>> > > > > > > +
>> > > > > > > +    /*
>> > > > > > > +     * Do not allow to map to user-provided address to avoid breaking the
>> > > > > > > +     * aliasing rules. Userspace is not able to guess the offset address
>> > > > > > > of
>> > > > > > > +     * kernel kmalloc()ed memory area.
>> > > > > > > +     */
>> > > > > > > +    if (addr)
>> > > > > > > +        return -EINVAL;
>> > > > > > > +
>> > > > > > > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
>> > > > > > > +    if (IS_ERR(ptr))
>> > > > > > > +        return -ENOMEM;
>> > > > > > > +
>> > > > > > > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>> > > > > > > +    info.length = len;
>> > > > > > > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>> > > > > > > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
>> > > > > > > +#ifdef SHM_COLOUR
>> > > > > > > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
>> > > > > > > +#else
>> > > > > > > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>> > > > > > > +#endif
>> > > > > > > +    info.align_offset = (unsigned long) ptr;
>> > > > > > > +
>> > > > > > > +    /*
>> > > > > > > +     * A failed mmap() very likely causes application failure,
>> > > > > > > +     * so fall back to the bottom-up function here. This scenario
>> > > > > > > +     * can happen with large stack limits and large mmap()
>> > > > > > > +     * allocations.
>> > > > > > > +     */
>> > > > > > > +    addr = vm_unmapped_area(&info);
>> > > > > > > +    if (offset_in_page(addr)) {
>> > > > > > > +        info.flags = 0;
>> > > > > > > +        info.low_limit = TASK_UNMAPPED_BASE;
>> > > > > > > +        info.high_limit = mmap_end;
>> > > > > > > +        addr = vm_unmapped_area(&info);
>> > > > > > > +    }
>> > > > > > > +
>> > > > > > > +    return addr;
>> > > > > > > +}
>> > > > > > > +
>> > > > > > >  #else /* !CONFIG_MMU */
>> > > > > > >
>> > > > > > >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>> > > > > > > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
>> > > > > > > = {
>> > > > > > >  #ifndef CONFIG_MMU
>> > > > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>> > > > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>> > > > > > > +#else
>> > > > > > > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>> > > > > > >  #endif
>> > > > > > >      .poll        = io_uring_poll,
>> > > > > > >  #ifdef CONFIG_PROC_FS
>> > > > > >
>> > > > > > Hi Jens, Helge - I've bisected a regression with
>> > > > > > io_uring on ia64 to this
>> > > > > > patch in 6.4.  Unfortunately this breaks userspace
>> > > > > > programs using io_uring,
>> > > > > > the easiest one to test is cmake with an io_uring
>> > > > > > enabled libuv (i.e., libuv
>> > > > > > >= 1.45.0) which will hang.
>> > > > > >
>> > > > > > I am aware that ia64 is in a vulnerable place right now
>> > > > > > which I why I am
>> > > > > > keeping this spread limited.  Since this clearly involves
>> > > > > > architecture-specific changes for parisc,
>> > > > >
>> > > > > it isn't so much architecture-specific... (just one ifdef)
>> > > > >
>> > > > > > is there any chance of looking at
>> > > > > > what is required to do the same for ia64?  I looked at
>> > > > > > 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc:
>> > > > > > change value of SHMLBA
>> > > > > > from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
>> > > > > > SHM_COLOUR change, but it made no difference.
>> > > > > >
>> > > > > > If hardware is necessary for testing, I can provide it,
>> > > > > > including remote BMC
>> > > > > > access for restarts/kernel debugging.  Any takers?
>> > > > >
>> > > > > I won't have time to test myself, but maybe you could test?
>> > > > >
>> > > > > Basically we should try to find out why
>> > > > > io_uring_mmu_get_unmapped_area()
>> > > > > doesn't return valid addresses, while arch_get_unmapped_area()
>> > > > > [in arch/ia64/kernel/sys_ia64.c] does.
>> > > > >
>> > > > > You could apply this patch first:
>> > > > > It introduces a memory leak (as it requests memory twice),
>> > > > > but maybe we
>> > > > > get an idea?
>> > > > > The ia64 arch_get_unmapped_area() searches for memory from bottom
>> > > > > (flags=0), while io_uring function tries top-down first.
>> > > > > Maybe that's
>> > > > > the problem. And I don't understand the offset_in_page() check right
>> > > > > now.
>> > > > >
>> > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> > > > > index 3bca7a79efda..93b1964d2bbb 100644
>> > > > > --- a/io_uring/io_uring.c
>> > > > > +++ b/io_uring/io_uring.c
>> > > > > @@ -3431,13 +3431,17 @@ static unsigned long
>> > > > > io_uring_mmu_get_unmapped_area(struct file *filp,
>> > > > >       * can happen with large stack limits and large mmap()
>> > > > >       * allocations.
>> > > > >       */
>> > > > > +/* compare to arch_get_unmapped_area() in
>> > > > > arch/ia64/kernel/sys_ia64.c */
>> > > > >      addr = vm_unmapped_area(&info);
>> > > > > -    if (offset_in_page(addr)) {
>> > > > > +printk("io_uring_mmu_get_unmapped_area() address 1 is:
>> > > > > %px\n", addr);
>> > > > > +    addr = NULL;
>> > > > > +    if (!addr) {
>> > > > >          info.flags = 0;
>> > > > >          info.low_limit = TASK_UNMAPPED_BASE;
>> > > > >          info.high_limit = mmap_end;
>> > > > >          addr = vm_unmapped_area(&info);
>> > > > >      }
>> > > > > +printk("io_uring_mmu_get_unmapped_area() returns address
>> > > > > %px\n", addr);
>> > > > >
>> > > > >      return addr;
>> > > > >  }
>> > > > >
>> > > > >
>> > > > > Another option is to disable the call to
>> > > > > io_uring_nommu_get_unmapped_area())
>> > > > > with the next patch. Maybe you could add printks() to ia64's
>> > > > > arch_get_unmapped_area()
>> > > > > and check what it returns there?
>> > > > >
>> > > > > @@ -3654,6 +3658,8 @@ static const struct file_operations
>> > > > > io_uring_fops = {
>> > > > >  #ifndef CONFIG_MMU
>> > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
>> > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
>> > > > > +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
>> > > > > +    .get_unmapped_area = NULL,
>> > > > >  #else
>> > > > >      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
>> > > > >  #endif
>> > > > >
>> > > > > Helge
>> > > >
>> > > > Thanks Helge.  Sample output from that first patch:
>> > > >
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > address 1 is: 1ffffffffff40000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > returns address 2000000001e40000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > address 1 is: 1ffffffffff20000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > returns address 2000000001f20000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > address 1 is: 1ffffffffff30000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > returns address 2000000001f30000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > address 1 is: 1ffffffffff90000
>> > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
>> > > > returns address 2000000001f90000
>> > > >
>> > > > This pattern seems to be pretty stable, I tried instead just
>> > > > directly returning the result of a call to
>> > > > arch_get_unmapped_area() at the end of the function and it seems
>> > > > similar:
>> > > >
>> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
>> > > > would return address 1ffffffffffd0000
>> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
>> > > > return address 2000000001f00000
>> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
>> > > > would return address 1ffffffffff00000
>> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
>> > > > return address 1ffffffffff00000
>> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
>> > > > would return address 1fffffffffe20000
>> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
>> > > > return address 2000000002000000
>> > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
>> > > > would return address 1fffffffffe30000
>> > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
>> > > > return address 2000000002100000
>> > > >
>> > > > Is that enough of a clue to go on?
>> > >
>> > > SHMLBA on ia64 is 0x100000:
>> > > arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
>> > > but the values returned by io_uring_mmu_get_unmapped_area() does not
>> > > fullfill this.
>> > >
>> > > So, probably ia64's SHMLBA isn't pulled in correctly in
>> > > io_uring/io_uring.c.
>> > > Check value of this line:
>> > >      info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
>> > >
>> > > You could also add
>> > > #define SHM_COLOUR  0x100000
>> > > in front of the
>> > >      #ifdef SHM_COLOUR
>> > > (define SHM_COLOUR in io_uring/kbuf.c too).
>> >
>> > What is the value of PAGE_SIZE and "ptr" on your machine?
>> > For 4k page size I get:
>> > SHMLBA -1   ->        FFFFF
>> > PAGE_MASK   -> FFFFFFFFF000
>> > so,
>> > info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
>> > You could try to set nfo.align_mask = 0xfffff;
>> >
>> > Helge
>> 
>> Using 64KiB (65536) PAGE_SIZE here.  64-bit pointers.
>> 
>> Tried both #define SHM_COLOUR 0x100000, as well and info.align_mask =
>> 0xFFFFF, but both of them made the problem change from 100% 
>> reproducible, to
>> intermittent.
>> 
>> After inspecting the ouput I observed that it hangs only when the 
>> first
>> allocation returns an address below 0x2000000000000000, and the second
>> returns an address above it.  When both addresses are above it, it 
>> does not
>> hang.  Examples:
>> 
>> When it works:
>> $ cmake --version
>> cmake version 3.26.4
>> 
>> CMake suite maintained and supported by Kitware (kitware.com/cmake).
>> $ dmesg --color=always -T | tail -n 4
>> [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would 
>> return
>> address 1fffffffffe20000
>> [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return 
>> address
>> 2000000002000000
>> [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would 
>> return
>> address 1fffffffffe50000
>> [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return 
>> address
>> 2000000002100000
>> 
>> 
>> When it hangs:
>> $ cmake --version
>> cmake version 3.26.4
>> 
>> CMake suite maintained and supported by Kitware (kitware.com/cmake).
>> ^C
>> $ dmesg --color=always -T | tail -n 4
>> [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would 
>> return
>> address 1ffffffffff00000
>> [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return 
>> address
>> 1ffffffffff00000
>> [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would 
>> return
>> address 1fffffffffe60000
>> [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return 
>> address
>> 2000000001f00000
>> 
>> Is io_uring_mmu_get_unmapped_area supported to always return addresses 
>> above
>> 0x2000000000000000?
> 
> Yes, with the patch below.
> 
>> Any reason why it is not doing so sometimes?
> 
> It depends on the parameters for vm_unmapped_area(). Specifically
> info.flags=0.
> 
> Try this patch:
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 3bca7a79efda..b259794ab53b 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -3429,10 +3429,13 @@ static unsigned long 
> io_uring_mmu_get_unmapped_area(struct file *filp,
>  	 * A failed mmap() very likely causes application failure,
>  	 * so fall back to the bottom-up function here. This scenario
>  	 * can happen with large stack limits and large mmap()
> -	 * allocations.
> +	 * allocations. Use bottom-up on IA64 for correct aliasing.
>  	 */
> -	addr = vm_unmapped_area(&info);
> -	if (offset_in_page(addr)) {
> +	if (IS_ENABLED(CONFIG_IA64))
> +		addr = NULL;
> +	else
> +		addr = vm_unmapped_area(&info);
> +	if (!addr) {
>  		info.flags = 0;
>  		info.low_limit = TASK_UNMAPPED_BASE;
>  		info.high_limit = mmap_end;
> 
> Helge

This patch does do the trick, but I am a little unsure if it's the right 
one to go in:

* Adding an arch-specific conditional feels like a bad hack; why is it 
not working with the other vm_unmapped_area_info settings?
* What happened to the offset_in_page check for other arches?  (see the 
sketch below)
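
On the offset_in_page() point, that check is the usual idiom for detecting
a failed vm_unmapped_area() call: on failure it returns a negative errno
cast to unsigned long, which is never page aligned, while a successful
search always returns a page-aligned address. A minimal sketch, assuming
the same info setup as in the patch:

	unsigned long addr = vm_unmapped_area(&info);

	if (offset_in_page(addr)) {
		/*
		 * addr is really something like (unsigned long)-ENOMEM,
		 * i.e. the top-down search failed, so retry bottom-up
		 * (or return the error to the caller).
		 */
	}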

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-13 23:57                 ` matoro
@ 2023-07-16  6:54                   ` Helge Deller
  2023-07-16 18:03                     ` matoro
  0 siblings, 1 reply; 36+ messages in thread
From: Helge Deller @ 2023-07-16  6:54 UTC (permalink / raw)
  To: matoro
  Cc: Helge Deller, Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

* matoro <matoro_mailinglist_kernel@matoro.tk>:
> On 2023-07-13 03:27, Helge Deller wrote:
> > * matoro <matoro_mailinglist_kernel@matoro.tk>:
> > > On 2023-07-12 16:30, Helge Deller wrote:
> > > > On 7/12/23 21:05, Helge Deller wrote:
> > > > > On 7/12/23 19:28, matoro wrote:
> > > > > > On 2023-07-12 12:24, Helge Deller wrote:
> > > > > > > Hi Matoro,
> > > > > > >
> > > > > > > * matoro <matoro_mailinglist_kernel@matoro.tk>:
> > > > > > > > On 2023-03-14 13:16, Jens Axboe wrote:
> > > > > > > > > From: Helge Deller <deller@gmx.de>
> > > > > > > > >
> > > > > > > > > Some architectures have memory cache aliasing requirements (e.g. parisc)
> > > > > > > > > if memory is shared between userspace and kernel. This patch fixes the
> > > > > > > > > kernel to return an aliased address when asked by userspace via mmap().
> > > > > > > > >
> > > > > > > > > Signed-off-by: Helge Deller <deller@gmx.de>
> > > > > > > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > > > > > > > ---
> > > > > > > > >  io_uring/io_uring.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  1 file changed, 51 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > > > > > > index 722624b6d0dc..3adecebbac71 100644
> > > > > > > > > --- a/io_uring/io_uring.c
> > > > > > > > > +++ b/io_uring/io_uring.c
> > > > > > > > > @@ -72,6 +72,7 @@
> > > > > > > > >  #include <linux/io_uring.h>
> > > > > > > > >  #include <linux/audit.h>
> > > > > > > > >  #include <linux/security.h>
> > > > > > > > > +#include <asm/shmparam.h>
> > > > > > > > >
> > > > > > > > >  #define CREATE_TRACE_POINTS
> > > > > > > > >  #include <trace/events/io_uring.h>
> > > > > > > > > @@ -3317,6 +3318,54 @@ static __cold int io_uring_mmap(struct file
> > > > > > > > > *file, struct vm_area_struct *vma)
> > > > > > > > >      return remap_pfn_range(vma, vma->vm_start, pfn, sz,
> > > > > > > > > vma->vm_page_prot);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
> > > > > > > > > +            unsigned long addr, unsigned long len,
> > > > > > > > > +            unsigned long pgoff, unsigned long flags)
> > > > > > > > > +{
> > > > > > > > > +    const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
> > > > > > > > > +    struct vm_unmapped_area_info info;
> > > > > > > > > +    void *ptr;
> > > > > > > > > +
> > > > > > > > > +    /*
> > > > > > > > > +     * Do not allow to map to user-provided address to avoid breaking the
> > > > > > > > > +     * aliasing rules. Userspace is not able to guess the offset address
> > > > > > > > > of
> > > > > > > > > +     * kernel kmalloc()ed memory area.
> > > > > > > > > +     */
> > > > > > > > > +    if (addr)
> > > > > > > > > +        return -EINVAL;
> > > > > > > > > +
> > > > > > > > > +    ptr = io_uring_validate_mmap_request(filp, pgoff, len);
> > > > > > > > > +    if (IS_ERR(ptr))
> > > > > > > > > +        return -ENOMEM;
> > > > > > > > > +
> > > > > > > > > +    info.flags = VM_UNMAPPED_AREA_TOPDOWN;
> > > > > > > > > +    info.length = len;
> > > > > > > > > +    info.low_limit = max(PAGE_SIZE, mmap_min_addr);
> > > > > > > > > +    info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
> > > > > > > > > +#ifdef SHM_COLOUR
> > > > > > > > > +    info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
> > > > > > > > > +#else
> > > > > > > > > +    info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> > > > > > > > > +#endif
> > > > > > > > > +    info.align_offset = (unsigned long) ptr;
> > > > > > > > > +
> > > > > > > > > +    /*
> > > > > > > > > +     * A failed mmap() very likely causes application failure,
> > > > > > > > > +     * so fall back to the bottom-up function here. This scenario
> > > > > > > > > +     * can happen with large stack limits and large mmap()
> > > > > > > > > +     * allocations.
> > > > > > > > > +     */
> > > > > > > > > +    addr = vm_unmapped_area(&info);
> > > > > > > > > +    if (offset_in_page(addr)) {
> > > > > > > > > +        info.flags = 0;
> > > > > > > > > +        info.low_limit = TASK_UNMAPPED_BASE;
> > > > > > > > > +        info.high_limit = mmap_end;
> > > > > > > > > +        addr = vm_unmapped_area(&info);
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > > +    return addr;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > >  #else /* !CONFIG_MMU */
> > > > > > > > >
> > > > > > > > >  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > > > > @@ -3529,6 +3578,8 @@ static const struct file_operations io_uring_fops
> > > > > > > > > = {
> > > > > > > > >  #ifndef CONFIG_MMU
> > > > > > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
> > > > > > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
> > > > > > > > > +#else
> > > > > > > > > +    .get_unmapped_area = io_uring_mmu_get_unmapped_area,
> > > > > > > > >  #endif
> > > > > > > > >      .poll        = io_uring_poll,
> > > > > > > > >  #ifdef CONFIG_PROC_FS
> > > > > > > >
> > > > > > > > Hi Jens, Helge - I've bisected a regression with
> > > > > > > > io_uring on ia64 to this
> > > > > > > > patch in 6.4.  Unfortunately this breaks userspace
> > > > > > > > programs using io_uring,
> > > > > > > > the easiest one to test is cmake with an io_uring
> > > > > > > > enabled libuv (i.e., libuv
> > > > > > > > >= 1.45.0) which will hang.
> > > > > > > >
> > > > > > > > I am aware that ia64 is in a vulnerable place right now
> > > > > > > > which I why I am
> > > > > > > > keeping this spread limited.  Since this clearly involves
> > > > > > > > architecture-specific changes for parisc,
> > > > > > >
> > > > > > > it isn't so much architecture-specific... (just one ifdef)
> > > > > > >
> > > > > > > > is there any chance of looking at
> > > > > > > > what is required to do the same for ia64?  I looked at
> > > > > > > > 0ef36bd2b37815719e31a72d2beecc28ca8ecd26 ("parisc:
> > > > > > > > change value of SHMLBA
> > > > > > > > from 0x00400000 to PAGE_SIZE") and tried to replicate the SHMLBA ->
> > > > > > > > SHM_COLOUR change, but it made no difference.
> > > > > > > >
> > > > > > > > If hardware is necessary for testing, I can provide it,
> > > > > > > > including remote BMC
> > > > > > > > access for restarts/kernel debugging.  Any takers?
> > > > > > >
> > > > > > > I won't have time to test myself, but maybe you could test?
> > > > > > >
> > > > > > > Basically we should try to find out why
> > > > > > > io_uring_mmu_get_unmapped_area()
> > > > > > > doesn't return valid addresses, while arch_get_unmapped_area()
> > > > > > > [in arch/ia64/kernel/sys_ia64.c] does.
> > > > > > >
> > > > > > > You could apply this patch first:
> > > > > > > It introduces a memory leak (as it requests memory twice),
> > > > > > > but maybe we
> > > > > > > get an idea?
> > > > > > > The ia64 arch_get_unmapped_area() searches for memory from bottom
> > > > > > > (flags=0), while io_uring function tries top-down first.
> > > > > > > Maybe that's
> > > > > > > the problem. And I don't understand the offset_in_page() check right
> > > > > > > now.
> > > > > > >
> > > > > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > > > > index 3bca7a79efda..93b1964d2bbb 100644
> > > > > > > --- a/io_uring/io_uring.c
> > > > > > > +++ b/io_uring/io_uring.c
> > > > > > > @@ -3431,13 +3431,17 @@ static unsigned long
> > > > > > > io_uring_mmu_get_unmapped_area(struct file *filp,
> > > > > > >       * can happen with large stack limits and large mmap()
> > > > > > >       * allocations.
> > > > > > >       */
> > > > > > > +/* compare to arch_get_unmapped_area() in
> > > > > > > arch/ia64/kernel/sys_ia64.c */
> > > > > > >      addr = vm_unmapped_area(&info);
> > > > > > > -    if (offset_in_page(addr)) {
> > > > > > > +printk("io_uring_mmu_get_unmapped_area() address 1 is:
> > > > > > > %px\n", addr);
> > > > > > > +    addr = NULL;
> > > > > > > +    if (!addr) {
> > > > > > >          info.flags = 0;
> > > > > > >          info.low_limit = TASK_UNMAPPED_BASE;
> > > > > > >          info.high_limit = mmap_end;
> > > > > > >          addr = vm_unmapped_area(&info);
> > > > > > >      }
> > > > > > > +printk("io_uring_mmu_get_unmapped_area() returns address
> > > > > > > %px\n", addr);
> > > > > > >
> > > > > > >      return addr;
> > > > > > >  }
> > > > > > >
> > > > > > >
> > > > > > > Another option is to disable the call to
> > > > > > > io_uring_mmu_get_unmapped_area()
> > > > > > > with the next patch. Maybe you could add printks() to ia64's
> > > > > > > arch_get_unmapped_area()
> > > > > > > and check what it returns there?
> > > > > > >
> > > > > > > @@ -3654,6 +3658,8 @@ static const struct file_operations
> > > > > > > io_uring_fops = {
> > > > > > >  #ifndef CONFIG_MMU
> > > > > > >      .get_unmapped_area = io_uring_nommu_get_unmapped_area,
> > > > > > >      .mmap_capabilities = io_uring_nommu_mmap_capabilities,
> > > > > > > +#elif 0    /* IS_ENABLED(CONFIG_IA64) */
> > > > > > > +    .get_unmapped_area = NULL,
> > > > > > >  #else
> > > > > > >      .get_unmapped_area = io_uring_mmu_get_unmapped_area,
> > > > > > >  #endif
> > > > > > >
> > > > > > > Helge
> > > > > >
> > > > > > Thanks Helge.  Sample output from that first patch:
> > > > > >
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > address 1 is: 1ffffffffff40000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > returns address 2000000001e40000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > address 1 is: 1ffffffffff20000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > returns address 2000000001f20000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > address 1 is: 1ffffffffff30000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > returns address 2000000001f30000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > address 1 is: 1ffffffffff90000
> > > > > > [Wed Jul 12 13:09:50 2023] io_uring_mmu_get_unmapped_area()
> > > > > > returns address 2000000001f90000
> > > > > >
> > > > > > This pattern seems to be pretty stable; I tried instead just
> > > > > > directly returning the result of a call to
> > > > > > arch_get_unmapped_area() at the end of the function and it seems
> > > > > > similar:
> > > > > >
> > > > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > > > would return address 1ffffffffffd0000
> > > > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > > > return address 2000000001f00000
> > > > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > > > would return address 1ffffffffff00000
> > > > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > > > return address 1ffffffffff00000
> > > > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > > > would return address 1fffffffffe20000
> > > > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > > > return address 2000000002000000
> > > > > > [Wed Jul 12 13:27:07 2023] io_uring_mmu_get_unmapped_area()
> > > > > > would return address 1fffffffffe30000
> > > > > > [Wed Jul 12 13:27:07 2023] but arch_get_unmapped_area() would
> > > > > > return address 2000000002100000
> > > > > >
> > > > > > Is that enough of a clue to go on?
> > > > >
> > > > > SHMLBA on ia64 is 0x100000:
> > > > > arch/ia64/include/asm/shmparam.h:#define        SHMLBA  (1024*1024)
> > > > > but the values returned by io_uring_mmu_get_unmapped_area() do not
> > > > > fulfill this.
> > > > >
> > > > > So, probably ia64's SHMLBA isn't pulled in correctly in
> > > > > io_uring/io_uring.c.
> > > > > Check value of this line:
> > > > >      info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
> > > > >
> > > > > You could also add
> > > > > #define SHM_COLOUR  0x100000
> > > > > in front of the
> > > > >      #ifdef SHM_COLOUR
> > > > > (define SHM_COLOUR in io_uring/kbuf.c too).
> > > >
> > > > What is the value of PAGE_SIZE and "ptr" on your machine?
> > > > For 4k page size I get:
> > > > SHMLBA -1   ->        FFFFF
> > > > PAGE_MASK   -> FFFFFFFFF000
> > > > so,
> > > > info.align_mask = PAGE_MASK & (SHMLBA - 1UL) = 0xFF000;
> > > > You could try to set info.align_mask = 0xfffff;
> > > >
> > > > Helge
> > >
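
Spelling that mask arithmetic out for both page sizes may help; a quick
userspace check, assuming the usual PAGE_MASK = ~(PAGE_SIZE - 1) definition
and the 0x100000 SHMLBA quoted above (illustration only, not code from the
kernel tree):

#include <stdio.h>

int main(void)
{
	unsigned long shmlba = 0x100000UL;	/* ia64 SHMLBA, 1 MiB */
	unsigned long page_size[] = { 0x1000UL, 0x10000UL };	/* 4 KiB, 64 KiB */

	for (int i = 0; i < 2; i++) {
		unsigned long page_mask = ~(page_size[i] - 1UL);
		/* the expression used in io_uring_mmu_get_unmapped_area() */
		unsigned long align_mask = page_mask & (shmlba - 1UL);

		printf("PAGE_SIZE=0x%lx -> align_mask=0x%lx\n",
		       page_size[i], align_mask);
	}
	/* prints 0xff000 for 4 KiB pages and 0xf0000 for 64 KiB pages */
	return 0;
}
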
> > > Using 64KiB (65536) PAGE_SIZE here.  64-bit pointers.
> > >
> > > Tried both #define SHM_COLOUR 0x100000 as well as info.align_mask =
> > > 0xFFFFF, but both of them made the problem change from 100%
> > > reproducible to intermittent.
> > >
> > > After inspecting the output I observed that it hangs only when the
> > > first
> > > allocation returns an address below 0x2000000000000000, and the second
> > > returns an address above it.  When both addresses are above it, it
> > > does not
> > > hang.  Examples:
> > >
> > > When it works:
> > > $ cmake --version
> > > cmake version 3.26.4
> > >
> > > CMake suite maintained and supported by Kitware (kitware.com/cmake).
> > > $ dmesg --color=always -T | tail -n 4
> > > [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would
> > > return
> > > address 1fffffffffe20000
> > > [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return
> > > address
> > > 2000000002000000
> > > [Wed Jul 12 20:32:37 2023] io_uring_mmu_get_unmapped_area() would
> > > return
> > > address 1fffffffffe50000
> > > [Wed Jul 12 20:32:37 2023] but arch_get_unmapped_area() would return
> > > address
> > > 2000000002100000
> > >
> > >
> > > When it hangs:
> > > $ cmake --version
> > > cmake version 3.26.4
> > >
> > > CMake suite maintained and supported by Kitware (kitware.com/cmake).
> > > ^C
> > > $ dmesg --color=always -T | tail -n 4
> > > [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would
> > > return
> > > address 1ffffffffff00000
> > > [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return
> > > address
> > > 1ffffffffff00000
> > > [Wed Jul 12 20:33:12 2023] io_uring_mmu_get_unmapped_area() would
> > > return
> > > address 1fffffffffe60000
> > > [Wed Jul 12 20:33:12 2023] but arch_get_unmapped_area() would return
> > > address
> > > 2000000001f00000
> > >
> > > Is io_uring_mmu_get_unmapped_area supposed to always return
> > > addresses above
> > > 0x2000000000000000?
> >
> > Yes, with the patch below.
> >
> > > Any reason why it is not doing so sometimes?
> >
> > It depends on the parameters for vm_unmapped_area(). Specifically
> > info.flags=0.
> >
> > Try this patch:
> >
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 3bca7a79efda..b259794ab53b 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -3429,10 +3429,13 @@ static unsigned long
> > io_uring_mmu_get_unmapped_area(struct file *filp,
> >  	 * A failed mmap() very likely causes application failure,
> >  	 * so fall back to the bottom-up function here. This scenario
> >  	 * can happen with large stack limits and large mmap()
> > -	 * allocations.
> > +	 * allocations. Use bottom-up on IA64 for correct aliasing.
> >  	 */
> > -	addr = vm_unmapped_area(&info);
> > -	if (offset_in_page(addr)) {
> > +	if (IS_ENABLED(CONFIG_IA64))
> > +		addr = NULL;
> > +	else
> > +		addr = vm_unmapped_area(&info);
> > +	if (!addr) {
> >  		info.flags = 0;
> >  		info.low_limit = TASK_UNMAPPED_BASE;
> >  		info.high_limit = mmap_end;
> >
> > Helge
>
> This patch does do the trick, but I am a little unsure if it's the right one
> to go in:
>
> * Adding an arch-specific conditional feels like a bad hack; why is it not
> working with the other vm_unmapped_area_info settings?

because it tries to map below TASK_UNMAPPED_BASE, for which (I assume) IA-64
has different aliasing/caching rules. There are some comments in the arch/ia64
files, but I'm not an IA-64 expert...

> * What happened to the offset_in_page check for other arches?

I thought it's not necessary.

But below is another (and much better) approach, which you may test.
I see quite a few errors with the liburing test cases on hppa, but I think
they are not related to this function.

Can you test and report back?

Helge


From 457f2c2db984bc159119bfb4426d9dc6c2779ed6 Mon Sep 17 00:00:00 2001
From: Helge Deller <deller@gmx.de>
Date: Sun, 16 Jul 2023 08:45:17 +0200
Subject: [PATCH] io_uring: Adjust mapping wrt architecture aliasing
 requirements

When mapping memory to userspace, use the architecture-provided
get_unmapped_area() function instead of io_uring's own copy, which fails
on IA-64 since that architecture doesn't allow these mappings below
TASK_UNMAPPED_BASE.

Additionally make sure to flag the requested memory as MAP_SHARED so
that any architecture-specific aliasing rules will be applied.

Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk>
Signed-off-by: Helge Deller <deller@gmx.de>

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3bca7a79efda..2e7dd93e45d0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3398,48 +3398,27 @@ static unsigned long io_uring_mmu_get_unmapped_area(struct file *filp,
 			unsigned long addr, unsigned long len,
 			unsigned long pgoff, unsigned long flags)
 {
-	const unsigned long mmap_end = arch_get_mmap_end(addr, len, flags);
-	struct vm_unmapped_area_info info;
 	void *ptr;

 	/*
 	 * Do not allow to map to user-provided address to avoid breaking the
-	 * aliasing rules. Userspace is not able to guess the offset address of
-	 * kernel kmalloc()ed memory area.
+	 * aliasing rules of various architectures. Userspace is not able to
+	 * guess the offset address of kernel kmalloc()ed memory area.
 	 */
-	if (addr)
+	if (addr | (flags & MAP_FIXED))
 		return -EINVAL;

+	/*
+	 * The requested memory region is required to be shared between kernel
+	 * and userspace application.
+	 */
+	flags |= MAP_SHARED;
+
 	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
 	if (IS_ERR(ptr))
 		return -ENOMEM;

-	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
-	info.length = len;
-	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
-	info.high_limit = arch_get_mmap_base(addr, current->mm->mmap_base);
-#ifdef SHM_COLOUR
-	info.align_mask = PAGE_MASK & (SHM_COLOUR - 1UL);
-#else
-	info.align_mask = PAGE_MASK & (SHMLBA - 1UL);
-#endif
-	info.align_offset = (unsigned long) ptr;
-
-	/*
-	 * A failed mmap() very likely causes application failure,
-	 * so fall back to the bottom-up function here. This scenario
-	 * can happen with large stack limits and large mmap()
-	 * allocations.
-	 */
-	addr = vm_unmapped_area(&info);
-	if (offset_in_page(addr)) {
-		info.flags = 0;
-		info.low_limit = TASK_UNMAPPED_BASE;
-		info.high_limit = mmap_end;
-		addr = vm_unmapped_area(&info);
-	}
-
-	return addr;
+	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }

 #else /* !CONFIG_MMU */
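
For context, the address computed by io_uring_mmu_get_unmapped_area() is the
one handed back for the application's mmap() of the rings: userspace passes a
NULL hint and MAP_SHARED, so whatever the kernel-side get_unmapped_area()
picks is where the shared ring pages end up. A minimal sketch of that
userspace side, using the standard io_uring mmap offsets (setup and error
handling omitted; illustration only, not liburing code):

#include <linux/io_uring.h>
#include <sys/mman.h>
#include <stddef.h>

/*
 * Map the SQ ring of an io_uring instance. ring_fd comes from
 * io_uring_setup(), p is the io_uring_params it filled in.
 */
static void *map_sq_ring(int ring_fd, const struct io_uring_params *p)
{
	size_t sz = p->sq_off.array + p->sq_entries * sizeof(unsigned int);

	/*
	 * NULL hint: the kernel chooses the placement, and that choice is
	 * what has to respect SHMLBA/SHM_COLOUR aliasing. MAP_SHARED because
	 * the same pages are written by both kernel and application.
	 */
	return mmap(NULL, sz, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
}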

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-16  6:54                   ` Helge Deller
@ 2023-07-16 18:03                     ` matoro
  2023-07-16 20:54                       ` Helge Deller
  0 siblings, 1 reply; 36+ messages in thread
From: matoro @ 2023-07-16 18:03 UTC (permalink / raw)
  To: Helge Deller; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 2023-07-16 02:54, Helge Deller wrote:
> [...]
> But below is another (and much better) approach, which you may test.
> Can you test and report back?
> [...]

This seems really close.  It worked for the trivial test case, so I ran
the test suite from https://github.com/axboe/liburing to compare.  With
kernel 6.3 I get a 100% pass; with this patch applied I get one failure:
Running test read-write.t                                           
cqe->res=33, expected=32
test_rem_buf_single(BUFFERS + 1) failed
Not root, skipping test_write_efbig
Test read-write.t failed with ret 1

Trying this patch out on other arches to see if it also affects them or 
is ia64-specific.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements
  2023-07-16 18:03                     ` matoro
@ 2023-07-16 20:54                       ` Helge Deller
  0 siblings, 0 replies; 36+ messages in thread
From: Helge Deller @ 2023-07-16 20:54 UTC (permalink / raw)
  To: matoro; +Cc: Jens Axboe, io-uring, Linux Ia64, glaubitz, Sam James

On 7/16/23 20:03, matoro wrote:
> [...]
> This seems really close.  It worked for the trivial test case, so I ran the test suite from https://github.com/axboe/liburing to compare.  With kernel 6.3 I get a 100% pass; with this patch applied I get one failure:
> Running test read-write.t
> cqe->res=33, expected=32
> test_rem_buf_single(BUFFERS + 1) failed
> Not root, skipping test_write_efbig
> Test read-write.t failed with ret 1
>
> Trying this patch out on other arches to see if it also affects them or is ia64-specific.

I'm sorry, but this patch does break parisc heavily...

I'll need to think more...

Helge




^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2023-07-16 20:54 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-14 17:16 [PATCHSET 0/5] User mapped provided buffer rings Jens Axboe
2023-03-14 17:16 ` [PATCH 1/5] io_uring: Adjust mapping wrt architecture aliasing requirements Jens Axboe
2023-07-12  4:43   ` matoro
2023-07-12 16:24     ` Helge Deller
2023-07-12 17:28       ` matoro
2023-07-12 19:05         ` Helge Deller
2023-07-12 20:30           ` Helge Deller
2023-07-13  0:35             ` matoro
2023-07-13  7:27               ` Helge Deller
2023-07-13 23:57                 ` matoro
2023-07-16  6:54                   ` Helge Deller
2023-07-16 18:03                     ` matoro
2023-07-16 20:54                       ` Helge Deller
2023-03-14 17:16 ` [PATCH 2/5] io_uring/kbuf: move pinning of provided buffer ring into helper Jens Axboe
2023-03-14 17:16 ` [PATCH 3/5] io_uring/kbuf: add buffer_list->is_mapped member Jens Axboe
2023-03-14 17:16 ` [PATCH 4/5] io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to'flags' Jens Axboe
2023-03-14 17:16 ` [PATCH 5/5] io_uring: add support for user mapped provided buffer ring Jens Axboe
2023-03-16 18:07   ` Ammar Faizi
2023-03-16 18:42     ` Jens Axboe
2023-03-15 20:03 ` [PATCHSET 0/5] User mapped provided buffer rings Helge Deller
2023-03-15 20:07   ` Helge Deller
2023-03-15 20:38     ` Jens Axboe
2023-03-15 21:04       ` John David Anglin
2023-03-15 21:08         ` Jens Axboe
2023-03-15 21:18       ` Jens Axboe
2023-03-16 10:18         ` Helge Deller
2023-03-16 17:00           ` Jens Axboe
2023-03-16 19:08         ` John David Anglin
2023-03-16 19:46           ` Jens Axboe
2023-03-17  2:09             ` Jens Axboe
2023-03-17  2:17               ` Jens Axboe
2023-03-17 15:36                 ` John David Anglin
2023-03-17 15:57                   ` Jens Axboe
2023-03-17 16:15                     ` John David Anglin
2023-03-17 16:37                       ` Jens Axboe
2023-03-15 20:11   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).