fio.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Add support for uring-passthrough in t/io_uring
       [not found] <CGME20220824094118epcas5p461b651ef5c3b1b04fc731275cf2fe0c2@epcas5p4.samsung.com>
@ 2022-08-24  9:31 ` Anuj Gupta
       [not found]   ` <CGME20220824094125epcas5p48b88f023cce31317adc68abc225df95c@epcas5p4.samsung.com>
       [not found]   ` <CGME20220824094134epcas5p4b7cacb00484146b5d510ded15a80c712@epcas5p4.samsung.com>
  0 siblings, 2 replies; 7+ messages in thread
From: Anuj Gupta @ 2022-08-24  9:31 UTC (permalink / raw)
  To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta

This series adds support for measuring peak performance of uring-passthrough path
using t/io_uring utility. Added new -u1 option, that makes t/io_uring to do io using
nvme passthrough commands.

Uring-passthrough on nvme-generic device:

root@test:/home/test/gost/anuj/github/fio# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
submitter=0, tid=11289, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=1.88M, BW=917MiB/s, IOS/call=32/31
IOPS=1.88M, BW=916MiB/s, IOS/call=32/32
IOPS=1.88M, BW=915MiB/s, IOS/call=32/32
IOPS=1.88M, BW=915MiB/s, IOS/call=32/32
IOPS=1.87M, BW=914MiB/s, IOS/call=32/32
IOPS=1.87M, BW=913MiB/s, IOS/call=32/32
^CExiting on signal
Maximum IOPS=1.88M

Regular io_uring on nvme-block device:

root@test:/home/test/gost/anuj/github/fio# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -n1 /dev/nvme0n1
submitter=0, tid=11292, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=0/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=1.48M, BW=724MiB/s, IOS/call=32/31
IOPS=1.48M, BW=722MiB/s, IOS/call=32/32
IOPS=1.48M, BW=721MiB/s, IOS/call=32/32
IOPS=1.48M, BW=721MiB/s, IOS/call=32/32
^CExiting on signal
Maximum IOPS=1.48M



Anuj Gupta (2):
  t/io_uring: prep for including engines/nvme.h in t/io_uring
  t/io_uring: add support for async-passthru

 t/io_uring.c | 164 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 147 insertions(+), 17 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] t/io_uring: prep for including engines/nvme.h in t/io_uring
       [not found]   ` <CGME20220824094125epcas5p48b88f023cce31317adc68abc225df95c@epcas5p4.samsung.com>
@ 2022-08-24  9:31     ` Anuj Gupta
  0 siblings, 0 replies; 7+ messages in thread
From: Anuj Gupta @ 2022-08-24  9:31 UTC (permalink / raw)
  To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta

Change page_size and cal_clat_percentiles name to something different
as these are indirectly picked from engines/nvme.h (fio.h and stat.h)

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 t/io_uring.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/t/io_uring.c b/t/io_uring.c
index 35bf1956..a42abd46 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -117,7 +117,7 @@ static struct submitter *submitter;
 static volatile int finish;
 static int stats_running;
 static unsigned long max_iops;
-static long page_size;
+static long t_io_uring_page_size;
 
 static int depth = DEPTH;
 static int batch_submit = BATCH_SUBMIT;
@@ -195,9 +195,9 @@ static unsigned long plat_idx_to_val(unsigned int idx)
 	return cycles_to_nsec(base + ((k + 0.5) * (1 << error_bits)));
 }
 
-unsigned int calc_clat_percentiles(unsigned long *io_u_plat, unsigned long nr,
-				   unsigned long **output,
-				   unsigned long *maxv, unsigned long *minv)
+unsigned int calculate_clat_percentiles(unsigned long *io_u_plat,
+		unsigned long nr, unsigned long **output,
+		unsigned long *maxv, unsigned long *minv)
 {
 	unsigned long sum = 0;
 	unsigned int len = plist_len, i, j = 0;
@@ -251,7 +251,7 @@ static void show_clat_percentiles(unsigned long *io_u_plat, unsigned long nr,
 	bool is_last;
 	char fmt[32];
 
-	len = calc_clat_percentiles(io_u_plat, nr, &ovals, &maxv, &minv);
+	len = calculate_clat_percentiles(io_u_plat, nr, &ovals, &maxv, &minv);
 	if (!len || !ovals)
 		goto out;
 
@@ -786,7 +786,7 @@ static void *allocate_mem(struct submitter *s, int size)
 		return numa_alloc_onnode(size, s->numa_node);
 #endif
 
-	if (posix_memalign(&buf, page_size, bs)) {
+	if (posix_memalign(&buf, t_io_uring_page_size, bs)) {
 		printf("failed alloc\n");
 		return NULL;
 	}
@@ -1543,9 +1543,9 @@ int main(int argc, char *argv[])
 
 	arm_sig_int();
 
-	page_size = sysconf(_SC_PAGESIZE);
-	if (page_size < 0)
-		page_size = 4096;
+	t_io_uring_page_size = sysconf(_SC_PAGESIZE);
+	if (t_io_uring_page_size < 0)
+		t_io_uring_page_size = 4096;
 
 	for (j = 0; j < nthreads; j++) {
 		s = get_submitter(j);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] t/io_uring: add support for async-passthru
       [not found]   ` <CGME20220824094134epcas5p4b7cacb00484146b5d510ded15a80c712@epcas5p4.samsung.com>
@ 2022-08-24  9:31     ` Anuj Gupta
  2022-08-24 20:09       ` Jens Axboe
  2022-08-25  8:24       ` Kanchan Joshi
  0 siblings, 2 replies; 7+ messages in thread
From: Anuj Gupta @ 2022-08-24  9:31 UTC (permalink / raw)
  To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta

This patch adds support for async-passthru in t/io_uring. User needs to
specify -u1 option in the command

Example commandline:
t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 t/io_uring.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 138 insertions(+), 8 deletions(-)

diff --git a/t/io_uring.c b/t/io_uring.c
index a42abd46..78fedaf1 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -35,6 +35,7 @@
 #include "../lib/rand.h"
 #include "../minmax.h"
 #include "../os/linux/io_uring.h"
+#include "../engines/nvme.h"
 
 struct io_sq_ring {
 	unsigned *head;
@@ -139,6 +140,9 @@ static int random_io = 1;	/* random or sequential IO */
 static int register_ring = 1;	/* register ring */
 static int use_sync = 0;	/* use preadv2 */
 static int numa_placement = 0;	/* set to node of device */
+static int pt = 0;		/* passthrough I/O or not */
+static unsigned int nsid = 0;	/* nsid field required for nvme-passthrough */
+static unsigned int lbs = 0;	/* logical_block_size field required for nvme-passthrough */
 
 static unsigned long tsc_rate;
 
@@ -161,6 +165,54 @@ struct io_uring_map_buffers {
 };
 #endif
 
+static int nvme_identify(int fd, __u32 nsid, enum nvme_identify_cns cns,
+			 enum nvme_csi csi, void *data)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode         = nvme_admin_identify,
+		.nsid           = nsid,
+		.addr           = (__u64)(uintptr_t)data,
+		.data_len       = NVME_IDENTIFY_DATA_SIZE,
+		.cdw10          = cns,
+		.cdw11          = csi << NVME_IDENTIFY_CSI_SHIFT,
+		.timeout_ms     = NVME_DEFAULT_IOCTL_TIMEOUT,
+	};
+
+	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
+}
+
+static int nvme_get_info(int fd, __u32 *nsid, __u32 *lba_sz, __u64 *nlba)
+{
+	struct nvme_id_ns ns;
+	int namespace_id;
+	int err;
+
+	namespace_id = ioctl(fd, NVME_IOCTL_ID);
+	if (namespace_id < 0) {
+		fprintf(stderr, "error failed to fetch namespace-id\n");
+		close(fd);
+		return -errno;
+	}
+
+	/*
+	 * Identify namespace to get namespace-id, namespace size in LBA's
+	 * and LBA data size.
+	 */
+	err = nvme_identify(fd, namespace_id, NVME_IDENTIFY_CNS_NS,
+				NVME_CSI_NVM, &ns);
+	if (err) {
+		fprintf(stderr, "error failed to fetch identify namespace\n");
+		close(fd);
+		return err;
+	}
+
+	*nsid = namespace_id;
+	*lba_sz = 1 << ns.lbaf[(ns.flbas & 0x0f)].ds;
+	*nlba = ns.nsze;
+
+	return 0;
+}
+
 static unsigned long cycles_to_nsec(unsigned long cycles)
 {
 	uint64_t val;
@@ -457,12 +509,13 @@ static unsigned file_depth(struct submitter *s)
 
 static void init_io(struct submitter *s, unsigned index)
 {
-	struct io_uring_sqe *sqe = &s->sqes[index];
+	struct io_uring_sqe *sqe = NULL;
 	unsigned long offset;
 	struct file *f;
 	long r;
 
 	if (do_nop) {
+		sqe = &s->sqes[index];
 		sqe->opcode = IORING_OP_NOP;
 		return;
 	}
@@ -490,6 +543,41 @@ static void init_io(struct submitter *s, unsigned index)
 			f->cur_off = 0;
 	}
 
+	if (pt) {
+		struct nvme_uring_cmd *cmd;
+		unsigned long long slba;
+		unsigned long long nlb;
+
+		sqe = &s->sqes[index<<1];
+		if (register_files) {
+			sqe->fd = f->fixed_fd;
+			sqe->flags = IOSQE_FIXED_FILE;
+		} else {
+			sqe->fd = f->real_fd;
+			sqe->flags = 0;
+		}
+		sqe->opcode = IORING_OP_URING_CMD;
+		sqe->user_data = (unsigned long) f->fileno;
+		if (stats)
+			sqe->user_data |= ((unsigned long)s->clock_index << 32);
+		sqe->cmd_op = NVME_URING_CMD_IO;
+		slba = offset / lbs;
+		nlb = bs/lbs - 1;
+		cmd = (struct nvme_uring_cmd *)&sqe->cmd;
+		memset(cmd, 0, sizeof(struct nvme_uring_cmd));
+		/* cdw10 and cdw11 represent starting slba*/
+		cmd->cdw10 = slba & 0xffffffff;
+		cmd->cdw11 = slba >> 32;
+		/* cdw12 represent number of lba to be read*/
+		cmd->cdw12 = nlb;
+		cmd->addr = (unsigned long) s->iovecs[index].iov_base;
+		cmd->data_len = bs;
+		cmd->nsid = nsid;
+		cmd->opcode = 2;
+		return;
+	}
+
+	sqe = &s->sqes[index];
 	if (register_files) {
 		sqe->flags = IOSQE_FIXED_FILE;
 		sqe->fd = f->fixed_fd;
@@ -549,7 +637,22 @@ static int get_file_size(struct file *f)
 
 	if (fstat(f->real_fd, &st) < 0)
 		return -1;
-	if (S_ISBLK(st.st_mode)) {
+	if (pt) {
+		__u64 nlba;
+		int ret;
+
+		if (!S_ISCHR(st.st_mode)) {
+			fprintf(stderr, "passthrough works with only nvme-ns "
+					"generic devices (/dev/ngXnY)\n");
+			return -1;
+		}
+		ret = nvme_get_info(f->real_fd, &nsid, &lbs, &nlba);
+		if (ret)
+			return -1;
+		f->max_blocks = nlba / bs;
+		f->max_size = nlba;
+		return 0;
+	} else if (S_ISBLK(st.st_mode)) {
 		unsigned long long bytes;
 
 		if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
@@ -587,11 +690,14 @@ static int reap_events_uring(struct submitter *s)
 
 			f = &s->files[fileno];
 			f->pending_ios--;
-			if (cqe->res != bs) {
+			if (!pt && cqe->res != bs) {
 				printf("io: unexpected ret=%d\n", cqe->res);
 				if (polled && cqe->res == -EOPNOTSUPP)
 					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
 				return -1;
+			} else if (pt && cqe->res != 0) {
+				printf("io: unexpected ret=%d\n", cqe->res);
+				return -1;
 			}
 		}
 		if (stats) {
@@ -697,6 +803,7 @@ static int setup_ring(struct submitter *s)
 	struct io_uring_params p;
 	int ret, fd;
 	void *ptr;
+	size_t len;
 
 	memset(&p, 0, sizeof(p));
 
@@ -709,6 +816,10 @@ static int setup_ring(struct submitter *s)
 			p.sq_thread_cpu = sq_thread_cpu;
 		}
 	}
+	if (pt) {
+		p.flags |= IORING_SETUP_SQE128;
+		p.flags |= IORING_SETUP_CQE32;
+	}
 
 	fd = io_uring_setup(depth, &p);
 	if (fd < 0) {
@@ -761,11 +872,22 @@ static int setup_ring(struct submitter *s)
 	sring->array = ptr + p.sq_off.array;
 	sq_ring_mask = *sring->ring_mask;
 
-	s->sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
+	if (p.flags & IORING_SETUP_SQE128)
+		len = 2 * p.sq_entries * sizeof(struct io_uring_sqe);
+	else
+		len = p.sq_entries * sizeof(struct io_uring_sqe);
+	s->sqes = mmap(0, len,
 			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
 			IORING_OFF_SQES);
 
-	ptr = mmap(0, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
+	if (p.flags & IORING_SETUP_CQE32) {
+		len = p.cq_off.cqes +
+			2 * p.cq_entries * sizeof(struct io_uring_cqe);
+	} else {
+		len = p.cq_off.cqes +
+			p.cq_entries * sizeof(struct io_uring_cqe);
+	}
+	ptr = mmap(0, len,
 			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
 			IORING_OFF_CQ_RING);
 	cring->head = ptr + p.cq_off.head;
@@ -1306,11 +1428,12 @@ static void usage(char *argv, int status)
 		" -a <bool> : Use legacy aio, default %d\n"
 		" -S <bool> : Use sync IO (preadv2), default %d\n"
 		" -X <bool> : Use registered ring %d\n"
-		" -P <bool> : Automatically place on device home node %d\n",
+		" -P <bool> : Automatically place on device home node %d\n"
+		" -u <bool> : Use nvme-passthrough I/O, default %d\n",
 		argv, DEPTH, BATCH_SUBMIT, BATCH_COMPLETE, BS, polled,
 		fixedbufs, dma_map, register_files, nthreads, !buffered, do_nop,
 		stats, runtime == 0 ? "unlimited" : runtime_str, random_io, aio,
-		use_sync, register_ring, numa_placement);
+		use_sync, register_ring, numa_placement, pt);
 	exit(status);
 }
 
@@ -1369,7 +1492,7 @@ int main(int argc, char *argv[])
 	if (!do_nop && argc < 2)
 		usage(argv[0], 1);
 
-	while ((opt = getopt(argc, argv, "d:s:c:b:p:B:F:n:N:O:t:T:a:r:D:R:X:S:P:h?")) != -1) {
+	while ((opt = getopt(argc, argv, "d:s:c:b:p:B:F:n:N:O:t:T:a:r:D:R:X:S:P:u:h?")) != -1) {
 		switch (opt) {
 		case 'a':
 			aio = !!atoi(optarg);
@@ -1450,6 +1573,9 @@ int main(int argc, char *argv[])
 		case 'P':
 			numa_placement = !!atoi(optarg);
 			break;
+		case 'u':
+			pt = !!atoi(optarg);
+			break;
 		case 'h':
 		case '?':
 		default:
@@ -1512,6 +1638,10 @@ int main(int argc, char *argv[])
 			printf("failed getting size of device/file\n");
 			return 1;
 		}
+		if (pt && bs < lbs) {
+			printf("error: bs:%d is less than logical_block_size:%d\n", bs, lbs);
+			return 1;
+		}
 		if (f.max_blocks <= 1) {
 			printf("Zero file/device size?\n");
 			return 1;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] t/io_uring: add support for async-passthru
  2022-08-24  9:31     ` [PATCH 2/2] t/io_uring: add support for async-passthru Anuj Gupta
@ 2022-08-24 20:09       ` Jens Axboe
  2022-08-25 11:20         ` Anuj Gupta
  2022-08-25  8:24       ` Kanchan Joshi
  1 sibling, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2022-08-24 20:09 UTC (permalink / raw)
  To: Anuj Gupta, vincentfu; +Cc: joshi.k, ankit.kumar, fio

On 8/24/22 3:31 AM, Anuj Gupta wrote:
> @@ -457,12 +509,13 @@ static unsigned file_depth(struct submitter *s)
>  
>  static void init_io(struct submitter *s, unsigned index)
>  {
> -	struct io_uring_sqe *sqe = &s->sqes[index];
> +	struct io_uring_sqe *sqe = NULL;
>  	unsigned long offset;
>  	struct file *f;
>  	long r;
>  
>  	if (do_nop) {
> +		sqe = &s->sqes[index];
>  		sqe->opcode = IORING_OP_NOP;
>  		return;

I'd just leave the initial assignment rather than have two manual ones
for the same thing, and then just have the 'pt' path override it.

> @@ -490,6 +543,41 @@ static void init_io(struct submitter *s, unsigned index)
>  			f->cur_off = 0;
>  	}
>  
> +	if (pt) {
> +		struct nvme_uring_cmd *cmd;
> +		unsigned long long slba;
> +		unsigned long long nlb;
> +
> +		sqe = &s->sqes[index<<1];

		sqe = &s->sqes[index << 1];

please.

> +		if (register_files) {
> +			sqe->fd = f->fixed_fd;
> +			sqe->flags = IOSQE_FIXED_FILE;
> +		} else {
> +			sqe->fd = f->real_fd;
> +			sqe->flags = 0;
> +		}
> +		sqe->opcode = IORING_OP_URING_CMD;
> +		sqe->user_data = (unsigned long) f->fileno;
> +		if (stats)
> +			sqe->user_data |= ((unsigned long)s->clock_index << 32);
> +		sqe->cmd_op = NVME_URING_CMD_IO;
> +		slba = offset / lbs;
> +		nlb = bs/lbs - 1;

You have two expensive integer divisions here. Would make sense to turn
lbs into a shift instead, those are a lot cheaper to do. And then you
can put the expensive part in the setup part rather than have it in the
fast path.

Another thing to consider is to make the pt path separate rather than
need a bunch of 'if (pt) ... else ...' in here?

> @@ -587,11 +690,14 @@ static int reap_events_uring(struct submitter *s)
>  
>  			f = &s->files[fileno];
>  			f->pending_ios--;
> -			if (cqe->res != bs) {
> +			if (!pt && cqe->res != bs) {
>  				printf("io: unexpected ret=%d\n", cqe->res);
>  				if (polled && cqe->res == -EOPNOTSUPP)
>  					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
>  				return -1;
> +			} else if (pt && cqe->res != 0) {
> +				printf("io: unexpected ret=%d\n", cqe->res);
> +				return -1;
>  			}
>  		}
>  		if (stats) {

These two would be examples of things that would be cleaner with a
separate path too.

> @@ -1306,11 +1428,12 @@ static void usage(char *argv, int status)
>  		" -a <bool> : Use legacy aio, default %d\n"
>  		" -S <bool> : Use sync IO (preadv2), default %d\n"
>  		" -X <bool> : Use registered ring %d\n"
> -		" -P <bool> : Automatically place on device home node %d\n",
> +		" -P <bool> : Automatically place on device home node %d\n"
> +		" -u <bool> : Use nvme-passthrough I/O, default %d\n",
>  		argv, DEPTH, BATCH_SUBMIT, BATCH_COMPLETE, BS, polled,
>  		fixedbufs, dma_map, register_files, nthreads, !buffered, do_nop,
>  		stats, runtime == 0 ? "unlimited" : runtime_str, random_io, aio,
> -		use_sync, register_ring, numa_placement);
> +		use_sync, register_ring, numa_placement, pt);
>  	exit(status);

Reminds me, should the nsid and lbs be in the submitter struct?

Thanks for doing this work!!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] t/io_uring: add support for async-passthru
  2022-08-24  9:31     ` [PATCH 2/2] t/io_uring: add support for async-passthru Anuj Gupta
  2022-08-24 20:09       ` Jens Axboe
@ 2022-08-25  8:24       ` Kanchan Joshi
  2022-08-25 11:39         ` Anuj Gupta
  1 sibling, 1 reply; 7+ messages in thread
From: Kanchan Joshi @ 2022-08-25  8:24 UTC (permalink / raw)
  To: Anuj Gupta; +Cc: axboe, vincentfu, ankit.kumar, fio

[-- Attachment #1: Type: text/plain, Size: 6659 bytes --]

On Wed, Aug 24, 2022 at 03:01:09PM +0530, Anuj Gupta wrote:
>This patch adds support for async-passthru in t/io_uring. User needs to
>specify -u1 option in the command
>
>Example commandline:
>t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
>
>Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>---
> t/io_uring.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 138 insertions(+), 8 deletions(-)
>
>diff --git a/t/io_uring.c b/t/io_uring.c
>index a42abd46..78fedaf1 100644
>--- a/t/io_uring.c
>+++ b/t/io_uring.c
>@@ -35,6 +35,7 @@
> #include "../lib/rand.h"
> #include "../minmax.h"
> #include "../os/linux/io_uring.h"
>+#include "../engines/nvme.h"
>
> struct io_sq_ring {
> 	unsigned *head;
>@@ -139,6 +140,9 @@ static int random_io = 1;	/* random or sequential IO */
> static int register_ring = 1;	/* register ring */
> static int use_sync = 0;	/* use preadv2 */
> static int numa_placement = 0;	/* set to node of device */
>+static int pt = 0;		/* passthrough I/O or not */
>+static unsigned int nsid = 0;	/* nsid field required for nvme-passthrough */
>+static unsigned int lbs = 0;	/* logical_block_size field required for nvme-passthrough */
>
> static unsigned long tsc_rate;
>
>@@ -161,6 +165,54 @@ struct io_uring_map_buffers {
> };
> #endif
>
>+static int nvme_identify(int fd, __u32 nsid, enum nvme_identify_cns cns,
>+			 enum nvme_csi csi, void *data)
>+{
>+	struct nvme_passthru_cmd cmd = {
>+		.opcode         = nvme_admin_identify,
>+		.nsid           = nsid,
>+		.addr           = (__u64)(uintptr_t)data,
>+		.data_len       = NVME_IDENTIFY_DATA_SIZE,
>+		.cdw10          = cns,
>+		.cdw11          = csi << NVME_IDENTIFY_CSI_SHIFT,
>+		.timeout_ms     = NVME_DEFAULT_IOCTL_TIMEOUT,
>+	};
>+
>+	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
>+}
>+
>+static int nvme_get_info(int fd, __u32 *nsid, __u32 *lba_sz, __u64 *nlba)
>+{
>+	struct nvme_id_ns ns;
>+	int namespace_id;
>+	int err;
>+
>+	namespace_id = ioctl(fd, NVME_IOCTL_ID);
>+	if (namespace_id < 0) {
>+		fprintf(stderr, "error failed to fetch namespace-id\n");
>+		close(fd);
>+		return -errno;
>+	}
>+
>+	/*
>+	 * Identify namespace to get namespace-id, namespace size in LBA's
>+	 * and LBA data size.
>+	 */
>+	err = nvme_identify(fd, namespace_id, NVME_IDENTIFY_CNS_NS,
>+				NVME_CSI_NVM, &ns);
>+	if (err) {
>+		fprintf(stderr, "error failed to fetch identify namespace\n");
>+		close(fd);
>+		return err;
>+	}
>+
>+	*nsid = namespace_id;
>+	*lba_sz = 1 << ns.lbaf[(ns.flbas & 0x0f)].ds;
>+	*nlba = ns.nsze;
>+
>+	return 0;
>+}
>+
> static unsigned long cycles_to_nsec(unsigned long cycles)
> {
> 	uint64_t val;
>@@ -457,12 +509,13 @@ static unsigned file_depth(struct submitter *s)
>
> static void init_io(struct submitter *s, unsigned index)
> {
>-	struct io_uring_sqe *sqe = &s->sqes[index];
>+	struct io_uring_sqe *sqe = NULL;
> 	unsigned long offset;
> 	struct file *f;
> 	long r;
>
> 	if (do_nop) {
>+		sqe = &s->sqes[index];

Nit: this and previous change is not must as <see below>

> 		sqe->opcode = IORING_OP_NOP;
> 		return;
> 	}
>@@ -490,6 +543,41 @@ static void init_io(struct submitter *s, unsigned index)
> 			f->cur_off = 0;
> 	}
>
>+	if (pt) {
>+		struct nvme_uring_cmd *cmd;
>+		unsigned long long slba;
>+		unsigned long long nlb;
>+
>+		sqe = &s->sqes[index<<1];

this is enough.

>+		if (register_files) {
>+			sqe->fd = f->fixed_fd;
>+			sqe->flags = IOSQE_FIXED_FILE;
>+		} else {
>+			sqe->fd = f->real_fd;
>+			sqe->flags = 0;
>+		}
>+		sqe->opcode = IORING_OP_URING_CMD;
>+		sqe->user_data = (unsigned long) f->fileno;
>+		if (stats)
>+			sqe->user_data |= ((unsigned long)s->clock_index << 32);
>+		sqe->cmd_op = NVME_URING_CMD_IO;
>+		slba = offset / lbs;
>+		nlb = bs/lbs - 1;
>+		cmd = (struct nvme_uring_cmd *)&sqe->cmd;
>+		memset(cmd, 0, sizeof(struct nvme_uring_cmd));
>+		/* cdw10 and cdw11 represent starting slba*/
>+		cmd->cdw10 = slba & 0xffffffff;
>+		cmd->cdw11 = slba >> 32;
>+		/* cdw12 represent number of lba to be read*/
>+		cmd->cdw12 = nlb;
>+		cmd->addr = (unsigned long) s->iovecs[index].iov_base;
>+		cmd->data_len = bs;
>+		cmd->nsid = nsid;
>+		cmd->opcode = 2;
>+		return;
>+	}
>+
>+	sqe = &s->sqes[index];
> 	if (register_files) {
> 		sqe->flags = IOSQE_FIXED_FILE;
> 		sqe->fd = f->fixed_fd;
>@@ -549,7 +637,22 @@ static int get_file_size(struct file *f)
>
> 	if (fstat(f->real_fd, &st) < 0)
> 		return -1;
>-	if (S_ISBLK(st.st_mode)) {
>+	if (pt) {
>+		__u64 nlba;
>+		int ret;
>+
>+		if (!S_ISCHR(st.st_mode)) {
>+			fprintf(stderr, "passthrough works with only nvme-ns "
>+					"generic devices (/dev/ngXnY)\n");
>+			return -1;
>+		}
>+		ret = nvme_get_info(f->real_fd, &nsid, &lbs, &nlba);
>+		if (ret)
>+			return -1;
>+		f->max_blocks = nlba / bs;
>+		f->max_size = nlba;
>+		return 0;
>+	} else if (S_ISBLK(st.st_mode)) {
> 		unsigned long long bytes;
>
> 		if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
>@@ -587,11 +690,14 @@ static int reap_events_uring(struct submitter *s)
>
> 			f = &s->files[fileno];
> 			f->pending_ios--;
>-			if (cqe->res != bs) {
>+			if (!pt && cqe->res != bs) {
> 				printf("io: unexpected ret=%d\n", cqe->res);
> 				if (polled && cqe->res == -EOPNOTSUPP)
> 					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
> 				return -1;
>+			} else if (pt && cqe->res != 0) {
>+				printf("io: unexpected ret=%d\n", cqe->res);
>+				return -1;

Missing inner check (polled one) for pt case?
How about something like this fragement -

                if (!do_nop) {
                        int fileno = cqe->user_data & 0xffffffff;
+                       bool error;

                        f = &s->files[fileno];
                        f->pending_ios--;
-                       if (!pt && cqe->res != bs) {
+                       if (!pt)
+                               error = (cqe->res != bs);
+                       else
+                               error = (cqe->res != 0);
+                       if (error) {
                                printf("io: unexpected ret=%d\n", cqe->res);
                                if (polled && cqe->res == -EOPNOTSUPP)
                                        printf("Your filesystem/driver/kernel doesn't support polled IO\n");
                                return -1;

[...]

>@@ -1512,6 +1638,10 @@ int main(int argc, char *argv[])
> 			printf("failed getting size of device/file\n");
> 			return 1;
> 		}
>+		if (pt && bs < lbs) {
>+			printf("error: bs:%d is less than logical_block_size:%d\n", bs, lbs);
>+			return 1;
>+		}

would it be better to replace the check/print with "bs % lbs" instead?
since io size needs to be multiple of lbs to go through.


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] t/io_uring: add support for async-passthru
  2022-08-24 20:09       ` Jens Axboe
@ 2022-08-25 11:20         ` Anuj Gupta
  0 siblings, 0 replies; 7+ messages in thread
From: Anuj Gupta @ 2022-08-25 11:20 UTC (permalink / raw)
  To: Jens Axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio

[-- Attachment #1: Type: text/plain, Size: 3931 bytes --]

On Wed, Aug 24, 2022 at 02:09:53PM -0600, Jens Axboe wrote:
> On 8/24/22 3:31 AM, Anuj Gupta wrote:
> > @@ -457,12 +509,13 @@ static unsigned file_depth(struct submitter *s)
> >  
> >  static void init_io(struct submitter *s, unsigned index)
> >  {
> > -	struct io_uring_sqe *sqe = &s->sqes[index];
> > +	struct io_uring_sqe *sqe = NULL;
> >  	unsigned long offset;
> >  	struct file *f;
> >  	long r;
> >  
> >  	if (do_nop) {
> > +		sqe = &s->sqes[index];
> >  		sqe->opcode = IORING_OP_NOP;
> >  		return;
> 
> I'd just leave the initial assignment rather than have two manual ones
> for the same thing, and then just have the 'pt' path override it.
>

right, makes sense

> > @@ -490,6 +543,41 @@ static void init_io(struct submitter *s, unsigned index)
> >  			f->cur_off = 0;
> >  	}
> >  
> > +	if (pt) {
> > +		struct nvme_uring_cmd *cmd;
> > +		unsigned long long slba;
> > +		unsigned long long nlb;
> > +
> > +		sqe = &s->sqes[index<<1];
> 
> 		sqe = &s->sqes[index << 1];
> 
> please.
>

acked

> > +		if (register_files) {
> > +			sqe->fd = f->fixed_fd;
> > +			sqe->flags = IOSQE_FIXED_FILE;
> > +		} else {
> > +			sqe->fd = f->real_fd;
> > +			sqe->flags = 0;
> > +		}
> > +		sqe->opcode = IORING_OP_URING_CMD;
> > +		sqe->user_data = (unsigned long) f->fileno;
> > +		if (stats)
> > +			sqe->user_data |= ((unsigned long)s->clock_index << 32);
> > +		sqe->cmd_op = NVME_URING_CMD_IO;
> > +		slba = offset / lbs;
> > +		nlb = bs/lbs - 1;
> 
> You have two expensive integer divisions here. Would make sense to turn
> lbs into a shift instead, those are a lot cheaper to do. And then you
> can put the expensive part in the setup part rather than have it in the
> fast path.
>

sure, will change division to shift here. Will see if I can move the expensive
nvme command formation part in the setup (to the submitter_init function)

> Another thing to consider is to make the pt path separate rather than
> need a bunch of 'if (pt) ... else ...' in here?
> 

right, for passthru we do things differently compared to existing path primarily during
"init_io" and "reap_events_uring". I think we can have a "init_io_pt" and "reap_events_uring_pt"
specific to passthru. But i think we can keep using the same "submitter_uring_fn" for both existing
path and passthru as that part should still remain the same

> > @@ -587,11 +690,14 @@ static int reap_events_uring(struct submitter *s)
> >  
> >  			f = &s->files[fileno];
> >  			f->pending_ios--;
> > -			if (cqe->res != bs) {
> > +			if (!pt && cqe->res != bs) {
> >  				printf("io: unexpected ret=%d\n", cqe->res);
> >  				if (polled && cqe->res == -EOPNOTSUPP)
> >  					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
> >  				return -1;
> > +			} else if (pt && cqe->res != 0) {
> > +				printf("io: unexpected ret=%d\n", cqe->res);
> > +				return -1;
> >  			}
> >  		}
> >  		if (stats) {
> 
> These two would be examples of things that would be cleaner with a
> separate path too.
> 
> > @@ -1306,11 +1428,12 @@ static void usage(char *argv, int status)
> >  		" -a <bool> : Use legacy aio, default %d\n"
> >  		" -S <bool> : Use sync IO (preadv2), default %d\n"
> >  		" -X <bool> : Use registered ring %d\n"
> > -		" -P <bool> : Automatically place on device home node %d\n",
> > +		" -P <bool> : Automatically place on device home node %d\n"
> > +		" -u <bool> : Use nvme-passthrough I/O, default %d\n",
> >  		argv, DEPTH, BATCH_SUBMIT, BATCH_COMPLETE, BS, polled,
> >  		fixedbufs, dma_map, register_files, nthreads, !buffered, do_nop,
> >  		stats, runtime == 0 ? "unlimited" : runtime_str, random_io, aio,
> > -		use_sync, register_ring, numa_placement);
> > +		use_sync, register_ring, numa_placement, pt);
> >  	exit(status);
> 
> Reminds me, should the nsid and lbs be in the submitter struct?
>

right, will make these two a part of submitter struct

> Thanks for doing this work!!
> 
> -- 
> Jens Axboe
> 

--
Anuj Gupta






[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] t/io_uring: add support for async-passthru
  2022-08-25  8:24       ` Kanchan Joshi
@ 2022-08-25 11:39         ` Anuj Gupta
  0 siblings, 0 replies; 7+ messages in thread
From: Anuj Gupta @ 2022-08-25 11:39 UTC (permalink / raw)
  To: Kanchan Joshi; +Cc: axboe, vincentfu, ankit.kumar, fio

[-- Attachment #1: Type: text/plain, Size: 7596 bytes --]

On Thu, Aug 25, 2022 at 01:54:44PM +0530, Kanchan Joshi wrote:
> On Wed, Aug 24, 2022 at 03:01:09PM +0530, Anuj Gupta wrote:
> > This patch adds support for async-passthru in t/io_uring. User needs to
> > specify -u1 option in the command
> > 
> > Example commandline:
> > t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
> > 
> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> > ---
> > t/io_uring.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 138 insertions(+), 8 deletions(-)
> > 
> > diff --git a/t/io_uring.c b/t/io_uring.c
> > index a42abd46..78fedaf1 100644
> > --- a/t/io_uring.c
> > +++ b/t/io_uring.c
> > @@ -35,6 +35,7 @@
> > #include "../lib/rand.h"
> > #include "../minmax.h"
> > #include "../os/linux/io_uring.h"
> > +#include "../engines/nvme.h"
> > 
> > struct io_sq_ring {
> > 	unsigned *head;
> > @@ -139,6 +140,9 @@ static int random_io = 1;	/* random or sequential IO */
> > static int register_ring = 1;	/* register ring */
> > static int use_sync = 0;	/* use preadv2 */
> > static int numa_placement = 0;	/* set to node of device */
> > +static int pt = 0;		/* passthrough I/O or not */
> > +static unsigned int nsid = 0;	/* nsid field required for nvme-passthrough */
> > +static unsigned int lbs = 0;	/* logical_block_size field required for nvme-passthrough */
> > 
> > static unsigned long tsc_rate;
> > 
> > @@ -161,6 +165,54 @@ struct io_uring_map_buffers {
> > };
> > #endif
> > 
> > +static int nvme_identify(int fd, __u32 nsid, enum nvme_identify_cns cns,
> > +			 enum nvme_csi csi, void *data)
> > +{
> > +	struct nvme_passthru_cmd cmd = {
> > +		.opcode         = nvme_admin_identify,
> > +		.nsid           = nsid,
> > +		.addr           = (__u64)(uintptr_t)data,
> > +		.data_len       = NVME_IDENTIFY_DATA_SIZE,
> > +		.cdw10          = cns,
> > +		.cdw11          = csi << NVME_IDENTIFY_CSI_SHIFT,
> > +		.timeout_ms     = NVME_DEFAULT_IOCTL_TIMEOUT,
> > +	};
> > +
> > +	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
> > +}
> > +
> > +static int nvme_get_info(int fd, __u32 *nsid, __u32 *lba_sz, __u64 *nlba)
> > +{
> > +	struct nvme_id_ns ns;
> > +	int namespace_id;
> > +	int err;
> > +
> > +	namespace_id = ioctl(fd, NVME_IOCTL_ID);
> > +	if (namespace_id < 0) {
> > +		fprintf(stderr, "error failed to fetch namespace-id\n");
> > +		close(fd);
> > +		return -errno;
> > +	}
> > +
> > +	/*
> > +	 * Identify namespace to get namespace-id, namespace size in LBA's
> > +	 * and LBA data size.
> > +	 */
> > +	err = nvme_identify(fd, namespace_id, NVME_IDENTIFY_CNS_NS,
> > +				NVME_CSI_NVM, &ns);
> > +	if (err) {
> > +		fprintf(stderr, "error failed to fetch identify namespace\n");
> > +		close(fd);
> > +		return err;
> > +	}
> > +
> > +	*nsid = namespace_id;
> > +	*lba_sz = 1 << ns.lbaf[(ns.flbas & 0x0f)].ds;
> > +	*nlba = ns.nsze;
> > +
> > +	return 0;
> > +}
> > +
> > static unsigned long cycles_to_nsec(unsigned long cycles)
> > {
> > 	uint64_t val;
> > @@ -457,12 +509,13 @@ static unsigned file_depth(struct submitter *s)
> > 
> > static void init_io(struct submitter *s, unsigned index)
> > {
> > -	struct io_uring_sqe *sqe = &s->sqes[index];
> > +	struct io_uring_sqe *sqe = NULL;
> > 	unsigned long offset;
> > 	struct file *f;
> > 	long r;
> > 
> > 	if (do_nop) {
> > +		sqe = &s->sqes[index];
> 
> Nit: this and previous change is not must as <see below>
> 

got it, will change this

> > 		sqe->opcode = IORING_OP_NOP;
> > 		return;
> > 	}
> > @@ -490,6 +543,41 @@ static void init_io(struct submitter *s, unsigned index)
> > 			f->cur_off = 0;
> > 	}
> > 
> > +	if (pt) {
> > +		struct nvme_uring_cmd *cmd;
> > +		unsigned long long slba;
> > +		unsigned long long nlb;
> > +
> > +		sqe = &s->sqes[index<<1];
> 
> this is enough.
> 
> > +		if (register_files) {
> > +			sqe->fd = f->fixed_fd;
> > +			sqe->flags = IOSQE_FIXED_FILE;
> > +		} else {
> > +			sqe->fd = f->real_fd;
> > +			sqe->flags = 0;
> > +		}
> > +		sqe->opcode = IORING_OP_URING_CMD;
> > +		sqe->user_data = (unsigned long) f->fileno;
> > +		if (stats)
> > +			sqe->user_data |= ((unsigned long)s->clock_index << 32);
> > +		sqe->cmd_op = NVME_URING_CMD_IO;
> > +		slba = offset / lbs;
> > +		nlb = bs/lbs - 1;
> > +		cmd = (struct nvme_uring_cmd *)&sqe->cmd;
> > +		memset(cmd, 0, sizeof(struct nvme_uring_cmd));
> > +		/* cdw10 and cdw11 represent starting slba*/
> > +		cmd->cdw10 = slba & 0xffffffff;
> > +		cmd->cdw11 = slba >> 32;
> > +		/* cdw12 represent number of lba to be read*/
> > +		cmd->cdw12 = nlb;
> > +		cmd->addr = (unsigned long) s->iovecs[index].iov_base;
> > +		cmd->data_len = bs;
> > +		cmd->nsid = nsid;
> > +		cmd->opcode = 2;
> > +		return;
> > +	}
> > +
> > +	sqe = &s->sqes[index];
> > 	if (register_files) {
> > 		sqe->flags = IOSQE_FIXED_FILE;
> > 		sqe->fd = f->fixed_fd;
> > @@ -549,7 +637,22 @@ static int get_file_size(struct file *f)
> > 
> > 	if (fstat(f->real_fd, &st) < 0)
> > 		return -1;
> > -	if (S_ISBLK(st.st_mode)) {
> > +	if (pt) {
> > +		__u64 nlba;
> > +		int ret;
> > +
> > +		if (!S_ISCHR(st.st_mode)) {
> > +			fprintf(stderr, "passthrough works with only nvme-ns "
> > +					"generic devices (/dev/ngXnY)\n");
> > +			return -1;
> > +		}
> > +		ret = nvme_get_info(f->real_fd, &nsid, &lbs, &nlba);
> > +		if (ret)
> > +			return -1;
> > +		f->max_blocks = nlba / bs;
> > +		f->max_size = nlba;
> > +		return 0;
> > +	} else if (S_ISBLK(st.st_mode)) {
> > 		unsigned long long bytes;
> > 
> > 		if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
> > @@ -587,11 +690,14 @@ static int reap_events_uring(struct submitter *s)
> > 
> > 			f = &s->files[fileno];
> > 			f->pending_ios--;
> > -			if (cqe->res != bs) {
> > +			if (!pt && cqe->res != bs) {
> > 				printf("io: unexpected ret=%d\n", cqe->res);
> > 				if (polled && cqe->res == -EOPNOTSUPP)
> > 					printf("Your filesystem/driver/kernel doesn't support polled IO\n");
> > 				return -1;
> > +			} else if (pt && cqe->res != 0) {
> > +				printf("io: unexpected ret=%d\n", cqe->res);
> > +				return -1;
> 
> Missing inner check (polled one) for pt case?

right, adding a check for polling makes sense.

> How about something like this fragement -
> 
>                if (!do_nop) {
>                        int fileno = cqe->user_data & 0xffffffff;
> +                       bool error;
> 
>                        f = &s->files[fileno];
>                        f->pending_ios--;
> -                       if (!pt && cqe->res != bs) {
> +                       if (!pt)
> +                               error = (cqe->res != bs);
> +                       else
> +                               error = (cqe->res != 0);
> +                       if (error) {
>                                printf("io: unexpected ret=%d\n", cqe->res);
>                                if (polled && cqe->res == -EOPNOTSUPP)
>                                        printf("Your filesystem/driver/kernel doesn't support polled IO\n");
>                                return -1;
> 
> [...]
>

I think, the result is -EINVAL rather than -EOPNOTSUPP, if the user enables polling for passthru i/o.
With that we can check for polled-io for passthru

> > @@ -1512,6 +1638,10 @@ int main(int argc, char *argv[])
> > 			printf("failed getting size of device/file\n");
> > 			return 1;
> > 		}
> > +		if (pt && bs < lbs) {
> > +			printf("error: bs:%d is less than logical_block_size:%d\n", bs, lbs);
> > +			return 1;
> > +		}
> 
> would it be better to replace the check/print with "bs % lbs" instead?
> since io size needs to be multiple of lbs to go through.
> 

right, will modify the check accordingly

--
Anuj Gupta


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-08-25 12:02 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20220824094118epcas5p461b651ef5c3b1b04fc731275cf2fe0c2@epcas5p4.samsung.com>
2022-08-24  9:31 ` [PATCH 0/2] Add support for uring-passthrough in t/io_uring Anuj Gupta
     [not found]   ` <CGME20220824094125epcas5p48b88f023cce31317adc68abc225df95c@epcas5p4.samsung.com>
2022-08-24  9:31     ` [PATCH 1/2] t/io_uring: prep for including engines/nvme.h " Anuj Gupta
     [not found]   ` <CGME20220824094134epcas5p4b7cacb00484146b5d510ded15a80c712@epcas5p4.samsung.com>
2022-08-24  9:31     ` [PATCH 2/2] t/io_uring: add support for async-passthru Anuj Gupta
2022-08-24 20:09       ` Jens Axboe
2022-08-25 11:20         ` Anuj Gupta
2022-08-25  8:24       ` Kanchan Joshi
2022-08-25 11:39         ` Anuj Gupta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).