* [PATCH v2 0/2] Add support for uring-passthrough in t/io_uring
From: Anuj Gupta @ 2022-08-26 11:33 UTC (permalink / raw)
To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta
This series adds support for measuring the peak performance of the
uring-passthrough path with the t/io_uring utility. It adds a new -u option;
with -u1, t/io_uring issues I/O using NVMe passthrough commands.
Uring-passthrough on nvme-generic device:
root@test-MS-7C34:/home/test/upstream/github/fio# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
submitter=0, tid=5415, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=1.97M, BW=959MiB/s, IOS/call=32/31
IOPS=1.96M, BW=956MiB/s, IOS/call=32/32
IOPS=1.96M, BW=957MiB/s, IOS/call=32/32
IOPS=1.96M, BW=956MiB/s, IOS/call=31/31
^CExiting on signal
Maximum IOPS=1.97M
Regular io_uring on nvme-block device:
root@test-MS-7C34:/home/test/upstream/github/fio# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -n1 /dev/nvme0n1
submitter=0, tid=5418, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=0/0, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=1.53M, BW=748MiB/s, IOS/call=32/31
IOPS=1.53M, BW=746MiB/s, IOS/call=32/32
IOPS=1.53M, BW=746MiB/s, IOS/call=32/32
IOPS=1.53M, BW=745MiB/s, IOS/call=32/32
IOPS=1.53M, BW=745MiB/s, IOS/call=32/32
^CExiting on signal
Maximum IOPS=1.53M
Changes since v1:
https://lore.kernel.org/fio/20220824093109.308791-1-anuj20.g@samsung.com/T/#m17c6dcf1a3989a74865ceee0b87f470716aa5711
- Use shifts instead of expensive integer divisions (Jens)
- Move the expensive nvme_uring_cmd initialization into setup rather than keeping it in the fast path (Jens)
- Create separate init_io_pt and reap_events_uring_pt functions for initializing I/O and reaping completions on the passthru path (Jens)
- Make nsid and lba_shift part of struct file (Jens)
- Add a check for polled I/O in the passthru failure case (Kanchan)
- Check (bs % lbs) instead of (bs < lbs) (Kanchan)
- Fix the cqe index calculation for passthru I/O
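
For illustration, the first and sixth items above can be sketched as follows.
This is a minimal sketch and not the code from the patch: the helper names
(lba_shift_of, bs_is_valid, offset_to_slba, bs_to_nlb) are hypothetical, and
it assumes the logical block size is a power of two.

```c
#include <stdint.h>

/* Hypothetical helper: log2 of a power-of-two LBA size (512 -> 9). */
static unsigned int lba_shift_of(uint32_t lbs)
{
	unsigned int shift = 0;

	while (lbs > 1) {
		lbs >>= 1;
		shift++;
	}
	return shift;
}

/*
 * A block size is usable only if it is a whole number of logical blocks.
 * Checking (bs < lbs) alone would accept e.g. bs=4608 with lbs=4096.
 */
static int bs_is_valid(uint32_t bs, uint32_t lbs)
{
	return lbs != 0 && (bs % lbs) == 0;
}

/* Byte offset -> starting LBA, using a shift rather than a division. */
static uint64_t offset_to_slba(uint64_t offset, unsigned int shift)
{
	return offset >> shift;
}

/* Logical blocks per I/O, zero-based (0 means one block). */
static uint32_t bs_to_nlb(uint32_t bs, unsigned int shift)
{
	return (bs >> shift) - 1;
}
```

The shift is computed once per file at setup time, so the per-I/O path only
needs two shifts and a subtraction.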
Anuj Gupta (2):
t/io_uring: prep for including engines/nvme.h in t/io_uring
t/io_uring: add support for async-passthru
t/io_uring.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 239 insertions(+), 17 deletions(-)
--
2.25.1
* [PATCH v2 1/2] t/io_uring: prep for including engines/nvme.h in t/io_uring
From: Anuj Gupta @ 2022-08-26 11:33 UTC (permalink / raw)
To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta
Rename page_size and calc_clat_percentiles, since these names clash with
definitions pulled in indirectly through engines/nvme.h (fio.h and stat.h).
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
t/io_uring.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/t/io_uring.c b/t/io_uring.c
index 35bf1956..a42abd46 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -117,7 +117,7 @@ static struct submitter *submitter;
static volatile int finish;
static int stats_running;
static unsigned long max_iops;
-static long page_size;
+static long t_io_uring_page_size;
static int depth = DEPTH;
static int batch_submit = BATCH_SUBMIT;
@@ -195,9 +195,9 @@ static unsigned long plat_idx_to_val(unsigned int idx)
return cycles_to_nsec(base + ((k + 0.5) * (1 << error_bits)));
}
-unsigned int calc_clat_percentiles(unsigned long *io_u_plat, unsigned long nr,
- unsigned long **output,
- unsigned long *maxv, unsigned long *minv)
+unsigned int calculate_clat_percentiles(unsigned long *io_u_plat,
+ unsigned long nr, unsigned long **output,
+ unsigned long *maxv, unsigned long *minv)
{
unsigned long sum = 0;
unsigned int len = plist_len, i, j = 0;
@@ -251,7 +251,7 @@ static void show_clat_percentiles(unsigned long *io_u_plat, unsigned long nr,
bool is_last;
char fmt[32];
- len = calc_clat_percentiles(io_u_plat, nr, &ovals, &maxv, &minv);
+ len = calculate_clat_percentiles(io_u_plat, nr, &ovals, &maxv, &minv);
if (!len || !ovals)
goto out;
@@ -786,7 +786,7 @@ static void *allocate_mem(struct submitter *s, int size)
return numa_alloc_onnode(size, s->numa_node);
#endif
- if (posix_memalign(&buf, page_size, bs)) {
+ if (posix_memalign(&buf, t_io_uring_page_size, bs)) {
printf("failed alloc\n");
return NULL;
}
@@ -1543,9 +1543,9 @@ int main(int argc, char *argv[])
arm_sig_int();
- page_size = sysconf(_SC_PAGESIZE);
- if (page_size < 0)
- page_size = 4096;
+ t_io_uring_page_size = sysconf(_SC_PAGESIZE);
+ if (t_io_uring_page_size < 0)
+ t_io_uring_page_size = 4096;
for (j = 0; j < nthreads; j++) {
s = get_submitter(j);
--
2.25.1
* [PATCH v2 2/2] t/io_uring: add support for async-passthru
From: Anuj Gupta @ 2022-08-26 11:33 UTC (permalink / raw)
To: axboe, vincentfu; +Cc: joshi.k, ankit.kumar, fio, Anuj Gupta
Add support for async-passthru in t/io_uring. The user enables it by
specifying the -u1 option on the command line.
Example command line:
t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
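
As a rough illustration of what the passthru path puts into each command, the
sketch below builds an NVMe Read from a byte offset and block size. It is not
the patch code: struct pt_read_cmd is a hypothetical stand-in for the handful
of nvme_uring_cmd fields involved, and PT_NVME_CMD_READ, pt_fill_read and
big_sqe_slot are names invented for this example. It assumes bs is a multiple
of the logical block size.

```c
#include <stdint.h>

/* Hypothetical stand-in for the nvme_uring_cmd fields used here. */
struct pt_read_cmd {
	uint8_t  opcode;
	uint32_t nsid;
	uint64_t addr;
	uint32_t data_len;
	uint32_t cdw10;
	uint32_t cdw11;
	uint32_t cdw12;
};

#define PT_NVME_CMD_READ 0x02 /* NVM command set Read opcode */

static void pt_fill_read(struct pt_read_cmd *cmd, uint32_t nsid, void *buf,
			 uint64_t offset, uint32_t bs, unsigned int lba_shift)
{
	uint64_t slba = offset >> lba_shift;  /* starting LBA */
	uint32_t nlb = (bs >> lba_shift) - 1; /* zero-based block count */

	cmd->opcode = PT_NVME_CMD_READ;
	cmd->nsid = nsid;
	cmd->addr = (uint64_t)(uintptr_t)buf;
	cmd->data_len = bs;
	cmd->cdw10 = (uint32_t)(slba & 0xffffffff); /* SLBA, low 32 bits */
	cmd->cdw11 = (uint32_t)(slba >> 32);        /* SLBA, high 32 bits */
	cmd->cdw12 = nlb;
}

/*
 * With IORING_SETUP_SQE128 each entry is twice the size of a regular SQE,
 * so when the ring is indexed as an array of regular SQEs, ring slot i
 * starts at array element 2 * i.
 */
static unsigned int big_sqe_slot(unsigned int index)
{
	return index << 1;
}
```

The same doubling applies to completions when IORING_SETUP_CQE32 is set,
which is why the reap side also indexes with (index << 1).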
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
t/io_uring.c | 238 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 230 insertions(+), 8 deletions(-)
diff --git a/t/io_uring.c b/t/io_uring.c
index a42abd46..4e1d617f 100644
--- a/t/io_uring.c
+++ b/t/io_uring.c
@@ -35,6 +35,7 @@
#include "../lib/rand.h"
#include "../minmax.h"
#include "../os/linux/io_uring.h"
+#include "../engines/nvme.h"
struct io_sq_ring {
unsigned *head;
@@ -67,6 +68,8 @@ struct file {
unsigned long max_size;
unsigned long cur_off;
unsigned pending_ios;
+ unsigned int nsid; /* nsid field required for nvme-passthrough */
+ unsigned int lba_shift; /* lba_shift field required for nvme-passthrough */
int real_fd;
int fixed_fd;
int fileno;
@@ -139,6 +142,7 @@ static int random_io = 1; /* random or sequential IO */
static int register_ring = 1; /* register ring */
static int use_sync = 0; /* use preadv2 */
static int numa_placement = 0; /* set to node of device */
+static int pt = 0; /* passthrough I/O or not */
static unsigned long tsc_rate;
@@ -161,6 +165,54 @@ struct io_uring_map_buffers {
};
#endif
+static int nvme_identify(int fd, __u32 nsid, enum nvme_identify_cns cns,
+ enum nvme_csi csi, void *data)
+{
+ struct nvme_passthru_cmd cmd = {
+ .opcode = nvme_admin_identify,
+ .nsid = nsid,
+ .addr = (__u64)(uintptr_t)data,
+ .data_len = NVME_IDENTIFY_DATA_SIZE,
+ .cdw10 = cns,
+ .cdw11 = csi << NVME_IDENTIFY_CSI_SHIFT,
+ .timeout_ms = NVME_DEFAULT_IOCTL_TIMEOUT,
+ };
+
+ return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
+}
+
+static int nvme_get_info(int fd, __u32 *nsid, __u32 *lba_sz, __u64 *nlba)
+{
+ struct nvme_id_ns ns;
+ int namespace_id;
+ int err;
+
+ namespace_id = ioctl(fd, NVME_IOCTL_ID);
+ if (namespace_id < 0) {
+ fprintf(stderr, "error failed to fetch namespace-id\n");
+ close(fd);
+ return -errno;
+ }
+
+ /*
+ * Identify namespace to get namespace-id, namespace size in LBA's
+ * and LBA data size.
+ */
+ err = nvme_identify(fd, namespace_id, NVME_IDENTIFY_CNS_NS,
+ NVME_CSI_NVM, &ns);
+ if (err) {
+ fprintf(stderr, "error failed to fetch identify namespace\n");
+ close(fd);
+ return err;
+ }
+
+ *nsid = namespace_id;
+ *lba_sz = 1 << ns.lbaf[(ns.flbas & 0x0f)].ds;
+ *nlba = ns.nsze;
+
+ return 0;
+}
+
static unsigned long cycles_to_nsec(unsigned long cycles)
{
uint64_t val;
@@ -520,6 +572,65 @@ static void init_io(struct submitter *s, unsigned index)
sqe->user_data |= ((uint64_t)s->clock_index << 32);
}
+static void init_io_pt(struct submitter *s, unsigned index)
+{
+ struct io_uring_sqe *sqe = &s->sqes[index << 1];
+ unsigned long offset;
+ struct file *f;
+ struct nvme_uring_cmd *cmd;
+ unsigned long long slba;
+ unsigned long long nlb;
+ long r;
+
+ if (s->nr_files == 1) {
+ f = &s->files[0];
+ } else {
+ f = &s->files[s->cur_file];
+ if (f->pending_ios >= file_depth(s)) {
+ s->cur_file++;
+ if (s->cur_file == s->nr_files)
+ s->cur_file = 0;
+ f = &s->files[s->cur_file];
+ }
+ }
+ f->pending_ios++;
+
+ if (random_io) {
+ r = __rand64(&s->rand_state);
+ offset = (r % (f->max_blocks - 1)) * bs;
+ } else {
+ offset = f->cur_off;
+ f->cur_off += bs;
+ if (f->cur_off + bs > f->max_size)
+ f->cur_off = 0;
+ }
+
+ if (register_files) {
+ sqe->fd = f->fixed_fd;
+ sqe->flags = IOSQE_FIXED_FILE;
+ } else {
+ sqe->fd = f->real_fd;
+ sqe->flags = 0;
+ }
+ sqe->opcode = IORING_OP_URING_CMD;
+ sqe->user_data = (unsigned long) f->fileno;
+ if (stats)
+ sqe->user_data |= ((unsigned long)s->clock_index << 32);
+ sqe->cmd_op = NVME_URING_CMD_IO;
+ slba = offset >> f->lba_shift;
+ nlb = (bs >> f->lba_shift) - 1;
+ cmd = (struct nvme_uring_cmd *)&sqe->cmd;
+ /* cdw10 and cdw11 represent starting slba*/
+ cmd->cdw10 = slba & 0xffffffff;
+ cmd->cdw11 = slba >> 32;
+ /* cdw12 represent number of lba to be read*/
+ cmd->cdw12 = nlb;
+ cmd->addr = (unsigned long) s->iovecs[index].iov_base;
+ cmd->data_len = bs;
+ cmd->nsid = f->nsid;
+ cmd->opcode = 2;
+}
+
static int prep_more_ios_uring(struct submitter *s, int max_ios)
{
struct io_sq_ring *ring = &s->sq_ring;
@@ -532,7 +643,10 @@ static int prep_more_ios_uring(struct submitter *s, int max_ios)
break;
index = tail & sq_ring_mask;
- init_io(s, index);
+ if (pt)
+ init_io_pt(s, index);
+ else
+ init_io(s, index);
ring->array[index] = index;
prepped++;
tail = next_tail;
@@ -549,7 +663,29 @@ static int get_file_size(struct file *f)
if (fstat(f->real_fd, &st) < 0)
return -1;
- if (S_ISBLK(st.st_mode)) {
+ if (pt) {
+ __u64 nlba;
+ __u32 lbs;
+ int ret;
+
+ if (!S_ISCHR(st.st_mode)) {
+ fprintf(stderr, "passthrough works with only nvme-ns "
+ "generic devices (/dev/ngXnY)\n");
+ return -1;
+ }
+ ret = nvme_get_info(f->real_fd, &f->nsid, &lbs, &nlba);
+ if (ret)
+ return -1;
+ if ((bs % lbs) != 0) {
+ printf("error: bs:%d should be a multiple logical_block_size:%d\n",
+ bs, lbs);
+ return -1;
+ }
+ f->max_blocks = nlba / bs;
+ f->max_size = nlba;
+ f->lba_shift = ilog2(lbs);
+ return 0;
+ } else if (S_ISBLK(st.st_mode)) {
unsigned long long bytes;
if (ioctl(f->real_fd, BLKGETSIZE64, &bytes) != 0)
@@ -620,6 +756,60 @@ static int reap_events_uring(struct submitter *s)
return reaped;
}
+static int reap_events_uring_pt(struct submitter *s)
+{
+ struct io_cq_ring *ring = &s->cq_ring;
+ struct io_uring_cqe *cqe;
+ unsigned head, reaped = 0;
+ int last_idx = -1, stat_nr = 0;
+ unsigned index;
+ int fileno;
+
+ head = *ring->head;
+ do {
+ struct file *f;
+
+ read_barrier();
+ if (head == atomic_load_acquire(ring->tail))
+ break;
+ index = head & cq_ring_mask;
+ cqe = &ring->cqes[index << 1];
+ fileno = cqe->user_data & 0xffffffff;
+ f = &s->files[fileno];
+ f->pending_ios--;
+
+ if (cqe->res != 0) {
+ printf("io: unexpected ret=%d\n", cqe->res);
+ if (polled && cqe->res == -EINVAL)
+ printf("passthrough doesn't support polled IO\n");
+ return -1;
+ }
+ if (stats) {
+ int clock_index = cqe->user_data >> 32;
+
+ if (last_idx != clock_index) {
+ if (last_idx != -1) {
+ add_stat(s, last_idx, stat_nr);
+ stat_nr = 0;
+ }
+ last_idx = clock_index;
+ }
+ stat_nr++;
+ }
+ reaped++;
+ head++;
+ } while (1);
+
+ if (stat_nr)
+ add_stat(s, last_idx, stat_nr);
+
+ if (reaped) {
+ s->inflight -= reaped;
+ atomic_store_release(ring->head, head);
+ }
+ return reaped;
+}
+
static void set_affinity(struct submitter *s)
{
#ifdef CONFIG_LIBNUMA
@@ -697,6 +887,7 @@ static int setup_ring(struct submitter *s)
struct io_uring_params p;
int ret, fd;
void *ptr;
+ size_t len;
memset(&p, 0, sizeof(p));
@@ -709,6 +900,10 @@ static int setup_ring(struct submitter *s)
p.sq_thread_cpu = sq_thread_cpu;
}
}
+ if (pt) {
+ p.flags |= IORING_SETUP_SQE128;
+ p.flags |= IORING_SETUP_CQE32;
+ }
fd = io_uring_setup(depth, &p);
if (fd < 0) {
@@ -761,11 +956,22 @@ static int setup_ring(struct submitter *s)
sring->array = ptr + p.sq_off.array;
sq_ring_mask = *sring->ring_mask;
- s->sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
+ if (p.flags & IORING_SETUP_SQE128)
+ len = 2 * p.sq_entries * sizeof(struct io_uring_sqe);
+ else
+ len = p.sq_entries * sizeof(struct io_uring_sqe);
+ s->sqes = mmap(0, len,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
IORING_OFF_SQES);
- ptr = mmap(0, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
+ if (p.flags & IORING_SETUP_CQE32) {
+ len = p.cq_off.cqes +
+ 2 * p.cq_entries * sizeof(struct io_uring_cqe);
+ } else {
+ len = p.cq_off.cqes +
+ p.cq_entries * sizeof(struct io_uring_cqe);
+ }
+ ptr = mmap(0, len,
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd,
IORING_OFF_CQ_RING);
cring->head = ptr + p.cq_off.head;
@@ -856,7 +1062,16 @@ static int submitter_init(struct submitter *s)
s->plat = NULL;
nr_batch = 0;
}
+ /* perform the expensive command initialization part for passthrough here
+ * rather than in the fast path
+ */
+ if (pt) {
+ for (i = 0; i < roundup_pow2(depth); i++) {
+ struct io_uring_sqe *sqe = &s->sqes[i << 1];
+ memset(&sqe->cmd, 0, sizeof(struct nvme_uring_cmd));
+ }
+ }
return nr_batch;
}
@@ -1112,7 +1327,10 @@ submit:
do {
int r;
- r = reap_events_uring(s);
+ if (pt)
+ r = reap_events_uring_pt(s);
+ else
+ r = reap_events_uring(s);
if (r == -1) {
s->finish = 1;
break;
@@ -1306,11 +1524,12 @@ static void usage(char *argv, int status)
" -a <bool> : Use legacy aio, default %d\n"
" -S <bool> : Use sync IO (preadv2), default %d\n"
" -X <bool> : Use registered ring %d\n"
- " -P <bool> : Automatically place on device home node %d\n",
+ " -P <bool> : Automatically place on device home node %d\n"
+ " -u <bool> : Use nvme-passthrough I/O, default %d\n",
argv, DEPTH, BATCH_SUBMIT, BATCH_COMPLETE, BS, polled,
fixedbufs, dma_map, register_files, nthreads, !buffered, do_nop,
stats, runtime == 0 ? "unlimited" : runtime_str, random_io, aio,
- use_sync, register_ring, numa_placement);
+ use_sync, register_ring, numa_placement, pt);
exit(status);
}
@@ -1369,7 +1588,7 @@ int main(int argc, char *argv[])
if (!do_nop && argc < 2)
usage(argv[0], 1);
- while ((opt = getopt(argc, argv, "d:s:c:b:p:B:F:n:N:O:t:T:a:r:D:R:X:S:P:h?")) != -1) {
+ while ((opt = getopt(argc, argv, "d:s:c:b:p:B:F:n:N:O:t:T:a:r:D:R:X:S:P:u:h?")) != -1) {
switch (opt) {
case 'a':
aio = !!atoi(optarg);
@@ -1450,6 +1669,9 @@ int main(int argc, char *argv[])
case 'P':
numa_placement = !!atoi(optarg);
break;
+ case 'u':
+ pt = !!atoi(optarg);
+ break;
case 'h':
case '?':
default:
--
2.25.1
* Re: [PATCH v2 0/2] Add support for uring-passthrough in t/io_uring
From: Jens Axboe @ 2022-08-26 13:30 UTC (permalink / raw)
To: vincentfu, Anuj Gupta; +Cc: ankit.kumar, joshi.k, fio
On Fri, 26 Aug 2022 17:03:04 +0530, Anuj Gupta wrote:
> This series adds support for measuring peak performance of uring-passthrough path
> using t/io_uring utility. Added new -u1 option, that makes t/io_uring to do io using
> nvme passthrough commands.
>
> Uring-passthrough on nvme-generic device:
>
> root@test-MS-7C34:/home/test/upstream/github/fio# taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B0 -O0 -n1 -u1 /dev/ng0n1
> submitter=0, tid=5415, file=/dev/ng0n1, node=-1
> polled=0, fixedbufs=0/0, register_files=1, buffered=1, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=1.97M, BW=959MiB/s, IOS/call=32/31
> IOPS=1.96M, BW=956MiB/s, IOS/call=32/32
> IOPS=1.96M, BW=957MiB/s, IOS/call=32/32
> IOPS=1.96M, BW=956MiB/s, IOS/call=31/31
> ^CExiting on signal
> Maximum IOPS=1.97M
>
> [...]
Applied, thanks!
[1/2] t/io_uring: prep for including engines/nvme.h in t/io_uring
commit: c409e4c2a549ccc0334f2c084a76e80314d42c42
[2/2] t/io_uring: add support for async-passthru
commit: 7d04588a766308d5903f6cfe34ed72f6c7612d19
Best regards,
--
Jens Axboe