* [io_uring] Problems using io_uring engine
@ 2020-05-25 17:45 Hamilton Tobon Mosquera
  2020-05-25 19:21 ` Jens Axboe
  (1 reply; 9+ messages in thread)

From: Hamilton Tobon Mosquera
To: fio

Hi there,

I'm trying to run sequential and random reads/writes in parallel using
the io_uring engine with the HIPRI flag enabled to turn on polling. The
file size is 200G and the number of fio threads is 4. The --size flag is
set to 200G/4, and --offset_increment is set to 25%. This works with the
pvsync2 engine, but when I switch to io_uring the workloads fail
immediately (they do not even run for 1 second) with these errors:

fio: io_u error on file /path/to/file: Operation not supported: write offset=3238043648, buflen=4096
fio: io_u error on file /path/to/file: Operation not supported: write offset=88715874304, buflen=4096
fio: io_u error on file /path/to/file: Operation not supported: write offset=174739943424, buflen=4096
fio: io_u error on file /path/to/file: Operation not supported: write offset=112642154496, buflen=4096

The offsets are under 200G, so I don't understand why it's returning
those errors. Can someone help, please?

Thank you in advance.

Hamilton.
* Re: [io_uring] Problems using io_uring engine
  2020-05-25 19:21 ` Jens Axboe

From: Jens Axboe
To: Hamilton Tobon Mosquera, fio

On 5/25/20 11:45 AM, Hamilton Tobon Mosquera wrote:
> Hi there,
>
> I'm trying to run sequential and random reads/writes in parallel using
> the io_uring engine with the HIPRI flag enabled to turn on polling. The
> file size is 200G and the number of fio threads is 4. The --size flag
> is set to 200G/4, and --offset_increment is set to 25%. This works with
> the pvsync2 engine, but when I switch to io_uring the workloads fail
> immediately (they do not even run for 1 second) with these errors:
>
> fio: io_u error on file /path/to/file: Operation not supported: write offset=3238043648, buflen=4096
> [...]
>
> The offsets are under 200G, so I don't understand why it's returning
> those errors. Can someone help, please?

What file system are you using? I don't think it supports IO polling.
io_uring actually checks for this; with preadv you just get normal
schedule-based IO.

-- 
Jens Axboe
* Re: [io_uring] Problems using io_uring engine
  2020-05-25 22:38 ` Hamilton Tobon Mosquera

From: Hamilton Tobon Mosquera
To: Jens Axboe, fio

Thank you for your answer.

I'm using ext4. I guess it supports polling, because I could get sub-10
microsecond latencies with an Intel Optane SSDPED1D280GA 260GB and
pvsync2. If it helps, here's how I'm running it:

fio global.fio --size=50G --ioengine=io_uring --hipri --direct=1 \
    --rw=randwrite --iodepth=256 --bs=4K --numjobs=4 --offset_increment=25%

global.fio contains:

ioengine=io_uring
hipri
direct=1
thread=1
buffered=0
size=100%
randrepeat=0
time_based
ramp_time=0
norandommap
refill_buffers
log_max_value=1
log_avg_msec=1000
group_reporting
percentile_list=50:60:70:80:90:95:99

Your help is highly appreciated, thank you.

Hamilton.

On 25/05/20 3:21 p.m., Jens Axboe wrote:
> On 5/25/20 11:45 AM, Hamilton Tobon Mosquera wrote:
>> [...]
>
> What file system are you using? I don't think it supports IO polling.
> io_uring actually checks for this; with preadv you just get normal
> schedule-based IO.
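[Editor's note: an added illustration, not part of the original thread.] With these options, the region each of the four jobs covers can be computed directly, assuming fio's documented behavior that --offset_increment=25% starts job j at j x 25% of the file:

```shell
# Each of the 4 jobs writes within [j*25%, j*25% + 50G) of the 200 GiB file.
file_bytes=$((200 * 1024 * 1024 * 1024))   # 200 GiB
for j in 0 1 2 3; do
  start=$((file_bytes * j / 4))
  end=$((start + file_bytes / 4))
  echo "job $j: [$start, $end)"
done
```

Each of the four failing offsets (e.g. 174739943424) lands inside one of these ranges, so the errors are unlikely to be about the offsets themselves.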
* Re: [io_uring] Problems using io_uring engine
  2020-05-26  0:19 ` Jens Axboe

From: Jens Axboe
To: Hamilton Tobon Mosquera, fio

On 5/25/20 4:38 PM, Hamilton Tobon Mosquera wrote:
> Thank you for your answer.
>
> I'm using ext4. I guess it supports polling, because I could get sub-10
> microsecond latencies with an Intel Optane SSDPED1D280GA 260GB and
> pvsync2. If it helps, here's how I'm running it:
>
> fio global.fio --size=50G --ioengine=io_uring --hipri --direct=1 \
>     --rw=randwrite --iodepth=256 --bs=4K --numjobs=4 --offset_increment=25%
>
> [...]
>
> Your help is highly appreciated, thank you.

I almost guarantee you that you are NOT using polling. Check with
vmstat 1 and look at the interrupt rate. If it's about your IOPS rate,
then you're doing IRQ-based completions. If it's closer to 0, you're
doing polled IO.

If your fs supports it, then you likely did not allocate poll queues
for NVMe. If nvme is built into the kernel, use nvme.poll_queues=N on
the kernel command line to allocate N poll queues, or use poll_queues=N
as a module parameter if nvme is modular.

Ideally you want N to be equal to the number of CPUs in the system.
NVMe will report what it used at load time; here's an example from my
laptop:

[    2.396978] nvme nvme0: 1/8/8 default/read/poll queues

You can check this right now by looking at dmesg. If you don't have any
poll queues, preadv2 with IOCB_HIPRI will be IRQ-based, not polled.
io_uring just tells you this up front with -EOPNOTSUPP.

-- 
Jens Axboe
* Re: [io_uring] Problems using io_uring engine
  2020-05-26  4:17 ` Jens Axboe

From: Jens Axboe
To: Hamilton Tobon Mosquera, fio

On 5/25/20 4:38 PM, Hamilton Tobon Mosquera wrote:
> [...]

I almost guarantee you that you are NOT using polling. Check with
vmstat 1 and look at the interrupt rate. If it's about your IOPS rate,
then you're doing IRQ-based completions. If it's closer to 0, you're
doing polled IO.

If your fs supports it, then you likely did not allocate poll queues
for NVMe. If nvme is built into the kernel, use nvme.poll_queues=N on
the kernel command line to allocate N poll queues, or use poll_queues=N
as a module parameter if nvme is modular.

Ideally you want N to be equal to the number of CPUs in the system.
NVMe will report what it used at load time; here's an example from my
laptop:

[    2.396978] nvme nvme0: 1/8/8 default/read/poll queues

You can check this right now by looking at dmesg. If you don't have any
poll queues, preadv2 with IOCB_HIPRI will be IRQ-based, not polled.
io_uring just tells you this up front with -EOPNOTSUPP.

-- 
Jens Axboe
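[Editor's note: an added illustration, not part of the original thread.] The poll-queue count can be pulled out of that dmesg probe line mechanically. The captured line below is the example from Jens's laptop; on a live system one would pipe `dmesg` in instead:

```shell
# The probe line has the shape "nvme nvmeX: D/R/P default/read/poll queues";
# extract P, the number of poll queues.
line="nvme nvme0: 1/8/8 default/read/poll queues"
poll_queues=$(printf '%s\n' "$line" |
  sed -n 's|.* [0-9][0-9]*/[0-9][0-9]*/\([0-9][0-9]*\) default/read/poll queues.*|\1|p')
echo "poll queues: $poll_queues"
```

A result of 0 (or no output at all) means the driver allocated no poll queues and polled IO cannot work.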
* Re: [io_uring] Problems using io_uring engine
  2020-05-26 13:57 ` Hamilton Tobon Mosquera

From: Hamilton Tobon Mosquera
To: Jens Axboe, fio

Thank you for your answer.

This is how I'm making sure that it is polling. The workloads take 2
minutes; I check the interrupts registered in /proc/interrupts for the
nvme device (the Intel Optane) when the workload starts and when it
ends. The interrupt count is almost zero, about 25 or so, while with an
interrupt-based engine I get about 600K interrupts.

Also, this is how I'm loading the nvme driver:

modprobe nvme poll_queues=4

As you said, I'm using 4 polling queues because I only have 4 physical
cores. To check that they were actually created I use:

systool -vm nvme

which shows that there are indeed 4 polling queues. I also checked
/sys/block/nvme0n1/queue/io_poll and it is set to 1. Sometimes I change
/sys/block/nvme0n1/queue/io_poll_delay to switch between hybrid and
normal polling, and it shows differences in CPU usage, latencies,
IOPS, and so on.

Another check is CPU usage, which shows the CPU almost completely
occupied when polling.

I also tried dmesg as you suggested, and this is the output:

[627676.640431] nvme nvme0: 4/0/4 default/read/poll queues

I guess that shows that I was effectively using polling in the
workloads. What is weird is that when I don't use the HIPRI flag it
runs fine, but with interrupts, not polling. It might be important to
say that I'm always running as root.

Does this information give you more hints about the problem? Could you
please tell me in which filesystems polling is known to work 100% of
the time?

Thank you for your help.

Hamilton.

On 26/05/20 12:17 a.m., Jens Axboe wrote:
> [...]
>
> I almost guarantee you that you are NOT using polling. Check with
> vmstat 1 and look at the interrupt rate. If it's about your IOPS rate,
> then you're doing IRQ-based completions. If it's closer to 0, you're
> doing polled IO.
>
> If your fs supports it, then you likely did not allocate poll queues
> for NVMe. [...]
>
> -- 
> Jens Axboe
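[Editor's note: an added illustration, not part of the original thread.] The interrupt-count check described above can be scripted. This sketch parses a captured, hypothetical two-CPU excerpt of /proc/interrupts; on a real system the file would be snapshotted before and after the run and the two totals diffed:

```shell
# Sum the per-CPU interrupt counts on every nvme queue line.
# (Hypothetical excerpt; real files have one numeric column per CPU.)
snap=' 45:   123   456   IR-PCI-MSI  nvme0q1
 46:    10    20   IR-PCI-MSI  nvme0q2'
total=$(printf '%s\n' "$snap" |
  awk '/nvme/ { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i } END { print s }')
echo "nvme interrupts: $total"
```

A delta near zero over a run indicates polled completions; a delta on the order of the IOPS rate indicates IRQ-driven completions.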
* Re: [io_uring] Problems using io_uring engine
  2020-05-26 19:18 ` Jens Axboe

From: Jens Axboe
To: Hamilton Tobon Mosquera, fio

On 5/26/20 7:57 AM, Hamilton Tobon Mosquera wrote:
> Thank you for your answer.
>
> This is how I'm making sure that it is polling. The workloads take 2
> minutes; I check the interrupts registered in /proc/interrupts for the
> nvme device (the Intel Optane) when the workload starts and when it
> ends. The interrupt count is almost zero, about 25 or so, while with an
> interrupt-based engine I get about 600K interrupts.
>
> [...]
>
> Does this information give you more hints about the problem? Could you
> please tell me in which filesystems polling is known to work 100% of
> the time?

You did the right thing on the NVMe side, so I'm guessing it's ext4
again. What kernel are you using? I think only 5.7 and newer supports
polling on ext4; you'll have better luck with XFS.

And btw, please don't top-post. Reply with proper quoting; top posting
totally messes up the flow of conversation.

-- 
Jens Axboe
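[Editor's note: an added sketch, not part of the original thread.] The 5.7 cutoff mentioned above can be checked mechanically; this helper assumes plain "major.minor" version strings and ignores patch levels:

```shell
# Return success if kernel version $1 is at least $2 (both "major.minor").
kver_ge() {
  maj1=${1%%.*}; min1=${1#*.}
  maj2=${2%%.*}; min2=${2#*.}
  [ "$maj1" -gt "$maj2" ] ||
    { [ "$maj1" -eq "$maj2" ] && [ "$min1" -ge "$min2" ]; }
}

# Kernel 5.5, as later reported in this thread:
kver_ge 5.5 5.7 && echo "ext4 async polling: expected" \
                || echo "ext4 async polling: not expected (need 5.7+)"
```

On a live system the first argument would come from `uname -r` (trimmed to major.minor).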
* Re: [io_uring] Problems using io_uring engine
  2020-05-26 20:21 ` Hamilton Tobon Mosquera

From: Hamilton Tobon Mosquera
To: Jens Axboe, fio

On 26/05/20 3:18 p.m., Jens Axboe wrote:
> On 5/26/20 7:57 AM, Hamilton Tobon Mosquera wrote:
>> [...]
>
> You did the right thing on the NVMe side, so I'm guessing it's ext4
> again. What kernel are you using? I think only 5.7 and newer supports
> polling on ext4; you'll have better luck with XFS.
>
> And btw, please don't top-post. Reply with proper quoting; top posting
> totally messes up the flow of conversation.

Thank you for your answer.

It does seem that ext4 was the problem. I tried with XFS and it works,
which seems weird to me. Does this mean that pvsync2 wasn't polling at
all? I have kernel 5.5 on CentOS 8.

PS: Sorry for messing up the conversation flow, I didn't notice that
before.

Thanks for your time.

Hamilton.
* Re: [io_uring] Problems using io_uring engine
  2020-05-26 20:26 ` Jens Axboe

From: Jens Axboe
To: Hamilton Tobon Mosquera, fio

On 5/26/20 2:21 PM, Hamilton Tobon Mosquera wrote:
> [...]
>
> It does seem that ext4 was the problem. I tried with XFS and it works,
> which seems weird to me. Does this mean that pvsync2 wasn't polling at
> all? I have kernel 5.5 on CentOS 8.

There are basically two types of polling:

- sync polling, which is what was introduced with preadv2 and RWF_HIPRI
- async polling, which allows polling for an explicit IO

The former just polls the device for _any_ completion; the latter can
poll for a specific IO. The latter is what io_uring uses, as sync
polling doesn't really work that well for a single sync IO from a
single poll user. To support async polling, the fs needs to support it.
ext4 only recently got that support; I added XFS support when I wrote
the code originally.

-- 
Jens Axboe
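[Editor's note: an added sketch, not part of the original thread.] A hypothetical fio job file can make the two modes concrete on the same device: pvsync2 with hipri exercises sync polling (preadv2 with RWF_HIPRI), while io_uring with hipri exercises per-IO async polling, which is the mode that requires filesystem support. The filename and sizes below are placeholders:

```ini
; Hypothetical comparison jobs; adjust filename/size for your setup.
[global]
filename=/mnt/test/pollfile
size=1G
direct=1
bs=4k
rw=randread
runtime=30
time_based

[sync-poll]
ioengine=pvsync2
hipri

[async-poll]
stonewall
ioengine=io_uring
hipri
iodepth=32
```

Watching `vmstat 1` while each job runs, as suggested earlier in the thread, shows whether completions are actually polled or IRQ-driven.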
end of thread, other threads: [~2020-05-26 20:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-25 17:45 [io_uring] Problems using io_uring engine Hamilton Tobon Mosquera
2020-05-25 19:21 ` Jens Axboe
2020-05-25 22:38   ` Hamilton Tobon Mosquera
2020-05-26  0:19     ` Jens Axboe
2020-05-26  4:17     ` Jens Axboe
2020-05-26 13:57       ` Hamilton Tobon Mosquera
2020-05-26 19:18         ` Jens Axboe
2020-05-26 20:21           ` Hamilton Tobon Mosquera
2020-05-26 20:26             ` Jens Axboe