* Reordering of ublk IO requests @ 2022-11-16 15:00 Andreas Hindborg 2022-11-17 2:18 ` Ming Lei 0 siblings, 1 reply; 27+ messages in thread From: Andreas Hindborg @ 2022-11-16 15:00 UTC (permalink / raw) To: Ming Lei; +Cc: Jens Axboe, linux-block Hi Ming, I have a question regarding ublk. For context, I am working on adding zoned storage support to ublk, and you have been very kind to help me on Github [1]. I have a problem with ordering of requests after they are issued to the ublk driver. Basically ublk will reverse the ordering of requests that are batched. The behavior can be observed with the following fio workload: > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 > --size=4m --filename=/dev/ublkb0 For a loopback ublk target I get the following from blktrace: > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] And enabling debug prints in ublk shows that the reversal happens when ublk defers work to the io_uring context: > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 The problem seems to be that the method used to defer work to the io_uring context, task_work_add(), is using a stack to queue the callbacks. As you probably are aware, writes to zoned storage must happen at the write pointer, so when order is lost there is a problem. But I would assume that this behavior is also undesirable in other circumstances. I am not sure what is the best approach to change this behavor. Maybe having a separate queue for the requests and then only scheduling work if a worker is not already processing the queue? If you like, I can try to implement a fix? Best regards, Andreas Hindborg [1] https://github.com/ming1/ubdsrv/pull/28 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-16 15:00 Reordering of ublk IO requests Andreas Hindborg @ 2022-11-17 2:18 ` Ming Lei 2022-11-17 8:05 ` Andreas Hindborg 0 siblings, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-17 2:18 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Jens Axboe, linux-block, ming.lei Hi Andreas, On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: > > Hi Ming, > > I have a question regarding ublk. For context, I am working on adding > zoned storage support to ublk, and you have been very kind to help me on > Github [1]. > > I have a problem with ordering of requests after they are issued to the > ublk driver. Basically ublk will reverse the ordering of requests that are > batched. The behavior can be observed with the following fio workload: > > > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 > > --size=4m --filename=/dev/ublkb0 > > For a loopback ublk target I get the following from blktrace: > > > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > > And enabling debug prints in ublk shows that the reversal happens when > ublk defers work to the io_uring context: > > > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 > > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 > > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 > > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 > > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 > > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 > > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 > > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 > > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 > > The problem seems to be that the method used to defer work to the > io_uring context, task_work_add(), is using a stack to queue the > callbacks. Is the observation done on zoned ublk or plain ublk-loop? If it is plain ublk-loop, the reverse order can be started from blk_add_rq_to_plug(), but it won't happen for zoned write request which isn't queued to plug list. Otherwise, can you observe zoned write req reorder from ublksrv side? > > As you probably are aware, writes to zoned storage must > happen at the write pointer, so when order is lost there is a problem. > But I would assume that this behavior is also undesirable in other > circumstances. > > I am not sure what is the best approach to change this behavor. Maybe > having a separate queue for the requests and then only scheduling work > if a worker is not already processing the queue? > > If you like, I can try to implement a fix? Yeah, I think zoned write requests could be forwarded to ublk server out of order. Is it possible for re-order the writes in ublksrv side? I guess it is be doable: 1) zoned write request is sent to ublk_queue_rq() in order always 2) when ublksrv zone target/backend code gets zoned write request in each zone, it can wait until the next expected write comes, then handle the write and advance write pointer, then repeat the whole process. 3) because of 1), the next expected zoned write req is guaranteed to come without much delay, and the per-zone queue won't be long. Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 2:18 ` Ming Lei @ 2022-11-17 8:05 ` Andreas Hindborg 2022-11-17 8:52 ` Ming Lei 0 siblings, 1 reply; 27+ messages in thread From: Andreas Hindborg @ 2022-11-17 8:05 UTC (permalink / raw) To: Ming Lei; +Cc: Andreas Hindborg, Jens Axboe, linux-block Ming Lei <ming.lei@redhat.com> writes: > Hi Andreas, > > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >> >> Hi Ming, >> >> I have a question regarding ublk. For context, I am working on adding >> zoned storage support to ublk, and you have been very kind to help me on >> Github [1]. >> >> I have a problem with ordering of requests after they are issued to the >> ublk driver. Basically ublk will reverse the ordering of requests that are >> batched. The behavior can be observed with the following fio workload: >> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >> > --size=4m --filename=/dev/ublkb0 >> >> For a loopback ublk target I get the following from blktrace: >> >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >> >> And enabling debug prints in ublk shows that the reversal happens when >> ublk defers work to the io_uring context: >> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >> >> The problem seems to be that the method used to defer work to the >> io_uring context, task_work_add(), is using a stack to queue the >> callbacks. > > Is the observation done on zoned ublk or plain ublk-loop? I collected this trace on my fork, but since the behavior comes from task_work.c using a stack to queue work, it would be present on upstream ublk as well. For completeness, I have verified that this is the case. > If it is plain ublk-loop, the reverse order can be started from > blk_add_rq_to_plug(), but it won't happen for zoned write request > which isn't queued to plug list. I forgot to mention that I set the deadline scheduler for both ublkb0 and the loopback target. No reordering should happen in mq with the deadline scheduler, as far as I understand. > > Otherwise, can you observe zoned write req reorder from ublksrv side? > Yes, the reverse order is observable in ublk server. Reordering happens in ublk kernel driver. >> >> As you probably are aware, writes to zoned storage must >> happen at the write pointer, so when order is lost there is a problem. >> But I would assume that this behavior is also undesirable in other >> circumstances. >> >> I am not sure what is the best approach to change this behavor. Maybe >> having a separate queue for the requests and then only scheduling work >> if a worker is not already processing the queue? >> >> If you like, I can try to implement a fix? > > Yeah, I think zoned write requests could be forwarded to ublk server out of order. In reverse order for requests that are issued without a context switch, as far as I can tell. > > Is it possible for re-order the writes in ublksrv side? I guess it is > be doable: > > 1) zoned write request is sent to ublk_queue_rq() in order always > > 2) when ublksrv zone target/backend code gets zoned write request in each > zone, it can wait until the next expected write comes, then handle the > write and advance write pointer, then repeat the whole process. > > 3) because of 1), the next expected zoned write req is guaranteed to > come without much delay, and the per-zone queue won't be long. I think it is not feasible to have per zone data structures. Instead, I considered adding a sequence number to each request, and then queue up requests if there is a gap in the numbering. But really, the issue should be resolved by changing the way ublk_queue_rq() is sending work to uring context with task_add_work(). Fixing in ublksrv is a bit of a hack and will have performance penalty. Also, it can not be good for any storage device to have sequential requests delivered in reverse order? Fixing this would benefit all targets. Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 8:05 ` Andreas Hindborg @ 2022-11-17 8:52 ` Ming Lei 2022-11-17 9:07 ` Andreas Hindborg 0 siblings, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-17 8:52 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Jens Axboe, linux-block, ming.lei On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > Hi Andreas, > > > > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: > >> > >> Hi Ming, > >> > >> I have a question regarding ublk. For context, I am working on adding > >> zoned storage support to ublk, and you have been very kind to help me on > >> Github [1]. > >> > >> I have a problem with ordering of requests after they are issued to the > >> ublk driver. Basically ublk will reverse the ordering of requests that are > >> batched. The behavior can be observed with the following fio workload: > >> > >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 > >> > --size=4m --filename=/dev/ublkb0 > >> > >> For a loopback ublk target I get the following from blktrace: > >> > >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > >> > >> And enabling debug prints in ublk shows that the reversal happens when > >> ublk defers work to the io_uring context: > >> > >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 > >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 > >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 > >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 > >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 > >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 > >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 > >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 > >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 > >> > >> The problem seems to be that the method used to defer work to the > >> io_uring context, task_work_add(), is using a stack to queue the > >> callbacks. > > > > Is the observation done on zoned ublk or plain ublk-loop? > > I collected this trace on my fork, but since the behavior comes from > task_work.c using a stack to queue work, it would be present on upstream > ublk as well. For completeness, I have verified that this is the case. > > > If it is plain ublk-loop, the reverse order can be started from > > blk_add_rq_to_plug(), but it won't happen for zoned write request > > which isn't queued to plug list. > > I forgot to mention that I set the deadline scheduler for both ublkb0 > and the loopback target. No reordering should happen in mq with the > deadline scheduler, as far as I understand. I meant you need to setup one zoned ublk-loop for observing write request order, block layer never guarantees request order for other devices. > > > > > Otherwise, can you observe zoned write req reorder from ublksrv side? > > > > Yes, the reverse order is observable in ublk server. Reordering happens > in ublk kernel driver. > > >> > >> As you probably are aware, writes to zoned storage must > >> happen at the write pointer, so when order is lost there is a problem. > >> But I would assume that this behavior is also undesirable in other > >> circumstances. > >> > >> I am not sure what is the best approach to change this behavor. Maybe > >> having a separate queue for the requests and then only scheduling work > >> if a worker is not already processing the queue? > >> > >> If you like, I can try to implement a fix? > > > > Yeah, I think zoned write requests could be forwarded to ublk server out of order. > > In reverse order for requests that are issued without a context switch, > as far as I can tell. > > > > > Is it possible for re-order the writes in ublksrv side? I guess it is > > be doable: > > > > 1) zoned write request is sent to ublk_queue_rq() in order always > > > > 2) when ublksrv zone target/backend code gets zoned write request in each > > zone, it can wait until the next expected write comes, then handle the > > write and advance write pointer, then repeat the whole process. > > > > 3) because of 1), the next expected zoned write req is guaranteed to > > come without much delay, and the per-zone queue won't be long. > > I think it is not feasible to have per zone data structures. Instead, I If one mapping data structure is used for ordering per-zone write request, it could be pretty easy, such as C++'s map or hash table, and it won't take much memory given the max in-flight IOs are very limited and the zone mapping entry can be reused among different zone, and quite easy to implement. Also most of times, the per-zone ordering can be completed in current batch(requests completed from single io_uring_enter()), then the extra cost could be close to zero, you can simply run the per-zone ordering in ->handle_io_background() callback, when all requests could come for each zone most of times. > considered adding a sequence number to each request, and then queue up > requests if there is a gap in the numbering. IMO, that won't be doable, given you don't know when the sequence will be over. > > But really, the issue should be resolved by changing the way > ublk_queue_rq() is sending work to uring context with task_add_work(). Not sure it can be solved easily given llist is implemented in this way. > Fixing in ublksrv is a bit of a hack and will have performance penalty. As I mentioned, ordering zoned write request in each zone just takes a little memory(mapping, or hashing) and the max size of the mapping table is queue depth, and basically zero cpu cost, not see extra performance penalty is involved. > > Also, it can not be good for any storage device to have sequential > requests delivered in reverse order? Fixing this would benefit all targets. Only zoned target has strict ordering requirement which does take cost, block layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() can re-order requests in reverse order too. Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 8:52 ` Ming Lei @ 2022-11-17 9:07 ` Andreas Hindborg 2022-11-17 11:47 ` Ming Lei 2022-11-17 13:00 ` Damien Le Moal 0 siblings, 2 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-17 9:07 UTC (permalink / raw) To: Ming Lei; +Cc: Andreas Hindborg, Jens Axboe, linux-block Ming Lei <ming.lei@redhat.com> writes: > On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >> >> Ming Lei <ming.lei@redhat.com> writes: >> >> > Hi Andreas, >> > >> > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >> >> >> >> Hi Ming, >> >> >> >> I have a question regarding ublk. For context, I am working on adding >> >> zoned storage support to ublk, and you have been very kind to help me on >> >> Github [1]. >> >> >> >> I have a problem with ordering of requests after they are issued to the >> >> ublk driver. Basically ublk will reverse the ordering of requests that are >> >> batched. The behavior can be observed with the following fio workload: >> >> >> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >> >> > --size=4m --filename=/dev/ublkb0 >> >> >> >> For a loopback ublk target I get the following from blktrace: >> >> >> >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >> >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >> >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >> >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >> >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >> >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >> >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >> >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >> >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >> >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >> >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >> >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >> >> >> >> And enabling debug prints in ublk shows that the reversal happens when >> >> ublk defers work to the io_uring context: >> >> >> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >> >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >> >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >> >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >> >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >> >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >> >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >> >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >> >> >> >> The problem seems to be that the method used to defer work to the >> >> io_uring context, task_work_add(), is using a stack to queue the >> >> callbacks. >> > >> > Is the observation done on zoned ublk or plain ublk-loop? >> >> I collected this trace on my fork, but since the behavior comes from >> task_work.c using a stack to queue work, it would be present on upstream >> ublk as well. For completeness, I have verified that this is the case. >> >> > If it is plain ublk-loop, the reverse order can be started from >> > blk_add_rq_to_plug(), but it won't happen for zoned write request >> > which isn't queued to plug list. >> >> I forgot to mention that I set the deadline scheduler for both ublkb0 >> and the loopback target. No reordering should happen in mq with the >> deadline scheduler, as far as I understand. > > I meant you need to setup one zoned ublk-loop for observing write request > order, block layer never guarantees request order for other devices. > >> >> > >> > Otherwise, can you observe zoned write req reorder from ublksrv side? >> > >> >> Yes, the reverse order is observable in ublk server. Reordering happens >> in ublk kernel driver. >> >> >> >> >> As you probably are aware, writes to zoned storage must >> >> happen at the write pointer, so when order is lost there is a problem. >> >> But I would assume that this behavior is also undesirable in other >> >> circumstances. >> >> >> >> I am not sure what is the best approach to change this behavor. Maybe >> >> having a separate queue for the requests and then only scheduling work >> >> if a worker is not already processing the queue? >> >> >> >> If you like, I can try to implement a fix? >> > >> > Yeah, I think zoned write requests could be forwarded to ublk server out of order. >> >> In reverse order for requests that are issued without a context switch, >> as far as I can tell. >> >> > >> > Is it possible for re-order the writes in ublksrv side? I guess it is >> > be doable: >> > >> > 1) zoned write request is sent to ublk_queue_rq() in order always >> > >> > 2) when ublksrv zone target/backend code gets zoned write request in each >> > zone, it can wait until the next expected write comes, then handle the >> > write and advance write pointer, then repeat the whole process. >> > >> > 3) because of 1), the next expected zoned write req is guaranteed to >> > come without much delay, and the per-zone queue won't be long. >> >> I think it is not feasible to have per zone data structures. Instead, I > > If one mapping data structure is used for ordering per-zone write > request, it could be pretty easy, such as C++'s map or hash table, and it > won't take much memory given the max in-flight IOs are very limited and > the zone mapping entry can be reused among different zone, and quite easy > to implement. > > Also most of times, the per-zone ordering can be completed in current > batch(requests completed from single io_uring_enter()), then the extra > cost could be close to zero, you can simply run the per-zone ordering in > ->handle_io_background() callback, when all requests could come for each > zone most of times. > >> considered adding a sequence number to each request, and then queue up >> requests if there is a gap in the numbering. > > IMO, that won't be doable, given you don't know when the sequence will be over. We would not need to know when the sequence is over. If in ubdsrv we see request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up we go ahead and submit all requests from 3 and up, until there is another gap. If ublk_drv is setting the sequence numbers, cancelled/aborted requests would not be an issue. I think this would be less overhead than having per zone data structure. But I still think it is an unnecessary hack. We could just fix the driver. > >> >> But really, the issue should be resolved by changing the way >> ublk_queue_rq() is sending work to uring context with task_add_work(). > > Not sure it can be solved easily given llist is implemented in this way. If we queue requests on a separate queue, we are fine. I can make a patch suggestion. > >> Fixing in ublksrv is a bit of a hack and will have performance penalty. > > As I mentioned, ordering zoned write request in each zone just takes > a little memory(mapping, or hashing) and the max size of the mapping > table is queue depth, and basically zero cpu cost, not see extra > performance penalty is involved. I could implement all three solutions. 1) fix the dirver, 2) per zone structure and 3) sequence numbers. Then I benchmark them and we will know what works. It's a lot of work though. > >> >> Also, it can not be good for any storage device to have sequential >> requests delivered in reverse order? Fixing this would benefit all targets. > > Only zoned target has strict ordering requirement which does take cost, block > layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() > can re-order requests in reverse order too. When mq-deadline scheduler is set for a queue, requests are not reordered, right? I am no block layer expert, so I cannot argue about the implementation. But I think that both spinning rust and flash would benefit from having sequential requests delivered in order? Would it not hurt performance to reverse order for sequential requests all the time? I have a hard time understanding why the block layer would do this by default. One thing is to offer no guarantees, but to _always_ reverse the ordering of sequential requests seem a little counter productive to me. Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 9:07 ` Andreas Hindborg @ 2022-11-17 11:47 ` Ming Lei 2022-11-17 11:59 ` Andreas Hindborg 2022-11-17 13:00 ` Damien Le Moal 1 sibling, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-17 11:47 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Jens Axboe, linux-block, ming.lei On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: > >> > >> Ming Lei <ming.lei@redhat.com> writes: > >> > >> > Hi Andreas, > >> > > >> > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: > >> >> > >> >> Hi Ming, > >> >> > >> >> I have a question regarding ublk. For context, I am working on adding > >> >> zoned storage support to ublk, and you have been very kind to help me on > >> >> Github [1]. > >> >> > >> >> I have a problem with ordering of requests after they are issued to the > >> >> ublk driver. Basically ublk will reverse the ordering of requests that are > >> >> batched. The behavior can be observed with the following fio workload: > >> >> > >> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 > >> >> > --size=4m --filename=/dev/ublkb0 > >> >> > >> >> For a loopback ublk target I get the following from blktrace: > >> >> > >> >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > >> >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > >> >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > >> >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > >> >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > >> >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > >> >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > >> >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > >> >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > >> >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > >> >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > >> >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > >> >> > >> >> And enabling debug prints in ublk shows that the reversal happens when > >> >> ublk defers work to the io_uring context: > >> >> > >> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 > >> >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 > >> >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 > >> >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 > >> >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 > >> >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 > >> >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 > >> >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 > >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 > >> >> > >> >> The problem seems to be that the method used to defer work to the > >> >> io_uring context, task_work_add(), is using a stack to queue the > >> >> callbacks. > >> > > >> > Is the observation done on zoned ublk or plain ublk-loop? > >> > >> I collected this trace on my fork, but since the behavior comes from > >> task_work.c using a stack to queue work, it would be present on upstream > >> ublk as well. For completeness, I have verified that this is the case. > >> > >> > If it is plain ublk-loop, the reverse order can be started from > >> > blk_add_rq_to_plug(), but it won't happen for zoned write request > >> > which isn't queued to plug list. > >> > >> I forgot to mention that I set the deadline scheduler for both ublkb0 > >> and the loopback target. No reordering should happen in mq with the > >> deadline scheduler, as far as I understand. > > > > I meant you need to setup one zoned ublk-loop for observing write request > > order, block layer never guarantees request order for other devices. > > > >> > >> > > >> > Otherwise, can you observe zoned write req reorder from ublksrv side? > >> > > >> > >> Yes, the reverse order is observable in ublk server. Reordering happens > >> in ublk kernel driver. > >> > >> >> > >> >> As you probably are aware, writes to zoned storage must > >> >> happen at the write pointer, so when order is lost there is a problem. > >> >> But I would assume that this behavior is also undesirable in other > >> >> circumstances. > >> >> > >> >> I am not sure what is the best approach to change this behavor. Maybe > >> >> having a separate queue for the requests and then only scheduling work > >> >> if a worker is not already processing the queue? > >> >> > >> >> If you like, I can try to implement a fix? > >> > > >> > Yeah, I think zoned write requests could be forwarded to ublk server out of order. > >> > >> In reverse order for requests that are issued without a context switch, > >> as far as I can tell. > >> > >> > > >> > Is it possible for re-order the writes in ublksrv side? I guess it is > >> > be doable: > >> > > >> > 1) zoned write request is sent to ublk_queue_rq() in order always > >> > > >> > 2) when ublksrv zone target/backend code gets zoned write request in each > >> > zone, it can wait until the next expected write comes, then handle the > >> > write and advance write pointer, then repeat the whole process. > >> > > >> > 3) because of 1), the next expected zoned write req is guaranteed to > >> > come without much delay, and the per-zone queue won't be long. > >> > >> I think it is not feasible to have per zone data structures. Instead, I > > > > If one mapping data structure is used for ordering per-zone write > > request, it could be pretty easy, such as C++'s map or hash table, and it > > won't take much memory given the max in-flight IOs are very limited and > > the zone mapping entry can be reused among different zone, and quite easy > > to implement. > > > > Also most of times, the per-zone ordering can be completed in current > > batch(requests completed from single io_uring_enter()), then the extra > > cost could be close to zero, you can simply run the per-zone ordering in > > ->handle_io_background() callback, when all requests could come for each > > zone most of times. > > > >> considered adding a sequence number to each request, and then queue up > >> requests if there is a gap in the numbering. > > > > IMO, that won't be doable, given you don't know when the sequence will be over. > > We would not need to know when the sequence is over. If in ubdsrv we see > request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up > we go ahead and submit all requests from 3 and up, until there is > another gap. If ublk_drv is setting the sequence numbers, > cancelled/aborted requests would not be an issue. > > I think this would be less overhead than having per zone data structure. You can only assign it to zoned write request, but you still have to check the sequence inside each zone, right? Then why not just check LBAs in each zone simply? > > But I still think it is an unnecessary hack. We could just fix the driver. But not sure if it is easy to make io_uring_cmd_complete_in_task() or task_work_add to complete request in order efficiently. > > > > >> > >> But really, the issue should be resolved by changing the way > >> ublk_queue_rq() is sending work to uring context with task_add_work(). > > > > Not sure it can be solved easily given llist is implemented in this way. > > If we queue requests on a separate queue, we are fine. I can make a > patch suggestion. The problem is that io_uring_cmd_complete_in_task() or task_work_add() may re-order the request, can you explain why the issue can be solved by separate queue? > > > > >> Fixing in ublksrv is a bit of a hack and will have performance penalty. > > > > As I mentioned, ordering zoned write request in each zone just takes > > a little memory(mapping, or hashing) and the max size of the mapping > > table is queue depth, and basically zero cpu cost, not see extra > > performance penalty is involved. > > I could implement all three solutions. 1) fix the dirver, 2) per zone > structure and 3) sequence numbers. Then I benchmark them and we will > know what works. It's a lot of work though. Let's prove if the theory for each solution is correct first. > > > > >> > >> Also, it can not be good for any storage device to have sequential > >> requests delivered in reverse order? Fixing this would benefit all targets. > > > > Only zoned target has strict ordering requirement which does take cost, block > > layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() > > can re-order requests in reverse order too. > > When mq-deadline scheduler is set for a queue, requests are not > reordered, right? In case of batching submission, mq-deadline will order requests by LBA in this whole batch. > > I am no block layer expert, so I cannot argue about the implementation. > But I think that both spinning rust and flash would benefit from having > sequential requests delivered in order? Would it not hurt performance to > reverse order for sequential requests all the time? I have a hard time > understanding why the block layer would do this by default. > > One thing is to offer no guarantees, but to _always_ reverse the > ordering of sequential requests seem a little counter productive to me. If io_uring_cmd_complete_in_task()/task_work_add() can complete io commands to ublksrv in order without extra cost, that is better. But I don't think it is a big deal for HDD. because when these requests are re-issued from ublksrv to block layer, deadline or bfq will order them too since all are submitted via io_uring in batch. Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 11:47 ` Ming Lei @ 2022-11-17 11:59 ` Andreas Hindborg 2022-11-17 13:11 ` Damien Le Moal 2022-11-18 4:12 ` Ming Lei 0 siblings, 2 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-17 11:59 UTC (permalink / raw) To: Ming Lei; +Cc: Andreas Hindborg, Jens Axboe, linux-block Ming Lei <ming.lei@redhat.com> writes: > On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: >> >> Ming Lei <ming.lei@redhat.com> writes: >> >> > On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >> >> >> >> Ming Lei <ming.lei@redhat.com> writes: >> >> >> >> > Hi Andreas, >> >> > >> >> > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >> >> >> >> >> >> Hi Ming, >> >> >> >> >> >> I have a question regarding ublk. For context, I am working on adding >> >> >> zoned storage support to ublk, and you have been very kind to help me on >> >> >> Github [1]. >> >> >> >> >> >> I have a problem with ordering of requests after they are issued to the >> >> >> ublk driver. Basically ublk will reverse the ordering of requests that are >> >> >> batched. The behavior can be observed with the following fio workload: >> >> >> >> >> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >> >> >> > --size=4m --filename=/dev/ublkb0 >> >> >> >> >> >> For a loopback ublk target I get the following from blktrace: >> >> >> >> >> >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >> >> >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >> >> >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >> >> >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >> >> >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >> >> >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >> >> >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >> >> >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >> >> >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >> >> >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >> >> >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >> >> >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >> >> >> >> >> >> And enabling debug prints in ublk shows that the reversal happens when >> >> >> ublk defers work to the io_uring context: >> >> >> >> >> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >> >> >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >> >> >> >> >> >> The problem seems to be that the method used to defer work to the >> >> >> io_uring context, task_work_add(), is using a stack to queue the >> >> >> callbacks. >> >> > >> >> > Is the observation done on zoned ublk or plain ublk-loop? >> >> >> >> I collected this trace on my fork, but since the behavior comes from >> >> task_work.c using a stack to queue work, it would be present on upstream >> >> ublk as well. For completeness, I have verified that this is the case. >> >> >> >> > If it is plain ublk-loop, the reverse order can be started from >> >> > blk_add_rq_to_plug(), but it won't happen for zoned write request >> >> > which isn't queued to plug list. >> >> >> >> I forgot to mention that I set the deadline scheduler for both ublkb0 >> >> and the loopback target. No reordering should happen in mq with the >> >> deadline scheduler, as far as I understand. >> > >> > I meant you need to setup one zoned ublk-loop for observing write request >> > order, block layer never guarantees request order for other devices. >> > >> >> >> >> > >> >> > Otherwise, can you observe zoned write req reorder from ublksrv side? >> >> > >> >> >> >> Yes, the reverse order is observable in ublk server. Reordering happens >> >> in ublk kernel driver. >> >> >> >> >> >> >> >> As you probably are aware, writes to zoned storage must >> >> >> happen at the write pointer, so when order is lost there is a problem. >> >> >> But I would assume that this behavior is also undesirable in other >> >> >> circumstances. >> >> >> >> >> >> I am not sure what is the best approach to change this behavor. Maybe >> >> >> having a separate queue for the requests and then only scheduling work >> >> >> if a worker is not already processing the queue? >> >> >> >> >> >> If you like, I can try to implement a fix? >> >> > >> >> > Yeah, I think zoned write requests could be forwarded to ublk server out of order. >> >> >> >> In reverse order for requests that are issued without a context switch, >> >> as far as I can tell. >> >> >> >> > >> >> > Is it possible for re-order the writes in ublksrv side? I guess it is >> >> > be doable: >> >> > >> >> > 1) zoned write request is sent to ublk_queue_rq() in order always >> >> > >> >> > 2) when ublksrv zone target/backend code gets zoned write request in each >> >> > zone, it can wait until the next expected write comes, then handle the >> >> > write and advance write pointer, then repeat the whole process. >> >> > >> >> > 3) because of 1), the next expected zoned write req is guaranteed to >> >> > come without much delay, and the per-zone queue won't be long. >> >> >> >> I think it is not feasible to have per zone data structures. Instead, I >> > >> > If one mapping data structure is used for ordering per-zone write >> > request, it could be pretty easy, such as C++'s map or hash table, and it >> > won't take much memory given the max in-flight IOs are very limited and >> > the zone mapping entry can be reused among different zone, and quite easy >> > to implement. >> > >> > Also most of times, the per-zone ordering can be completed in current >> > batch(requests completed from single io_uring_enter()), then the extra >> > cost could be close to zero, you can simply run the per-zone ordering in >> > ->handle_io_background() callback, when all requests could come for each >> > zone most of times. >> > >> >> considered adding a sequence number to each request, and then queue up >> >> requests if there is a gap in the numbering. >> > >> > IMO, that won't be doable, given you don't know when the sequence will be over. >> >> We would not need to know when the sequence is over. If in ubdsrv we see >> request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up >> we go ahead and submit all requests from 3 and up, until there is >> another gap. If ublk_drv is setting the sequence numbers, >> cancelled/aborted requests would not be an issue. >> >> I think this would be less overhead than having per zone data structure. > > You can only assign it to zoned write request, but you still have to check > the sequence inside each zone, right? Then why not just check LBAs in > each zone simply? We would need to know the zone map, which is not otherwise required. Then we would need to track the write pointer for each open zone for each queue, so that we can stall writes that are not issued at the write pointer. This is in effect all zones, because we cannot track when zones are implicitly closed. Then, if different queues are issuing writes to the same zone, we need to sync across queues. Userspace may have synchronization in place to issue writes with multiple threads while still hitting the write pointer. It is not good enough to only impose ordering within a zone if we have requests in flight for that zone. For the first write to a zone when there are no writes in flight to that zone, we can not know if the write is at the write pointer, or if more writes to lower LBA is on the way. Not without tracking the write pointer. With a sequence number, the sequence number can be queue local. It would not guarantee that writes always happen at the write pointer, but it would guarantee that requests are not reordered by ublk_drv, which is all that is required. As long as userspace is issuing at the write pointer (as they are required to for zoned storage), this solution would work, even across multiple queues issuing writes to the same zone. > >> >> But I still think it is an unnecessary hack. We could just fix the driver. > > But not sure if it is easy to make io_uring_cmd_complete_in_task() or > task_work_add to complete request in order efficiently. > >> >> > >> >> >> >> But really, the issue should be resolved by changing the way >> >> ublk_queue_rq() is sending work to uring context with task_add_work(). >> > >> > Not sure it can be solved easily given llist is implemented in this way. >> >> If we queue requests on a separate queue, we are fine. I can make a >> patch suggestion. > > The problem is that io_uring_cmd_complete_in_task() or task_work_add() may > re-order the request, can you explain why the issue can be solved by > separate queue? I would suggest to still use task_work_add() to queue a callback, but instead of carrying the pdu as the callback argument, use it as a notification that one or more work items are queued. Then we add the pdu to a hwctx local FIFO queue. __ublk_rq_task_work() would then check this FIFO queue and submit all the requests on the queue to userspace. Without further optimization __ublk_rq_task_work() would some times be called with an empty queue, but this should not be too much overhead. Maybe we could decide to not call task_work_add() if the hwctx local queue of pdus is not empty. > >> >> > >> >> Fixing in ublksrv is a bit of a hack and will have performance penalty. >> > >> > As I mentioned, ordering zoned write request in each zone just takes >> > a little memory(mapping, or hashing) and the max size of the mapping >> > table is queue depth, and basically zero cpu cost, not see extra >> > performance penalty is involved. >> >> I could implement all three solutions. 1) fix the dirver, 2) per zone >> structure and 3) sequence numbers. Then I benchmark them and we will >> know what works. It's a lot of work though. > > Let's prove if the theory for each solution is correct first. Alright. > >> >> > >> >> >> >> Also, it can not be good for any storage device to have sequential >> >> requests delivered in reverse order? Fixing this would benefit all targets. >> > >> > Only zoned target has strict ordering requirement which does take cost, block >> > layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() >> > can re-order requests in reverse order too. >> >> When mq-deadline scheduler is set for a queue, requests are not >> reordered, right? > > In case of batching submission, mq-deadline will order requests by > LBA in this whole batch. That is not what I am seeing in practice. I tried looking through mq-deadline.c, but I could not find code that would order by LBA. Could you point me to the code that implements this policy? > >> >> I am no block layer expert, so I cannot argue about the implementation. >> But I think that both spinning rust and flash would benefit from having >> sequential requests delivered in order? Would it not hurt performance to >> reverse order for sequential requests all the time? I have a hard time >> understanding why the block layer would do this by default. >> >> One thing is to offer no guarantees, but to _always_ reverse the >> ordering of sequential requests seem a little counter productive to me. > > If io_uring_cmd_complete_in_task()/task_work_add() can complete io > commands to ublksrv in order without extra cost, that is better. I agree :) > > But I don't think it is a big deal for HDD. because when these requests > are re-issued from ublksrv to block layer, deadline or bfq will order > them too since all are submitted via io_uring in batch. As I wrote, that is not what I am seeing in my experiment. Repeating the output of blktrace from the top of this email: 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] Here the read of 6144: 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] is clearly issued before the read of 4096: 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] It is not because there are no IO in flight for the target device. Here is he trace with completions included: 259,2 10 1 0.000000000 468 D R 0 + 1024 [fio] 259,2 10 2 0.000005680 468 D R 1024 + 1024 [fio] 259,2 10 3 0.000007760 468 D R 2048 + 1024 [fio] 259,2 10 4 0.000009140 468 D R 3072 + 1024 [fio] 259,2 10 5 0.000010420 468 D R 4096 + 1024 [fio] 259,2 10 6 0.000011640 468 D R 5120 + 1024 [fio] 259,2 10 7 0.000013350 468 D R 6144 + 1024 [fio] 259,2 10 8 0.000014350 468 D R 7168 + 1024 [fio] 259,1 10 1 0.000280540 412 D R 6144 + 2048 [ublk] 259,1 10 2 0.000433260 412 D R 4096 + 2048 [ublk] 259,1 10 3 0.000603920 412 D R 2048 + 2048 [ublk] 259,1 10 4 0.000698070 412 D R 0 + 2048 [ublk] 259,1 10 5 0.000839250 0 C R 6144 + 2048 [0] 259,2 10 9 0.000891270 412 C R 7168 + 1024 [0] 259,2 10 10 0.000919780 412 C R 6144 + 1024 [0] 259,1 10 6 0.001258791 0 C R 4096 + 2048 [0] 259,2 10 11 0.001306541 412 C R 5120 + 1024 [0] 259,2 10 12 0.001335291 412 C R 4096 + 1024 [0] 259,1 10 7 0.001469911 0 C R 2048 + 2048 [0] 259,2 10 13 0.001518281 412 C R 3072 + 1024 [0] 259,2 10 14 0.001547041 412 C R 2048 + 1024 [0] 259,1 10 8 0.001600801 0 C R 0 + 2048 [0] 259,2 10 15 0.001641871 412 C R 1024 + 1024 [0] 259,2 10 16 0.001671921 412 C R 0 + 1024 [0] This last trace is vanilla Linux 6.0 with mq-deadline enabled for ublkb0(259,2) and the loopback target nvme0n1(259,1). Maybe I am missing some configuration? Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 11:59 ` Andreas Hindborg @ 2022-11-17 13:11 ` Damien Le Moal 2022-11-17 13:31 ` Andreas Hindborg 2022-11-18 4:12 ` Ming Lei 1 sibling, 1 reply; 27+ messages in thread From: Damien Le Moal @ 2022-11-17 13:11 UTC (permalink / raw) To: Andreas Hindborg, Ming Lei; +Cc: Jens Axboe, linux-block On 11/17/22 20:59, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > >> On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: >>> >>> Ming Lei <ming.lei@redhat.com> writes: >>> >>>> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >>>>> >>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>> >>>>>> Hi Andreas, >>>>>> >>>>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >>>>>>> >>>>>>> Hi Ming, >>>>>>> >>>>>>> I have a question regarding ublk. For context, I am working on adding >>>>>>> zoned storage support to ublk, and you have been very kind to help me on >>>>>>> Github [1]. >>>>>>> >>>>>>> I have a problem with ordering of requests after they are issued to the >>>>>>> ublk driver. Basically ublk will reverse the ordering of requests that are >>>>>>> batched. The behavior can be observed with the following fio workload: >>>>>>> >>>>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >>>>>>>> --size=4m --filename=/dev/ublkb0 >>>>>>> >>>>>>> For a loopback ublk target I get the following from blktrace: >>>>>>> >>>>>>>> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >>>>>>>> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >>>>>>>> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >>>>>>>> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >>>>>>>> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >>>>>>>> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >>>>>>>> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >>>>>>>> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >>>>>>>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>>>>>>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>>>>>>> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >>>>>>>> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >>>>>>> >>>>>>> And enabling debug prints in ublk shows that the reversal happens when >>>>>>> ublk defers work to the io_uring context: >>>>>>> >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >>>>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >>>>>>> >>>>>>> The problem seems to be that the method used to defer work to the >>>>>>> io_uring context, task_work_add(), is using a stack to queue the >>>>>>> callbacks. >>>>>> >>>>>> Is the observation done on zoned ublk or plain ublk-loop? >>>>> >>>>> I collected this trace on my fork, but since the behavior comes from >>>>> task_work.c using a stack to queue work, it would be present on upstream >>>>> ublk as well. For completeness, I have verified that this is the case. >>>>> >>>>>> If it is plain ublk-loop, the reverse order can be started from >>>>>> blk_add_rq_to_plug(), but it won't happen for zoned write request >>>>>> which isn't queued to plug list. >>>>> >>>>> I forgot to mention that I set the deadline scheduler for both ublkb0 >>>>> and the loopback target. No reordering should happen in mq with the >>>>> deadline scheduler, as far as I understand. >>>> >>>> I meant you need to setup one zoned ublk-loop for observing write request >>>> order, block layer never guarantees request order for other devices. >>>> >>>>> >>>>>> >>>>>> Otherwise, can you observe zoned write req reorder from ublksrv side? >>>>>> >>>>> >>>>> Yes, the reverse order is observable in ublk server. Reordering happens >>>>> in ublk kernel driver. >>>>> >>>>>>> >>>>>>> As you probably are aware, writes to zoned storage must >>>>>>> happen at the write pointer, so when order is lost there is a problem. >>>>>>> But I would assume that this behavior is also undesirable in other >>>>>>> circumstances. >>>>>>> >>>>>>> I am not sure what is the best approach to change this behavor. Maybe >>>>>>> having a separate queue for the requests and then only scheduling work >>>>>>> if a worker is not already processing the queue? >>>>>>> >>>>>>> If you like, I can try to implement a fix? >>>>>> >>>>>> Yeah, I think zoned write requests could be forwarded to ublk server out of order. >>>>> >>>>> In reverse order for requests that are issued without a context switch, >>>>> as far as I can tell. >>>>> >>>>>> >>>>>> Is it possible for re-order the writes in ublksrv side? I guess it is >>>>>> be doable: >>>>>> >>>>>> 1) zoned write request is sent to ublk_queue_rq() in order always >>>>>> >>>>>> 2) when ublksrv zone target/backend code gets zoned write request in each >>>>>> zone, it can wait until the next expected write comes, then handle the >>>>>> write and advance write pointer, then repeat the whole process. >>>>>> >>>>>> 3) because of 1), the next expected zoned write req is guaranteed to >>>>>> come without much delay, and the per-zone queue won't be long. >>>>> >>>>> I think it is not feasible to have per zone data structures. Instead, I >>>> >>>> If one mapping data structure is used for ordering per-zone write >>>> request, it could be pretty easy, such as C++'s map or hash table, and it >>>> won't take much memory given the max in-flight IOs are very limited and >>>> the zone mapping entry can be reused among different zone, and quite easy >>>> to implement. >>>> >>>> Also most of times, the per-zone ordering can be completed in current >>>> batch(requests completed from single io_uring_enter()), then the extra >>>> cost could be close to zero, you can simply run the per-zone ordering in >>>> ->handle_io_background() callback, when all requests could come for each >>>> zone most of times. >>>> >>>>> considered adding a sequence number to each request, and then queue up >>>>> requests if there is a gap in the numbering. >>>> >>>> IMO, that won't be doable, given you don't know when the sequence will be over. >>> >>> We would not need to know when the sequence is over. If in ubdsrv we see >>> request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up >>> we go ahead and submit all requests from 3 and up, until there is >>> another gap. If ublk_drv is setting the sequence numbers, >>> cancelled/aborted requests would not be an issue. >>> >>> I think this would be less overhead than having per zone data structure. >> >> You can only assign it to zoned write request, but you still have to check >> the sequence inside each zone, right? Then why not just check LBAs in >> each zone simply? > > We would need to know the zone map, which is not otherwise required. > Then we would need to track the write pointer for each open zone for > each queue, so that we can stall writes that are not issued at the write > pointer. This is in effect all zones, because we cannot track when zones > are implicitly closed. Then, if different queues are issuing writes to > the same zone, we need to sync across queues. Userspace may have > synchronization in place to issue writes with multiple threads while > still hitting the write pointer. > > It is not good enough to only impose ordering within a zone if we have > requests in flight for that zone. For the first write to a zone when there > are no writes in flight to that zone, we can not know if the write is at > the write pointer, or if more writes to lower LBA is on the way. Not > without tracking the write pointer. > > With a sequence number, the sequence number can be queue local. It would > not guarantee that writes always happen at the write pointer, but it > would guarantee that requests are not reordered by ublk_drv, which is > all that is required. As long as userspace is issuing at the write > pointer (as they are required to for zoned storage), this solution would > work, even across multiple queues issuing writes to the same zone. > >> >>> >>> But I still think it is an unnecessary hack. We could just fix the driver. >> >> But not sure if it is easy to make io_uring_cmd_complete_in_task() or >> task_work_add to complete request in order efficiently. >> >>> >>>> >>>>> >>>>> But really, the issue should be resolved by changing the way >>>>> ublk_queue_rq() is sending work to uring context with task_add_work(). >>>> >>>> Not sure it can be solved easily given llist is implemented in this way. >>> >>> If we queue requests on a separate queue, we are fine. I can make a >>> patch suggestion. >> >> The problem is that io_uring_cmd_complete_in_task() or task_work_add() may >> re-order the request, can you explain why the issue can be solved by >> separate queue? > > I would suggest to still use task_work_add() to queue a callback, but > instead of carrying the pdu as the callback argument, use it as a > notification that one or more work items are queued. Then we add the pdu > to a hwctx local FIFO queue. > > __ublk_rq_task_work() would then check this FIFO queue and submit all > the requests on the queue to userspace. > > Without further optimization __ublk_rq_task_work() would some times be > called with an empty queue, but this should not be too much overhead. > Maybe we could decide to not call task_work_add() if the hwctx local > queue of pdus is not empty. > >> >>> >>>> >>>>> Fixing in ublksrv is a bit of a hack and will have performance penalty. >>>> >>>> As I mentioned, ordering zoned write request in each zone just takes >>>> a little memory(mapping, or hashing) and the max size of the mapping >>>> table is queue depth, and basically zero cpu cost, not see extra >>>> performance penalty is involved. >>> >>> I could implement all three solutions. 1) fix the dirver, 2) per zone >>> structure and 3) sequence numbers. Then I benchmark them and we will >>> know what works. It's a lot of work though. >> >> Let's prove if the theory for each solution is correct first. > > Alright. > >> >>> >>>> >>>>> >>>>> Also, it can not be good for any storage device to have sequential >>>>> requests delivered in reverse order? Fixing this would benefit all targets. >>>> >>>> Only zoned target has strict ordering requirement which does take cost, block >>>> layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() >>>> can re-order requests in reverse order too. >>> >>> When mq-deadline scheduler is set for a queue, requests are not >>> reordered, right? >> >> In case of batching submission, mq-deadline will order requests by >> LBA in this whole batch. > > That is not what I am seeing in practice. I tried looking through > mq-deadline.c, but I could not find code that would order by LBA. Could > you point me to the code that implements this policy? see the use of the rb tree, deadline_add_rq_rb() and friends. The rb tree maintains the requests in LBA order. The fifo list, maintains an arrival order for starvation control. > >> >>> >>> I am no block layer expert, so I cannot argue about the implementation. >>> But I think that both spinning rust and flash would benefit from having >>> sequential requests delivered in order? Would it not hurt performance to >>> reverse order for sequential requests all the time? I have a hard time >>> understanding why the block layer would do this by default. >>> >>> One thing is to offer no guarantees, but to _always_ reverse the >>> ordering of sequential requests seem a little counter productive to me. >> >> If io_uring_cmd_complete_in_task()/task_work_add() can complete io >> commands to ublksrv in order without extra cost, that is better. > > I agree :) > >> >> But I don't think it is a big deal for HDD. because when these requests >> are re-issued from ublksrv to block layer, deadline or bfq will order >> them too since all are submitted via io_uring in batch. > > As I wrote, that is not what I am seeing in my experiment. Repeating the > output of blktrace from the top of this email: > > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > > Here the read of 6144: > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > > is clearly issued before the read of 4096: > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > > It is not because there are no IO in flight for the target device. Here > is he trace with completions included: > > 259,2 10 1 0.000000000 468 D R 0 + 1024 [fio] > 259,2 10 2 0.000005680 468 D R 1024 + 1024 [fio] > 259,2 10 3 0.000007760 468 D R 2048 + 1024 [fio] > 259,2 10 4 0.000009140 468 D R 3072 + 1024 [fio] > 259,2 10 5 0.000010420 468 D R 4096 + 1024 [fio] > 259,2 10 6 0.000011640 468 D R 5120 + 1024 [fio] > 259,2 10 7 0.000013350 468 D R 6144 + 1024 [fio] > 259,2 10 8 0.000014350 468 D R 7168 + 1024 [fio] > 259,1 10 1 0.000280540 412 D R 6144 + 2048 [ublk] > 259,1 10 2 0.000433260 412 D R 4096 + 2048 [ublk] > 259,1 10 3 0.000603920 412 D R 2048 + 2048 [ublk] > 259,1 10 4 0.000698070 412 D R 0 + 2048 [ublk] > 259,1 10 5 0.000839250 0 C R 6144 + 2048 [0] > 259,2 10 9 0.000891270 412 C R 7168 + 1024 [0] > 259,2 10 10 0.000919780 412 C R 6144 + 1024 [0] > 259,1 10 6 0.001258791 0 C R 4096 + 2048 [0] > 259,2 10 11 0.001306541 412 C R 5120 + 1024 [0] > 259,2 10 12 0.001335291 412 C R 4096 + 1024 [0] > 259,1 10 7 0.001469911 0 C R 2048 + 2048 [0] > 259,2 10 13 0.001518281 412 C R 3072 + 1024 [0] > 259,2 10 14 0.001547041 412 C R 2048 + 1024 [0] > 259,1 10 8 0.001600801 0 C R 0 + 2048 [0] > 259,2 10 15 0.001641871 412 C R 1024 + 1024 [0] > 259,2 10 16 0.001671921 412 C R 0 + 1024 [0] > > This last trace is vanilla Linux 6.0 with mq-deadline enabled for ublkb0(259,2) > and the loopback target nvme0n1(259,1). queue at head that should be queue at tail ? if nvme0n1 is a multi queue device and does not have a scheduler, there may be a lot of "issue directly" that can really destroy the order of requests. > > Maybe I am missing some configuration? > > Best regards, > Andreas -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 13:11 ` Damien Le Moal @ 2022-11-17 13:31 ` Andreas Hindborg 2022-11-18 1:51 ` Damien Le Moal 0 siblings, 1 reply; 27+ messages in thread From: Andreas Hindborg @ 2022-11-17 13:31 UTC (permalink / raw) To: Damien Le Moal; +Cc: Andreas Hindborg, Ming Lei, Jens Axboe, linux-block Damien Le Moal <damien.lemoal@opensource.wdc.com> writes: > On 11/17/22 20:59, Andreas Hindborg wrote: >> >> Ming Lei <ming.lei@redhat.com> writes: >> >>> On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: >>>> >>>> Ming Lei <ming.lei@redhat.com> writes: >>>> >>>>> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >>>>>> >>>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>>> >>>>>>> Hi Andreas, >>>>>>> >>>>>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >>>>>>>> >>>>>>>> Hi Ming, >>>>>>>> >>>>>>>> I have a question regarding ublk. For context, I am working on adding >>>>>>>> zoned storage support to ublk, and you have been very kind to help me on >>>>>>>> Github [1]. >>>>>>>> >>>>>>>> I have a problem with ordering of requests after they are issued to the >>>>>>>> ublk driver. Basically ublk will reverse the ordering of requests that are >>>>>>>> batched. The behavior can be observed with the following fio workload: >>>>>>>> >>>>>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >>>>>>>>> --size=4m --filename=/dev/ublkb0 >>>>>>>> >>>>>>>> For a loopback ublk target I get the following from blktrace: >>>>>>>> >>>>>>>>> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >>>>>>>>> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >>>>>>>>> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >>>>>>>>> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >>>>>>>>> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >>>>>>>>> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >>>>>>>>> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >>>>>>>>> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >>>>>>>>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>>>>>>>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>>>>>>>> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >>>>>>>>> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >>>>>>>> >>>>>>>> And enabling debug prints in ublk shows that the reversal happens when >>>>>>>> ublk defers work to the io_uring context: >>>>>>>> >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >>>>>>>> >>>>>>>> The problem seems to be that the method used to defer work to the >>>>>>>> io_uring context, task_work_add(), is using a stack to queue the >>>>>>>> callbacks. >>>>>>> >>>>>>> Is the observation done on zoned ublk or plain ublk-loop? >>>>>> >>>>>> I collected this trace on my fork, but since the behavior comes from >>>>>> task_work.c using a stack to queue work, it would be present on upstream >>>>>> ublk as well. For completeness, I have verified that this is the case. >>>>>> >>>>>>> If it is plain ublk-loop, the reverse order can be started from >>>>>>> blk_add_rq_to_plug(), but it won't happen for zoned write request >>>>>>> which isn't queued to plug list. >>>>>> >>>>>> I forgot to mention that I set the deadline scheduler for both ublkb0 >>>>>> and the loopback target. No reordering should happen in mq with the >>>>>> deadline scheduler, as far as I understand. >>>>> >>>>> I meant you need to setup one zoned ublk-loop for observing write request >>>>> order, block layer never guarantees request order for other devices. >>>>> >>>>>> >>>>>>> >>>>>>> Otherwise, can you observe zoned write req reorder from ublksrv side? >>>>>>> >>>>>> >>>>>> Yes, the reverse order is observable in ublk server. Reordering happens >>>>>> in ublk kernel driver. >>>>>> >>>>>>>> >>>>>>>> As you probably are aware, writes to zoned storage must >>>>>>>> happen at the write pointer, so when order is lost there is a problem. >>>>>>>> But I would assume that this behavior is also undesirable in other >>>>>>>> circumstances. >>>>>>>> >>>>>>>> I am not sure what is the best approach to change this behavor. Maybe >>>>>>>> having a separate queue for the requests and then only scheduling work >>>>>>>> if a worker is not already processing the queue? >>>>>>>> >>>>>>>> If you like, I can try to implement a fix? >>>>>>> >>>>>>> Yeah, I think zoned write requests could be forwarded to ublk server out of order. >>>>>> >>>>>> In reverse order for requests that are issued without a context switch, >>>>>> as far as I can tell. >>>>>> >>>>>>> >>>>>>> Is it possible for re-order the writes in ublksrv side? I guess it is >>>>>>> be doable: >>>>>>> >>>>>>> 1) zoned write request is sent to ublk_queue_rq() in order always >>>>>>> >>>>>>> 2) when ublksrv zone target/backend code gets zoned write request in each >>>>>>> zone, it can wait until the next expected write comes, then handle the >>>>>>> write and advance write pointer, then repeat the whole process. >>>>>>> >>>>>>> 3) because of 1), the next expected zoned write req is guaranteed to >>>>>>> come without much delay, and the per-zone queue won't be long. >>>>>> >>>>>> I think it is not feasible to have per zone data structures. Instead, I >>>>> >>>>> If one mapping data structure is used for ordering per-zone write >>>>> request, it could be pretty easy, such as C++'s map or hash table, and it >>>>> won't take much memory given the max in-flight IOs are very limited and >>>>> the zone mapping entry can be reused among different zone, and quite easy >>>>> to implement. >>>>> >>>>> Also most of times, the per-zone ordering can be completed in current >>>>> batch(requests completed from single io_uring_enter()), then the extra >>>>> cost could be close to zero, you can simply run the per-zone ordering in >>>>> ->handle_io_background() callback, when all requests could come for each >>>>> zone most of times. >>>>> >>>>>> considered adding a sequence number to each request, and then queue up >>>>>> requests if there is a gap in the numbering. >>>>> >>>>> IMO, that won't be doable, given you don't know when the sequence will be over. >>>> >>>> We would not need to know when the sequence is over. If in ubdsrv we see >>>> request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up >>>> we go ahead and submit all requests from 3 and up, until there is >>>> another gap. If ublk_drv is setting the sequence numbers, >>>> cancelled/aborted requests would not be an issue. >>>> >>>> I think this would be less overhead than having per zone data structure. >>> >>> You can only assign it to zoned write request, but you still have to check >>> the sequence inside each zone, right? Then why not just check LBAs in >>> each zone simply? >> >> We would need to know the zone map, which is not otherwise required. >> Then we would need to track the write pointer for each open zone for >> each queue, so that we can stall writes that are not issued at the write >> pointer. This is in effect all zones, because we cannot track when zones >> are implicitly closed. Then, if different queues are issuing writes to >> the same zone, we need to sync across queues. Userspace may have >> synchronization in place to issue writes with multiple threads while >> still hitting the write pointer. >> >> It is not good enough to only impose ordering within a zone if we have >> requests in flight for that zone. For the first write to a zone when there >> are no writes in flight to that zone, we can not know if the write is at >> the write pointer, or if more writes to lower LBA is on the way. Not >> without tracking the write pointer. >> >> With a sequence number, the sequence number can be queue local. It would >> not guarantee that writes always happen at the write pointer, but it >> would guarantee that requests are not reordered by ublk_drv, which is >> all that is required. As long as userspace is issuing at the write >> pointer (as they are required to for zoned storage), this solution would >> work, even across multiple queues issuing writes to the same zone. >> >>> >>>> >>>> But I still think it is an unnecessary hack. We could just fix the driver. >>> >>> But not sure if it is easy to make io_uring_cmd_complete_in_task() or >>> task_work_add to complete request in order efficiently. >>> >>>> >>>>> >>>>>> >>>>>> But really, the issue should be resolved by changing the way >>>>>> ublk_queue_rq() is sending work to uring context with task_add_work(). >>>>> >>>>> Not sure it can be solved easily given llist is implemented in this way. >>>> >>>> If we queue requests on a separate queue, we are fine. I can make a >>>> patch suggestion. >>> >>> The problem is that io_uring_cmd_complete_in_task() or task_work_add() may >>> re-order the request, can you explain why the issue can be solved by >>> separate queue? >> >> I would suggest to still use task_work_add() to queue a callback, but >> instead of carrying the pdu as the callback argument, use it as a >> notification that one or more work items are queued. Then we add the pdu >> to a hwctx local FIFO queue. >> >> __ublk_rq_task_work() would then check this FIFO queue and submit all >> the requests on the queue to userspace. >> >> Without further optimization __ublk_rq_task_work() would some times be >> called with an empty queue, but this should not be too much overhead. >> Maybe we could decide to not call task_work_add() if the hwctx local >> queue of pdus is not empty. >> >>> >>>> >>>>> >>>>>> Fixing in ublksrv is a bit of a hack and will have performance penalty. >>>>> >>>>> As I mentioned, ordering zoned write request in each zone just takes >>>>> a little memory(mapping, or hashing) and the max size of the mapping >>>>> table is queue depth, and basically zero cpu cost, not see extra >>>>> performance penalty is involved. >>>> >>>> I could implement all three solutions. 1) fix the dirver, 2) per zone >>>> structure and 3) sequence numbers. Then I benchmark them and we will >>>> know what works. It's a lot of work though. >>> >>> Let's prove if the theory for each solution is correct first. >> >> Alright. >> >>> >>>> >>>>> >>>>>> >>>>>> Also, it can not be good for any storage device to have sequential >>>>>> requests delivered in reverse order? Fixing this would benefit all targets. >>>>> >>>>> Only zoned target has strict ordering requirement which does take cost, block >>>>> layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() >>>>> can re-order requests in reverse order too. >>>> >>>> When mq-deadline scheduler is set for a queue, requests are not >>>> reordered, right? >>> >>> In case of batching submission, mq-deadline will order requests by >>> LBA in this whole batch. >> >> That is not what I am seeing in practice. I tried looking through >> mq-deadline.c, but I could not find code that would order by LBA. Could >> you point me to the code that implements this policy? > > see the use of the rb tree, deadline_add_rq_rb() and friends. The rb tree > maintains the requests in LBA order. The fifo list, maintains an arrival > order for starvation control. Thanks! > >> >>> >>>> >>>> I am no block layer expert, so I cannot argue about the implementation. >>>> But I think that both spinning rust and flash would benefit from having >>>> sequential requests delivered in order? Would it not hurt performance to >>>> reverse order for sequential requests all the time? I have a hard time >>>> understanding why the block layer would do this by default. >>>> >>>> One thing is to offer no guarantees, but to _always_ reverse the >>>> ordering of sequential requests seem a little counter productive to me. >>> >>> If io_uring_cmd_complete_in_task()/task_work_add() can complete io >>> commands to ublksrv in order without extra cost, that is better. >> >> I agree :) >> >>> >>> But I don't think it is a big deal for HDD. because when these requests >>> are re-issued from ublksrv to block layer, deadline or bfq will order >>> them too since all are submitted via io_uring in batch. >> >> As I wrote, that is not what I am seeing in my experiment. Repeating the >> output of blktrace from the top of this email: >> >> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >> >> Here the read of 6144: >> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >> >> is clearly issued before the read of 4096: >> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >> >> It is not because there are no IO in flight for the target device. Here >> is he trace with completions included: >> >> 259,2 10 1 0.000000000 468 D R 0 + 1024 [fio] >> 259,2 10 2 0.000005680 468 D R 1024 + 1024 [fio] >> 259,2 10 3 0.000007760 468 D R 2048 + 1024 [fio] >> 259,2 10 4 0.000009140 468 D R 3072 + 1024 [fio] >> 259,2 10 5 0.000010420 468 D R 4096 + 1024 [fio] >> 259,2 10 6 0.000011640 468 D R 5120 + 1024 [fio] >> 259,2 10 7 0.000013350 468 D R 6144 + 1024 [fio] >> 259,2 10 8 0.000014350 468 D R 7168 + 1024 [fio] >> 259,1 10 1 0.000280540 412 D R 6144 + 2048 [ublk] >> 259,1 10 2 0.000433260 412 D R 4096 + 2048 [ublk] >> 259,1 10 3 0.000603920 412 D R 2048 + 2048 [ublk] >> 259,1 10 4 0.000698070 412 D R 0 + 2048 [ublk] >> 259,1 10 5 0.000839250 0 C R 6144 + 2048 [0] >> 259,2 10 9 0.000891270 412 C R 7168 + 1024 [0] >> 259,2 10 10 0.000919780 412 C R 6144 + 1024 [0] >> 259,1 10 6 0.001258791 0 C R 4096 + 2048 [0] >> 259,2 10 11 0.001306541 412 C R 5120 + 1024 [0] >> 259,2 10 12 0.001335291 412 C R 4096 + 1024 [0] >> 259,1 10 7 0.001469911 0 C R 2048 + 2048 [0] >> 259,2 10 13 0.001518281 412 C R 3072 + 1024 [0] >> 259,2 10 14 0.001547041 412 C R 2048 + 1024 [0] >> 259,1 10 8 0.001600801 0 C R 0 + 2048 [0] >> 259,2 10 15 0.001641871 412 C R 1024 + 1024 [0] >> 259,2 10 16 0.001671921 412 C R 0 + 1024 [0] >> >> This last trace is vanilla Linux 6.0 with mq-deadline enabled for ublkb0(259,2) >> and the loopback target nvme0n1(259,1). > > queue at head that should be queue at tail ? if nvme0n1 is a multi queue > device and does not have a scheduler, there may be a lot of "issue > directly" that can really destroy the order of requests. It is a regular nvme drive handled by the nvme driver. So multi queue with mq-deadline set. Unless I am reading the trace wrong there are 3 reads issued to 259,1 sectors 6144, 4096 and 2048 in that order. This is the order userspace is issuing the reads, so the trace matches what I would expect. But now you tell me that the mq-deadline scheduler should reorder the requests based on LBA? But maybe the ordering by LBA is only for writes? I'll rerun the test with writes. Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 13:31 ` Andreas Hindborg @ 2022-11-18 1:51 ` Damien Le Moal 2022-11-18 9:29 ` Andreas Hindborg 0 siblings, 1 reply; 27+ messages in thread From: Damien Le Moal @ 2022-11-18 1:51 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Ming Lei, Jens Axboe, linux-block On 11/17/22 22:31, Andreas Hindborg wrote: > > Damien Le Moal <damien.lemoal@opensource.wdc.com> writes: > >> On 11/17/22 20:59, Andreas Hindborg wrote: >>> >>> Ming Lei <ming.lei@redhat.com> writes: >>> >>>> On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: >>>>> >>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>> >>>>>> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >>>>>>> >>>>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>>>> >>>>>>>> Hi Andreas, >>>>>>>> >>>>>>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >>>>>>>>> >>>>>>>>> Hi Ming, >>>>>>>>> >>>>>>>>> I have a question regarding ublk. For context, I am working on adding >>>>>>>>> zoned storage support to ublk, and you have been very kind to help me on >>>>>>>>> Github [1]. >>>>>>>>> >>>>>>>>> I have a problem with ordering of requests after they are issued to the >>>>>>>>> ublk driver. Basically ublk will reverse the ordering of requests that are >>>>>>>>> batched. The behavior can be observed with the following fio workload: >>>>>>>>> >>>>>>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >>>>>>>>>> --size=4m --filename=/dev/ublkb0 >>>>>>>>> >>>>>>>>> For a loopback ublk target I get the following from blktrace: >>>>>>>>> >>>>>>>>>> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >>>>>>>>>> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >>>>>>>>>> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >>>>>>>>>> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >>>>>>>>>> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >>>>>>>>>> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >>>>>>>>>> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >>>>>>>>>> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >>>>>>>>>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>>>>>>>>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>>>>>>>>> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >>>>>>>>>> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >>>>>>>>> >>>>>>>>> And enabling debug prints in ublk shows that the reversal happens when >>>>>>>>> ublk defers work to the io_uring context: >>>>>>>>> >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >>>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >>>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >>>>>>>>> >>>>>>>>> The problem seems to be that the method used to defer work to the >>>>>>>>> io_uring context, task_work_add(), is using a stack to queue the >>>>>>>>> callbacks. >>>>>>>> >>>>>>>> Is the observation done on zoned ublk or plain ublk-loop? >>>>>>> >>>>>>> I collected this trace on my fork, but since the behavior comes from >>>>>>> task_work.c using a stack to queue work, it would be present on upstream >>>>>>> ublk as well. For completeness, I have verified that this is the case. >>>>>>> >>>>>>>> If it is plain ublk-loop, the reverse order can be started from >>>>>>>> blk_add_rq_to_plug(), but it won't happen for zoned write request >>>>>>>> which isn't queued to plug list. >>>>>>> >>>>>>> I forgot to mention that I set the deadline scheduler for both ublkb0 >>>>>>> and the loopback target. No reordering should happen in mq with the >>>>>>> deadline scheduler, as far as I understand. >>>>>> >>>>>> I meant you need to setup one zoned ublk-loop for observing write request >>>>>> order, block layer never guarantees request order for other devices. >>>>>> >>>>>>> >>>>>>>> >>>>>>>> Otherwise, can you observe zoned write req reorder from ublksrv side? >>>>>>>> >>>>>>> >>>>>>> Yes, the reverse order is observable in ublk server. Reordering happens >>>>>>> in ublk kernel driver. >>>>>>> >>>>>>>>> >>>>>>>>> As you probably are aware, writes to zoned storage must >>>>>>>>> happen at the write pointer, so when order is lost there is a problem. >>>>>>>>> But I would assume that this behavior is also undesirable in other >>>>>>>>> circumstances. >>>>>>>>> >>>>>>>>> I am not sure what is the best approach to change this behavor. Maybe >>>>>>>>> having a separate queue for the requests and then only scheduling work >>>>>>>>> if a worker is not already processing the queue? >>>>>>>>> >>>>>>>>> If you like, I can try to implement a fix? >>>>>>>> >>>>>>>> Yeah, I think zoned write requests could be forwarded to ublk server out of order. >>>>>>> >>>>>>> In reverse order for requests that are issued without a context switch, >>>>>>> as far as I can tell. >>>>>>> >>>>>>>> >>>>>>>> Is it possible for re-order the writes in ublksrv side? I guess it is >>>>>>>> be doable: >>>>>>>> >>>>>>>> 1) zoned write request is sent to ublk_queue_rq() in order always >>>>>>>> >>>>>>>> 2) when ublksrv zone target/backend code gets zoned write request in each >>>>>>>> zone, it can wait until the next expected write comes, then handle the >>>>>>>> write and advance write pointer, then repeat the whole process. >>>>>>>> >>>>>>>> 3) because of 1), the next expected zoned write req is guaranteed to >>>>>>>> come without much delay, and the per-zone queue won't be long. >>>>>>> >>>>>>> I think it is not feasible to have per zone data structures. Instead, I >>>>>> >>>>>> If one mapping data structure is used for ordering per-zone write >>>>>> request, it could be pretty easy, such as C++'s map or hash table, and it >>>>>> won't take much memory given the max in-flight IOs are very limited and >>>>>> the zone mapping entry can be reused among different zone, and quite easy >>>>>> to implement. >>>>>> >>>>>> Also most of times, the per-zone ordering can be completed in current >>>>>> batch(requests completed from single io_uring_enter()), then the extra >>>>>> cost could be close to zero, you can simply run the per-zone ordering in >>>>>> ->handle_io_background() callback, when all requests could come for each >>>>>> zone most of times. >>>>>> >>>>>>> considered adding a sequence number to each request, and then queue up >>>>>>> requests if there is a gap in the numbering. >>>>>> >>>>>> IMO, that won't be doable, given you don't know when the sequence will be over. >>>>> >>>>> We would not need to know when the sequence is over. If in ubdsrv we see >>>>> request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up >>>>> we go ahead and submit all requests from 3 and up, until there is >>>>> another gap. If ublk_drv is setting the sequence numbers, >>>>> cancelled/aborted requests would not be an issue. >>>>> >>>>> I think this would be less overhead than having per zone data structure. >>>> >>>> You can only assign it to zoned write request, but you still have to check >>>> the sequence inside each zone, right? Then why not just check LBAs in >>>> each zone simply? >>> >>> We would need to know the zone map, which is not otherwise required. >>> Then we would need to track the write pointer for each open zone for >>> each queue, so that we can stall writes that are not issued at the write >>> pointer. This is in effect all zones, because we cannot track when zones >>> are implicitly closed. Then, if different queues are issuing writes to >>> the same zone, we need to sync across queues. Userspace may have >>> synchronization in place to issue writes with multiple threads while >>> still hitting the write pointer. >>> >>> It is not good enough to only impose ordering within a zone if we have >>> requests in flight for that zone. For the first write to a zone when there >>> are no writes in flight to that zone, we can not know if the write is at >>> the write pointer, or if more writes to lower LBA is on the way. Not >>> without tracking the write pointer. >>> >>> With a sequence number, the sequence number can be queue local. It would >>> not guarantee that writes always happen at the write pointer, but it >>> would guarantee that requests are not reordered by ublk_drv, which is >>> all that is required. As long as userspace is issuing at the write >>> pointer (as they are required to for zoned storage), this solution would >>> work, even across multiple queues issuing writes to the same zone. >>> >>>> >>>>> >>>>> But I still think it is an unnecessary hack. We could just fix the driver. >>>> >>>> But not sure if it is easy to make io_uring_cmd_complete_in_task() or >>>> task_work_add to complete request in order efficiently. >>>> >>>>> >>>>>> >>>>>>> >>>>>>> But really, the issue should be resolved by changing the way >>>>>>> ublk_queue_rq() is sending work to uring context with task_add_work(). >>>>>> >>>>>> Not sure it can be solved easily given llist is implemented in this way. >>>>> >>>>> If we queue requests on a separate queue, we are fine. I can make a >>>>> patch suggestion. >>>> >>>> The problem is that io_uring_cmd_complete_in_task() or task_work_add() may >>>> re-order the request, can you explain why the issue can be solved by >>>> separate queue? >>> >>> I would suggest to still use task_work_add() to queue a callback, but >>> instead of carrying the pdu as the callback argument, use it as a >>> notification that one or more work items are queued. Then we add the pdu >>> to a hwctx local FIFO queue. >>> >>> __ublk_rq_task_work() would then check this FIFO queue and submit all >>> the requests on the queue to userspace. >>> >>> Without further optimization __ublk_rq_task_work() would some times be >>> called with an empty queue, but this should not be too much overhead. >>> Maybe we could decide to not call task_work_add() if the hwctx local >>> queue of pdus is not empty. >>> >>>> >>>>> >>>>>> >>>>>>> Fixing in ublksrv is a bit of a hack and will have performance penalty. >>>>>> >>>>>> As I mentioned, ordering zoned write request in each zone just takes >>>>>> a little memory(mapping, or hashing) and the max size of the mapping >>>>>> table is queue depth, and basically zero cpu cost, not see extra >>>>>> performance penalty is involved. >>>>> >>>>> I could implement all three solutions. 1) fix the dirver, 2) per zone >>>>> structure and 3) sequence numbers. Then I benchmark them and we will >>>>> know what works. It's a lot of work though. >>>> >>>> Let's prove if the theory for each solution is correct first. >>> >>> Alright. >>> >>>> >>>>> >>>>>> >>>>>>> >>>>>>> Also, it can not be good for any storage device to have sequential >>>>>>> requests delivered in reverse order? Fixing this would benefit all targets. >>>>>> >>>>>> Only zoned target has strict ordering requirement which does take cost, block >>>>>> layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() >>>>>> can re-order requests in reverse order too. >>>>> >>>>> When mq-deadline scheduler is set for a queue, requests are not >>>>> reordered, right? >>>> >>>> In case of batching submission, mq-deadline will order requests by >>>> LBA in this whole batch. >>> >>> That is not what I am seeing in practice. I tried looking through >>> mq-deadline.c, but I could not find code that would order by LBA. Could >>> you point me to the code that implements this policy? >> >> see the use of the rb tree, deadline_add_rq_rb() and friends. The rb tree >> maintains the requests in LBA order. The fifo list, maintains an arrival >> order for starvation control. > > Thanks! > >> >>> >>>> >>>>> >>>>> I am no block layer expert, so I cannot argue about the implementation. >>>>> But I think that both spinning rust and flash would benefit from having >>>>> sequential requests delivered in order? Would it not hurt performance to >>>>> reverse order for sequential requests all the time? I have a hard time >>>>> understanding why the block layer would do this by default. >>>>> >>>>> One thing is to offer no guarantees, but to _always_ reverse the >>>>> ordering of sequential requests seem a little counter productive to me. >>>> >>>> If io_uring_cmd_complete_in_task()/task_work_add() can complete io >>>> commands to ublksrv in order without extra cost, that is better. >>> >>> I agree :) >>> >>>> >>>> But I don't think it is a big deal for HDD. because when these requests >>>> are re-issued from ublksrv to block layer, deadline or bfq will order >>>> them too since all are submitted via io_uring in batch. >>> >>> As I wrote, that is not what I am seeing in my experiment. Repeating the >>> output of blktrace from the top of this email: >>> >>> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >>> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >>> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >>> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >>> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >>> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >>> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >>> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >>> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >>> >>> Here the read of 6144: >>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>> >>> is clearly issued before the read of 4096: >>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>> >>> It is not because there are no IO in flight for the target device. Here >>> is he trace with completions included: >>> >>> 259,2 10 1 0.000000000 468 D R 0 + 1024 [fio] >>> 259,2 10 2 0.000005680 468 D R 1024 + 1024 [fio] >>> 259,2 10 3 0.000007760 468 D R 2048 + 1024 [fio] >>> 259,2 10 4 0.000009140 468 D R 3072 + 1024 [fio] >>> 259,2 10 5 0.000010420 468 D R 4096 + 1024 [fio] >>> 259,2 10 6 0.000011640 468 D R 5120 + 1024 [fio] >>> 259,2 10 7 0.000013350 468 D R 6144 + 1024 [fio] >>> 259,2 10 8 0.000014350 468 D R 7168 + 1024 [fio] >>> 259,1 10 1 0.000280540 412 D R 6144 + 2048 [ublk] >>> 259,1 10 2 0.000433260 412 D R 4096 + 2048 [ublk] >>> 259,1 10 3 0.000603920 412 D R 2048 + 2048 [ublk] >>> 259,1 10 4 0.000698070 412 D R 0 + 2048 [ublk] >>> 259,1 10 5 0.000839250 0 C R 6144 + 2048 [0] >>> 259,2 10 9 0.000891270 412 C R 7168 + 1024 [0] >>> 259,2 10 10 0.000919780 412 C R 6144 + 1024 [0] >>> 259,1 10 6 0.001258791 0 C R 4096 + 2048 [0] >>> 259,2 10 11 0.001306541 412 C R 5120 + 1024 [0] >>> 259,2 10 12 0.001335291 412 C R 4096 + 1024 [0] >>> 259,1 10 7 0.001469911 0 C R 2048 + 2048 [0] >>> 259,2 10 13 0.001518281 412 C R 3072 + 1024 [0] >>> 259,2 10 14 0.001547041 412 C R 2048 + 1024 [0] >>> 259,1 10 8 0.001600801 0 C R 0 + 2048 [0] >>> 259,2 10 15 0.001641871 412 C R 1024 + 1024 [0] >>> 259,2 10 16 0.001671921 412 C R 0 + 1024 [0] >>> >>> This last trace is vanilla Linux 6.0 with mq-deadline enabled for ublkb0(259,2) >>> and the loopback target nvme0n1(259,1). >> >> queue at head that should be queue at tail ? if nvme0n1 is a multi queue >> device and does not have a scheduler, there may be a lot of "issue >> directly" that can really destroy the order of requests. > > It is a regular nvme drive handled by the nvme driver. So multi queue > with mq-deadline set. > > Unless I am reading the trace wrong there are 3 reads issued to 259,1 > sectors 6144, 4096 and 2048 in that order. This is the order userspace > is issuing the reads, so the trace matches what I would expect. But now > you tell me that the mq-deadline scheduler should reorder the requests > based on LBA? But maybe the ordering by LBA is only for writes? I'll > rerun the test with writes. Nope. Reordering is for reads too. But reordering happens only if the commands are issued in batch or if the drive is busy (all tags used) and commands accumulate in the scheduler. If the user issues these reads one at a time, they will go through the scheduler without reordering. You can see that with blktrace: If you see a G-Q-I-D sequence for each read, then it is not a batch and they are going through as is. If you see G-Q-I-G-Q-I-G-Q-I-D-D-D, then that was a batch and the "D" should be ordered in increasing LBA, even if the I (insert) where out of order. > > Best regards, > Andreas > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 1:51 ` Damien Le Moal @ 2022-11-18 9:29 ` Andreas Hindborg 0 siblings, 0 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-18 9:29 UTC (permalink / raw) To: Damien Le Moal; +Cc: Andreas Hindborg, Ming Lei, Jens Axboe, linux-block Damien Le Moal <damien.lemoal@opensource.wdc.com> writes: [...] >> Unless I am reading the trace wrong there are 3 reads issued to 259,1 >> sectors 6144, 4096 and 2048 in that order. This is the order userspace >> is issuing the reads, so the trace matches what I would expect. But now >> you tell me that the mq-deadline scheduler should reorder the requests >> based on LBA? But maybe the ordering by LBA is only for writes? I'll >> rerun the test with writes. > > Nope. Reordering is for reads too. But reordering happens only if the > commands are issued in batch or if the drive is busy (all tags used) and > commands accumulate in the scheduler. > > If the user issues these reads one at a time, they will go through the > scheduler without reordering. You can see that with blktrace: > If you see a G-Q-I-D sequence for each read, then it is not a batch and > they are going through as is. If you see G-Q-I-G-Q-I-G-Q-I-D-D-D, then > that was a batch and the "D" should be ordered in increasing LBA, even if > the I (insert) where out of order. I see. Looking at the full trace it does show that only part of the series of the reversed requests from ublksrv are batched up. They are not submitted in one go from ublksrv. Thanks, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 11:59 ` Andreas Hindborg 2022-11-17 13:11 ` Damien Le Moal @ 2022-11-18 4:12 ` Ming Lei 2022-11-18 4:35 ` Damien Le Moal 1 sibling, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-18 4:12 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Jens Axboe, linux-block, ming.lei, Damien Le Moal On Thu, Nov 17, 2022 at 12:59:48PM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote: > >> > >> Ming Lei <ming.lei@redhat.com> writes: > >> > >> > On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: > >> >> > >> >> Ming Lei <ming.lei@redhat.com> writes: > >> >> > >> >> > Hi Andreas, > >> >> > > >> >> > On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: > >> >> >> > >> >> >> Hi Ming, > >> >> >> > >> >> >> I have a question regarding ublk. For context, I am working on adding > >> >> >> zoned storage support to ublk, and you have been very kind to help me on > >> >> >> Github [1]. > >> >> >> > >> >> >> I have a problem with ordering of requests after they are issued to the > >> >> >> ublk driver. Basically ublk will reverse the ordering of requests that are > >> >> >> batched. The behavior can be observed with the following fio workload: > >> >> >> > >> >> >> > fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 > >> >> >> > --size=4m --filename=/dev/ublkb0 > >> >> >> > >> >> >> For a loopback ublk target I get the following from blktrace: > >> >> >> > >> >> >> > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > >> >> >> > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > >> >> >> > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > >> >> >> > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > >> >> >> > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > >> >> >> > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > >> >> >> > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > >> >> >> > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > >> >> >> > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > >> >> >> > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > >> >> >> > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > >> >> >> > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > >> >> >> > >> >> >> And enabling debug prints in ublk shows that the reversal happens when > >> >> >> ublk defers work to the io_uring context: > >> >> >> > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 180, sect 0 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 > >> >> >> > kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 > >> >> >> > kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 > >> >> >> > >> >> >> The problem seems to be that the method used to defer work to the > >> >> >> io_uring context, task_work_add(), is using a stack to queue the > >> >> >> callbacks. > >> >> > > >> >> > Is the observation done on zoned ublk or plain ublk-loop? > >> >> > >> >> I collected this trace on my fork, but since the behavior comes from > >> >> task_work.c using a stack to queue work, it would be present on upstream > >> >> ublk as well. For completeness, I have verified that this is the case. > >> >> > >> >> > If it is plain ublk-loop, the reverse order can be started from > >> >> > blk_add_rq_to_plug(), but it won't happen for zoned write request > >> >> > which isn't queued to plug list. > >> >> > >> >> I forgot to mention that I set the deadline scheduler for both ublkb0 > >> >> and the loopback target. No reordering should happen in mq with the > >> >> deadline scheduler, as far as I understand. > >> > > >> > I meant you need to setup one zoned ublk-loop for observing write request > >> > order, block layer never guarantees request order for other devices. > >> > > >> >> > >> >> > > >> >> > Otherwise, can you observe zoned write req reorder from ublksrv side? > >> >> > > >> >> > >> >> Yes, the reverse order is observable in ublk server. Reordering happens > >> >> in ublk kernel driver. > >> >> > >> >> >> > >> >> >> As you probably are aware, writes to zoned storage must > >> >> >> happen at the write pointer, so when order is lost there is a problem. > >> >> >> But I would assume that this behavior is also undesirable in other > >> >> >> circumstances. > >> >> >> > >> >> >> I am not sure what is the best approach to change this behavor. Maybe > >> >> >> having a separate queue for the requests and then only scheduling work > >> >> >> if a worker is not already processing the queue? > >> >> >> > >> >> >> If you like, I can try to implement a fix? > >> >> > > >> >> > Yeah, I think zoned write requests could be forwarded to ublk server out of order. > >> >> > >> >> In reverse order for requests that are issued without a context switch, > >> >> as far as I can tell. > >> >> > >> >> > > >> >> > Is it possible for re-order the writes in ublksrv side? I guess it is > >> >> > be doable: > >> >> > > >> >> > 1) zoned write request is sent to ublk_queue_rq() in order always > >> >> > > >> >> > 2) when ublksrv zone target/backend code gets zoned write request in each > >> >> > zone, it can wait until the next expected write comes, then handle the > >> >> > write and advance write pointer, then repeat the whole process. > >> >> > > >> >> > 3) because of 1), the next expected zoned write req is guaranteed to > >> >> > come without much delay, and the per-zone queue won't be long. > >> >> > >> >> I think it is not feasible to have per zone data structures. Instead, I > >> > > >> > If one mapping data structure is used for ordering per-zone write > >> > request, it could be pretty easy, such as C++'s map or hash table, and it > >> > won't take much memory given the max in-flight IOs are very limited and > >> > the zone mapping entry can be reused among different zone, and quite easy > >> > to implement. > >> > > >> > Also most of times, the per-zone ordering can be completed in current > >> > batch(requests completed from single io_uring_enter()), then the extra > >> > cost could be close to zero, you can simply run the per-zone ordering in > >> > ->handle_io_background() callback, when all requests could come for each > >> > zone most of times. > >> > > >> >> considered adding a sequence number to each request, and then queue up > >> >> requests if there is a gap in the numbering. > >> > > >> > IMO, that won't be doable, given you don't know when the sequence will be over. > >> > >> We would not need to know when the sequence is over. If in ubdsrv we see > >> request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up > >> we go ahead and submit all requests from 3 and up, until there is > >> another gap. If ublk_drv is setting the sequence numbers, > >> cancelled/aborted requests would not be an issue. > >> > >> I think this would be less overhead than having per zone data structure. > > > > You can only assign it to zoned write request, but you still have to check > > the sequence inside each zone, right? Then why not just check LBAs in > > each zone simply? > > We would need to know the zone map, which is not otherwise required. > Then we would need to track the write pointer for each open zone for > each queue, so that we can stall writes that are not issued at the write > pointer. This is in effect all zones, because we cannot track when zones > are implicitly closed. Then, if different queues are issuing writes to Can you explain "implicitly closed" state a bit? From https://zonedstorage.io/docs/introduction/zoned-storage, only the following words are mentioned about closed state: ```Conversely, implicitly or explicitly opened zoned can be transitioned to the closed state using the CLOSE ZONE command.``` zone info can be cached in the mapping(hash table)(zone sector is the key, and zone info is the value), which can be implemented as one LRU style. If any zone info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for obtaining the zone info. > the same zone, we need to sync across queues. Userspace may have > synchronization in place to issue writes with multiple threads while > still hitting the write pointer. You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() in order, no matter MQ or SQ. Yes, it could be issue from multiple queues for ublksrv, which doesn't sync among multiple queues. But per-zone re-order still can solve the issue, just need one lock for each zone to cover the MQ re-order. > > It is not good enough to only impose ordering within a zone if we have > requests in flight for that zone. For the first write to a zone when there > are no writes in flight to that zone, we can not know if the write is at > the write pointer, or if more writes to lower LBA is on the way. Not > without tracking the write pointer. I did mean using write pointer for re-ordering IOs in each zone, which is supposed to be available for software by ->report_zones(), and software can cache the info in one mapping table(zone info cache). > > With a sequence number, the sequence number can be queue local. It would > not guarantee that writes always happen at the write pointer, but it > would guarantee that requests are not reordered by ublk_drv, which is > all that is required. As long as userspace is issuing at the write > pointer (as they are required to for zoned storage), this solution would > work, even across multiple queues issuing writes to the same zone. > > > > >> > >> But I still think it is an unnecessary hack. We could just fix the driver. > > > > But not sure if it is easy to make io_uring_cmd_complete_in_task() or > > task_work_add to complete request in order efficiently. > > > >> > >> > > >> >> > >> >> But really, the issue should be resolved by changing the way > >> >> ublk_queue_rq() is sending work to uring context with task_add_work(). > >> > > >> > Not sure it can be solved easily given llist is implemented in this way. > >> > >> If we queue requests on a separate queue, we are fine. I can make a > >> patch suggestion. > > > > The problem is that io_uring_cmd_complete_in_task() or task_work_add() may > > re-order the request, can you explain why the issue can be solved by > > separate queue? > > I would suggest to still use task_work_add() to queue a callback, but > instead of carrying the pdu as the callback argument, use it as a > notification that one or more work items are queued. Then we add the pdu > to a hwctx local FIFO queue. > > __ublk_rq_task_work() would then check this FIFO queue and submit all > the requests on the queue to userspace. The thing is that you have to get one request to use its ucmd or task_work for notifying ubq daemon task to complete io commands: But which request can you use? - if the request isn't dequeued in ublk io submission context(either the last one of batching is observed or ->commit_rqs() is called), the chosen request could be handled already or being handled in ubq daemon task context, race or uaf - if the request is dequeued in ublk io submission context, reverse could be caused because another request can be used for the notification, that is the current situation, but we can order the batch before dequeuing the whole batch. Also it only orders IO for single queue, zoned write IO from multiple queues can't be ordered in this way. Just wondering how NVMe/zoned handles MQ zoned write. I guess it is still done inside FW/hardware? > > Without further optimization __ublk_rq_task_work() would some times be > called with an empty queue, but this should not be too much overhead. > Maybe we could decide to not call task_work_add() if the hwctx local > queue of pdus is not empty. > > > > >> > >> > > >> >> Fixing in ublksrv is a bit of a hack and will have performance penalty. > >> > > >> > As I mentioned, ordering zoned write request in each zone just takes > >> > a little memory(mapping, or hashing) and the max size of the mapping > >> > table is queue depth, and basically zero cpu cost, not see extra > >> > performance penalty is involved. > >> > >> I could implement all three solutions. 1) fix the dirver, 2) per zone > >> structure and 3) sequence numbers. Then I benchmark them and we will > >> know what works. It's a lot of work though. > > > > Let's prove if the theory for each solution is correct first. > > Alright. > > > > >> > >> > > >> >> > >> >> Also, it can not be good for any storage device to have sequential > >> >> requests delivered in reverse order? Fixing this would benefit all targets. > >> > > >> > Only zoned target has strict ordering requirement which does take cost, block > >> > layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() > >> > can re-order requests in reverse order too. > >> > >> When mq-deadline scheduler is set for a queue, requests are not > >> reordered, right? > > > > In case of batching submission, mq-deadline will order requests by > > LBA in this whole batch. > > That is not what I am seeing in practice. I tried looking through > mq-deadline.c, but I could not find code that would order by LBA. Could > you point me to the code that implements this policy? Damien has provided the code, and it is done when inserting request(s) into scheduler queue. > > > > >> > >> I am no block layer expert, so I cannot argue about the implementation. > >> But I think that both spinning rust and flash would benefit from having > >> sequential requests delivered in order? Would it not hurt performance to > >> reverse order for sequential requests all the time? I have a hard time > >> understanding why the block layer would do this by default. > >> > >> One thing is to offer no guarantees, but to _always_ reverse the > >> ordering of sequential requests seem a little counter productive to me. > > > > If io_uring_cmd_complete_in_task()/task_work_add() can complete io > > commands to ublksrv in order without extra cost, that is better. > > I agree :) > > > > > But I don't think it is a big deal for HDD. because when these requests > > are re-issued from ublksrv to block layer, deadline or bfq will order > > them too since all are submitted via io_uring in batch. > > As I wrote, that is not what I am seeing in my experiment. Repeating the > output of blktrace from the top of this email: > > 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] > 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] > 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] > 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] > 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] > 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] > 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] > 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] > 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] > > Here the read of 6144: > 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] > > is clearly issued before the read of 4096: > 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] > > It is not because there are no IO in flight for the target device. Here > is he trace with completions included: > > 259,2 10 1 0.000000000 468 D R 0 + 1024 [fio] > 259,2 10 2 0.000005680 468 D R 1024 + 1024 [fio] > 259,2 10 3 0.000007760 468 D R 2048 + 1024 [fio] > 259,2 10 4 0.000009140 468 D R 3072 + 1024 [fio] > 259,2 10 5 0.000010420 468 D R 4096 + 1024 [fio] > 259,2 10 6 0.000011640 468 D R 5120 + 1024 [fio] > 259,2 10 7 0.000013350 468 D R 6144 + 1024 [fio] > 259,2 10 8 0.000014350 468 D R 7168 + 1024 [fio] > 259,1 10 1 0.000280540 412 D R 6144 + 2048 [ublk] > 259,1 10 2 0.000433260 412 D R 4096 + 2048 [ublk] > 259,1 10 3 0.000603920 412 D R 2048 + 2048 [ublk] > 259,1 10 4 0.000698070 412 D R 0 + 2048 [ublk] > 259,1 10 5 0.000839250 0 C R 6144 + 2048 [0] > 259,2 10 9 0.000891270 412 C R 7168 + 1024 [0] > 259,2 10 10 0.000919780 412 C R 6144 + 1024 [0] > 259,1 10 6 0.001258791 0 C R 4096 + 2048 [0] > 259,2 10 11 0.001306541 412 C R 5120 + 1024 [0] > 259,2 10 12 0.001335291 412 C R 4096 + 1024 [0] > 259,1 10 7 0.001469911 0 C R 2048 + 2048 [0] > 259,2 10 13 0.001518281 412 C R 3072 + 1024 [0] > 259,2 10 14 0.001547041 412 C R 2048 + 1024 [0] > 259,1 10 8 0.001600801 0 C R 0 + 2048 [0] > 259,2 10 15 0.001641871 412 C R 1024 + 1024 [0] > 259,2 10 16 0.001671921 412 C R 0 + 1024 [0] > > This last trace is vanilla Linux 6.0 with mq-deadline enabled for ublkb0(259,2) > and the loopback target nvme0n1(259,1). > > Maybe I am missing some configuration? I meant if backing IOs are re-submitted in batch, mq-deadline should re-order them. In the above trace, I guess they may not return from same io_uring_enter(), then not re-submitted in single batch. Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 4:12 ` Ming Lei @ 2022-11-18 4:35 ` Damien Le Moal 2022-11-18 6:07 ` Ming Lei 0 siblings, 1 reply; 27+ messages in thread From: Damien Le Moal @ 2022-11-18 4:35 UTC (permalink / raw) To: Ming Lei, Andreas Hindborg; +Cc: Jens Axboe, linux-block On 11/18/22 13:12, Ming Lei wrote: [...] >>> You can only assign it to zoned write request, but you still have to check >>> the sequence inside each zone, right? Then why not just check LBAs in >>> each zone simply? >> >> We would need to know the zone map, which is not otherwise required. >> Then we would need to track the write pointer for each open zone for >> each queue, so that we can stall writes that are not issued at the write >> pointer. This is in effect all zones, because we cannot track when zones >> are implicitly closed. Then, if different queues are issuing writes to > > Can you explain "implicitly closed" state a bit? > > From https://zonedstorage.io/docs/introduction/zoned-storage, only the > following words are mentioned about closed state: > > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the > closed state using the CLOSE ZONE command.``` When a write is issued to an empty or closed zone, the drive will automatically transition the zone into the implicit open state. This is called implicit open because the host did not (explicitly) issue an open zone command. When there are too many implicitly open zones, the drive may choose to close one of the implicitly opened zone to implicitly open the zone that is a target for a write command. Simple in a nutshell. This is done so that the drive can work with a limited set of resources needed to handle open zones, that is, zones that are being written. There are some more nasty details to all this with limits on the number of open zones and active zones that a zoned drive may have. > > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone > info is the value), which can be implemented as one LRU style. If any zone > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for > obtaining the zone info. > >> the same zone, we need to sync across queues. Userspace may have >> synchronization in place to issue writes with multiple threads while >> still hitting the write pointer. > > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() > in order, no matter MQ or SQ. > > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync > among multiple queues. > > But per-zone re-order still can solve the issue, just need one lock > for each zone to cover the MQ re-order. That lock is already there and using it, mq-deadline will never dispatch more than one write per zone at any time. This is to avoid write reordering. So multi queue or not, for any zone, there is no possibility of having writes reordered. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 4:35 ` Damien Le Moal @ 2022-11-18 6:07 ` Ming Lei 2022-11-18 9:41 ` Andreas Hindborg 0 siblings, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-18 6:07 UTC (permalink / raw) To: Damien Le Moal; +Cc: Andreas Hindborg, Jens Axboe, linux-block, ming.lei On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: > On 11/18/22 13:12, Ming Lei wrote: > [...] > >>> You can only assign it to zoned write request, but you still have to check > >>> the sequence inside each zone, right? Then why not just check LBAs in > >>> each zone simply? > >> > >> We would need to know the zone map, which is not otherwise required. > >> Then we would need to track the write pointer for each open zone for > >> each queue, so that we can stall writes that are not issued at the write > >> pointer. This is in effect all zones, because we cannot track when zones > >> are implicitly closed. Then, if different queues are issuing writes to > > > > Can you explain "implicitly closed" state a bit? > > > > From https://zonedstorage.io/docs/introduction/zoned-storage, only the > > following words are mentioned about closed state: > > > > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the > > closed state using the CLOSE ZONE command.``` > > When a write is issued to an empty or closed zone, the drive will > automatically transition the zone into the implicit open state. This is > called implicit open because the host did not (explicitly) issue an open > zone command. > > When there are too many implicitly open zones, the drive may choose to > close one of the implicitly opened zone to implicitly open the zone that > is a target for a write command. > > Simple in a nutshell. This is done so that the drive can work with a > limited set of resources needed to handle open zones, that is, zones that > are being written. There are some more nasty details to all this with > limits on the number of open zones and active zones that a zoned drive may > have. OK, thanks for the clarification about implicitly closed, but I understand this close can't change the zone's write pointer. > > > > > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone > > info is the value), which can be implemented as one LRU style. If any zone > > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for > > obtaining the zone info. > > > >> the same zone, we need to sync across queues. Userspace may have > >> synchronization in place to issue writes with multiple threads while > >> still hitting the write pointer. > > > > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() > > in order, no matter MQ or SQ. > > > > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync > > among multiple queues. > > > > But per-zone re-order still can solve the issue, just need one lock > > for each zone to cover the MQ re-order. > > That lock is already there and using it, mq-deadline will never dispatch > more than one write per zone at any time. This is to avoid write > reordering. So multi queue or not, for any zone, there is no possibility > of having writes reordered. oops, I miss the single queue depth point per zone, so ublk won't break zoned write at all, and I agree order of batch IOs is one problem, but not hard to solve. Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 6:07 ` Ming Lei @ 2022-11-18 9:41 ` Andreas Hindborg 2022-11-18 11:28 ` Ming Lei 0 siblings, 1 reply; 27+ messages in thread From: Andreas Hindborg @ 2022-11-18 9:41 UTC (permalink / raw) To: Ming Lei; +Cc: Damien Le Moal, Andreas Hindborg, Jens Axboe, linux-block Ming Lei <ming.lei@redhat.com> writes: > CAUTION: This email originated from outside of Western Digital. Do not click on > links or open attachments unless you recognize the sender and know that the > content is safe. > > > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: >> On 11/18/22 13:12, Ming Lei wrote: >> [...] >> >>> You can only assign it to zoned write request, but you still have to check >> >>> the sequence inside each zone, right? Then why not just check LBAs in >> >>> each zone simply? >> >> >> >> We would need to know the zone map, which is not otherwise required. >> >> Then we would need to track the write pointer for each open zone for >> >> each queue, so that we can stall writes that are not issued at the write >> >> pointer. This is in effect all zones, because we cannot track when zones >> >> are implicitly closed. Then, if different queues are issuing writes to >> > >> > Can you explain "implicitly closed" state a bit? >> > >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the >> > following words are mentioned about closed state: >> > >> > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the >> > closed state using the CLOSE ZONE command.``` >> >> When a write is issued to an empty or closed zone, the drive will >> automatically transition the zone into the implicit open state. This is >> called implicit open because the host did not (explicitly) issue an open >> zone command. >> >> When there are too many implicitly open zones, the drive may choose to >> close one of the implicitly opened zone to implicitly open the zone that >> is a target for a write command. >> >> Simple in a nutshell. This is done so that the drive can work with a >> limited set of resources needed to handle open zones, that is, zones that >> are being written. There are some more nasty details to all this with >> limits on the number of open zones and active zones that a zoned drive may >> have. > > OK, thanks for the clarification about implicitly closed, but I > understand this close can't change the zone's write pointer. You are right, it does not matter if the zone is implicitly closed, I was mistaken. But we still have to track the write pointer of every zone in open or active state, otherwise we cannot know if a write that arrive to a zone with no outstanding IO is actually at the write pointer, or whether we need to hold it. > >> >> > >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone >> > info is the value), which can be implemented as one LRU style. If any zone >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for >> > obtaining the zone info. >> > >> >> the same zone, we need to sync across queues. Userspace may have >> >> synchronization in place to issue writes with multiple threads while >> >> still hitting the write pointer. >> > >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() >> > in order, no matter MQ or SQ. >> > >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync >> > among multiple queues. >> > >> > But per-zone re-order still can solve the issue, just need one lock >> > for each zone to cover the MQ re-order. >> >> That lock is already there and using it, mq-deadline will never dispatch >> more than one write per zone at any time. This is to avoid write >> reordering. So multi queue or not, for any zone, there is no possibility >> of having writes reordered. > > oops, I miss the single queue depth point per zone, so ublk won't break > zoned write at all, and I agree order of batch IOs is one problem, but > not hard to solve. The current implementation _does_ break zoned write because it reverses batched writes. But if it is an easy fix, that is cool :) Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 9:41 ` Andreas Hindborg @ 2022-11-18 11:28 ` Ming Lei 2022-11-18 11:49 ` Andreas Hindborg 0 siblings, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-18 11:28 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Damien Le Moal, Jens Axboe, linux-block, ming.lei On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > CAUTION: This email originated from outside of Western Digital. Do not click on > > links or open attachments unless you recognize the sender and know that the > > content is safe. > > > > > > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: > >> On 11/18/22 13:12, Ming Lei wrote: > >> [...] > >> >>> You can only assign it to zoned write request, but you still have to check > >> >>> the sequence inside each zone, right? Then why not just check LBAs in > >> >>> each zone simply? > >> >> > >> >> We would need to know the zone map, which is not otherwise required. > >> >> Then we would need to track the write pointer for each open zone for > >> >> each queue, so that we can stall writes that are not issued at the write > >> >> pointer. This is in effect all zones, because we cannot track when zones > >> >> are implicitly closed. Then, if different queues are issuing writes to > >> > > >> > Can you explain "implicitly closed" state a bit? > >> > > >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the > >> > following words are mentioned about closed state: > >> > > >> > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the > >> > closed state using the CLOSE ZONE command.``` > >> > >> When a write is issued to an empty or closed zone, the drive will > >> automatically transition the zone into the implicit open state. This is > >> called implicit open because the host did not (explicitly) issue an open > >> zone command. > >> > >> When there are too many implicitly open zones, the drive may choose to > >> close one of the implicitly opened zone to implicitly open the zone that > >> is a target for a write command. > >> > >> Simple in a nutshell. This is done so that the drive can work with a > >> limited set of resources needed to handle open zones, that is, zones that > >> are being written. There are some more nasty details to all this with > >> limits on the number of open zones and active zones that a zoned drive may > >> have. > > > > OK, thanks for the clarification about implicitly closed, but I > > understand this close can't change the zone's write pointer. > > You are right, it does not matter if the zone is implicitly closed, I > was mistaken. But we still have to track the write pointer of every zone > in open or active state, otherwise we cannot know if a write that arrive > to a zone with no outstanding IO is actually at the write pointer, or > whether we need to hold it. > > > > >> > >> > > >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone > >> > info is the value), which can be implemented as one LRU style. If any zone > >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for > >> > obtaining the zone info. > >> > > >> >> the same zone, we need to sync across queues. Userspace may have > >> >> synchronization in place to issue writes with multiple threads while > >> >> still hitting the write pointer. > >> > > >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() > >> > in order, no matter MQ or SQ. > >> > > >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync > >> > among multiple queues. > >> > > >> > But per-zone re-order still can solve the issue, just need one lock > >> > for each zone to cover the MQ re-order. > >> > >> That lock is already there and using it, mq-deadline will never dispatch > >> more than one write per zone at any time. This is to avoid write > >> reordering. So multi queue or not, for any zone, there is no possibility > >> of having writes reordered. > > > > oops, I miss the single queue depth point per zone, so ublk won't break > > zoned write at all, and I agree order of batch IOs is one problem, but > > not hard to solve. > > The current implementation _does_ break zoned write because it reverses > batched writes. But if it is an easy fix, that is cool :) Please look at Damien's comment: >> That lock is already there and using it, mq-deadline will never dispatch >> more than one write per zone at any time. This is to avoid write >> reordering. So multi queue or not, for any zone, there is no possibility >> of having writes reordered. For zoned write, mq-deadline is used to limit at most one inflight write for each zone. So can you explain a bit how the current implementation breaks zoned write? Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 11:28 ` Ming Lei @ 2022-11-18 11:49 ` Andreas Hindborg 2022-11-18 12:46 ` Andreas Hindborg 2022-11-18 12:47 ` Ming Lei 0 siblings, 2 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-18 11:49 UTC (permalink / raw) To: Ming Lei; +Cc: Andreas Hindborg, Damien Le Moal, Jens Axboe, linux-block Ming Lei <ming.lei@redhat.com> writes: > CAUTION: This email originated from outside of Western Digital. Do not click on > links or open attachments unless you recognize the sender and know that the > content is safe. > > > On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: >> >> Ming Lei <ming.lei@redhat.com> writes: >> >> > CAUTION: This email originated from outside of Western Digital. Do not click on >> > links or open attachments unless you recognize the sender and know that the >> > content is safe. >> > >> > >> > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: >> >> On 11/18/22 13:12, Ming Lei wrote: >> >> [...] >> >> >>> You can only assign it to zoned write request, but you still have to check >> >> >>> the sequence inside each zone, right? Then why not just check LBAs in >> >> >>> each zone simply? >> >> >> >> >> >> We would need to know the zone map, which is not otherwise required. >> >> >> Then we would need to track the write pointer for each open zone for >> >> >> each queue, so that we can stall writes that are not issued at the write >> >> >> pointer. This is in effect all zones, because we cannot track when zones >> >> >> are implicitly closed. Then, if different queues are issuing writes to >> >> > >> >> > Can you explain "implicitly closed" state a bit? >> >> > >> >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the >> >> > following words are mentioned about closed state: >> >> > >> >> > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the >> >> > closed state using the CLOSE ZONE command.``` >> >> >> >> When a write is issued to an empty or closed zone, the drive will >> >> automatically transition the zone into the implicit open state. This is >> >> called implicit open because the host did not (explicitly) issue an open >> >> zone command. >> >> >> >> When there are too many implicitly open zones, the drive may choose to >> >> close one of the implicitly opened zone to implicitly open the zone that >> >> is a target for a write command. >> >> >> >> Simple in a nutshell. This is done so that the drive can work with a >> >> limited set of resources needed to handle open zones, that is, zones that >> >> are being written. There are some more nasty details to all this with >> >> limits on the number of open zones and active zones that a zoned drive may >> >> have. >> > >> > OK, thanks for the clarification about implicitly closed, but I >> > understand this close can't change the zone's write pointer. >> >> You are right, it does not matter if the zone is implicitly closed, I >> was mistaken. But we still have to track the write pointer of every zone >> in open or active state, otherwise we cannot know if a write that arrive >> to a zone with no outstanding IO is actually at the write pointer, or >> whether we need to hold it. >> >> > >> >> >> >> > >> >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone >> >> > info is the value), which can be implemented as one LRU style. If any zone >> >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for >> >> > obtaining the zone info. >> >> > >> >> >> the same zone, we need to sync across queues. Userspace may have >> >> >> synchronization in place to issue writes with multiple threads while >> >> >> still hitting the write pointer. >> >> > >> >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() >> >> > in order, no matter MQ or SQ. >> >> > >> >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync >> >> > among multiple queues. >> >> > >> >> > But per-zone re-order still can solve the issue, just need one lock >> >> > for each zone to cover the MQ re-order. >> >> >> >> That lock is already there and using it, mq-deadline will never dispatch >> >> more than one write per zone at any time. This is to avoid write >> >> reordering. So multi queue or not, for any zone, there is no possibility >> >> of having writes reordered. >> > >> > oops, I miss the single queue depth point per zone, so ublk won't break >> > zoned write at all, and I agree order of batch IOs is one problem, but >> > not hard to solve. >> >> The current implementation _does_ break zoned write because it reverses >> batched writes. But if it is an easy fix, that is cool :) > > Please look at Damien's comment: > >>> That lock is already there and using it, mq-deadline will never dispatch >>> more than one write per zone at any time. This is to avoid write >>> reordering. So multi queue or not, for any zone, there is no possibility >>> of having writes reordered. > > For zoned write, mq-deadline is used to limit at most one inflight write > for each zone. > > So can you explain a bit how the current implementation breaks zoned > write? Like Damien wrote in another email, mq-deadline will only impose ordering for requests submitted in batch. The flow we have is the following: - Userspace sends requests to ublk gendisk - Requests go through block layer and is _not_ reordered when using mq-deadline. They may be split. - Requests hit ublk_drv and ublk_drv will reverse order of _all_ batched up requests (including split requests). - ublk_drv sends request to ublksrv in _reverse_ order. - ublksrv sends requests _not_ batched up to target device. - Requests that enter mq-deadline at the same time are reordered in LBA order, that is all good. - Requests that enter the kernel in different batches are not reordered in LBA order and end up missing the write pointer. This is bad. So, ublk_drv is not functional for zoned storage as is. Either we have to fix up the ordering in userspace in ublksrv, and that _will_ have a performance impact. Or we fix the bug in ublk_drv that causes batched requests to be _reversed_. Thanks, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 11:49 ` Andreas Hindborg @ 2022-11-18 12:46 ` Andreas Hindborg 2022-11-18 12:47 ` Ming Lei 1 sibling, 0 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-18 12:46 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Ming Lei, Damien Le Moal, Jens Axboe, linux-block Andreas Hindborg <andreas.hindborg@wdc.com> writes: [...] >>> > >>> > oops, I miss the single queue depth point per zone, so ublk won't break >>> > zoned write at all, and I agree order of batch IOs is one problem, but >>> > not hard to solve. >>> >>> The current implementation _does_ break zoned write because it reverses >>> batched writes. But if it is an easy fix, that is cool :) >> >> Please look at Damien's comment: >> >>>> That lock is already there and using it, mq-deadline will never dispatch >>>> more than one write per zone at any time. This is to avoid write >>>> reordering. So multi queue or not, for any zone, there is no possibility >>>> of having writes reordered. >> >> For zoned write, mq-deadline is used to limit at most one inflight write >> for each zone. >> >> So can you explain a bit how the current implementation breaks zoned >> write? > > Like Damien wrote in another email, mq-deadline will only impose > ordering for requests submitted in batch. The flow we have is the > following: > > - Userspace sends requests to ublk gendisk > - Requests go through block layer and is _not_ reordered when using > mq-deadline. They may be split. > - Requests hit ublk_drv and ublk_drv will reverse order of _all_ > batched up requests (including split requests). > - ublk_drv sends request to ublksrv in _reverse_ order. > - ublksrv sends requests _not_ batched up to target device. > - Requests that enter mq-deadline at the same time are reordered in LBA > order, that is all good. > - Requests that enter the kernel in different batches are not reordered > in LBA order and end up missing the write pointer. This is bad. > > So, ublk_drv is not functional for zoned storage as is. Either we have > to fix up the ordering in userspace in ublksrv, and that _will_ have a > performance impact. Or we fix the bug in ublk_drv that causes batched > requests to be _reversed_. Here is a suggestion for a fix. It needs work, but it illustrates the idea. From 48f54a2a83daf19dda3c928e6518ce4a3e443fcd Mon Sep 17 00:00:00 2001 From: Andreas Hindborg <andreas.hindborg@wdc.com> Date: Fri, 18 Nov 2022 13:44:45 +0100 Subject: [PATCH] wip: Do not reorder requests in ublk --- drivers/block/ublk_drv.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 6a4a94b4cdf4..4fb5ccd01202 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -9,6 +9,7 @@ * * (part of code stolen from loop.c) */ +#include <linux/llist.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/sched.h> @@ -55,6 +56,7 @@ #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD) struct ublk_rq_data { + struct llist_node llnode; struct callback_head work; }; @@ -121,6 +123,7 @@ struct ublk_queue { unsigned int max_io_sz; bool abort_work_pending; unsigned short nr_io_ready; /* how many ios setup */ + struct llist_head pdu_queue; struct ublk_device *dev; struct ublk_io ios[0]; }; @@ -724,8 +727,15 @@ static void ublk_rq_task_work_fn(struct callback_head *work) struct ublk_rq_data *data = container_of(work, struct ublk_rq_data, work); struct request *req = blk_mq_rq_from_pdu(data); + struct ublk_queue *ubq = req->mq_hctx->driver_data; - __ublk_rq_task_work(req); + /* Some times this list is empty, but that is OK */ + struct llist_node *head = llist_del_all(&ubq->pdu_queue); + head = llist_reverse_order(head); + llist_for_each_entry(data, head, llnode) { + req = blk_mq_rq_from_pdu(data); + __ublk_rq_task_work(req); + } } static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, @@ -753,6 +763,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx, enum task_work_notify_mode notify_mode = bd->last ? TWA_SIGNAL_NO_IPI : TWA_NONE; + llist_add(&data->llnode, &ubq->pdu_queue); if (task_work_add(ubq->ubq_daemon, &data->work, notify_mode)) goto fail; } else { @@ -1170,6 +1181,9 @@ static int ublk_init_queue(struct ublk_device *ub, int q_id) ubq->io_cmd_buf = ptr; ubq->dev = ub; + + init_llist_head(&ubq->pdu_queue); + return 0; } -- 2.38.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 11:49 ` Andreas Hindborg 2022-11-18 12:46 ` Andreas Hindborg @ 2022-11-18 12:47 ` Ming Lei 2022-11-19 0:24 ` Damien Le Moal 1 sibling, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-18 12:47 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Damien Le Moal, Jens Axboe, linux-block, ming.lei On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > CAUTION: This email originated from outside of Western Digital. Do not click on > > links or open attachments unless you recognize the sender and know that the > > content is safe. > > > > > > On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: > >> > >> Ming Lei <ming.lei@redhat.com> writes: > >> > >> > CAUTION: This email originated from outside of Western Digital. Do not click on > >> > links or open attachments unless you recognize the sender and know that the > >> > content is safe. > >> > > >> > > >> > On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: > >> >> On 11/18/22 13:12, Ming Lei wrote: > >> >> [...] > >> >> >>> You can only assign it to zoned write request, but you still have to check > >> >> >>> the sequence inside each zone, right? Then why not just check LBAs in > >> >> >>> each zone simply? > >> >> >> > >> >> >> We would need to know the zone map, which is not otherwise required. > >> >> >> Then we would need to track the write pointer for each open zone for > >> >> >> each queue, so that we can stall writes that are not issued at the write > >> >> >> pointer. This is in effect all zones, because we cannot track when zones > >> >> >> are implicitly closed. Then, if different queues are issuing writes to > >> >> > > >> >> > Can you explain "implicitly closed" state a bit? > >> >> > > >> >> > From https://zonedstorage.io/docs/introduction/zoned-storage, only the > >> >> > following words are mentioned about closed state: > >> >> > > >> >> > ```Conversely, implicitly or explicitly opened zoned can be transitioned to the > >> >> > closed state using the CLOSE ZONE command.``` > >> >> > >> >> When a write is issued to an empty or closed zone, the drive will > >> >> automatically transition the zone into the implicit open state. This is > >> >> called implicit open because the host did not (explicitly) issue an open > >> >> zone command. > >> >> > >> >> When there are too many implicitly open zones, the drive may choose to > >> >> close one of the implicitly opened zone to implicitly open the zone that > >> >> is a target for a write command. > >> >> > >> >> Simple in a nutshell. This is done so that the drive can work with a > >> >> limited set of resources needed to handle open zones, that is, zones that > >> >> are being written. There are some more nasty details to all this with > >> >> limits on the number of open zones and active zones that a zoned drive may > >> >> have. > >> > > >> > OK, thanks for the clarification about implicitly closed, but I > >> > understand this close can't change the zone's write pointer. > >> > >> You are right, it does not matter if the zone is implicitly closed, I > >> was mistaken. But we still have to track the write pointer of every zone > >> in open or active state, otherwise we cannot know if a write that arrive > >> to a zone with no outstanding IO is actually at the write pointer, or > >> whether we need to hold it. > >> > >> > > >> >> > >> >> > > >> >> > zone info can be cached in the mapping(hash table)(zone sector is the key, and zone > >> >> > info is the value), which can be implemented as one LRU style. If any zone > >> >> > info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for > >> >> > obtaining the zone info. > >> >> > > >> >> >> the same zone, we need to sync across queues. Userspace may have > >> >> >> synchronization in place to issue writes with multiple threads while > >> >> >> still hitting the write pointer. > >> >> > > >> >> > You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() > >> >> > in order, no matter MQ or SQ. > >> >> > > >> >> > Yes, it could be issue from multiple queues for ublksrv, which doesn't sync > >> >> > among multiple queues. > >> >> > > >> >> > But per-zone re-order still can solve the issue, just need one lock > >> >> > for each zone to cover the MQ re-order. > >> >> > >> >> That lock is already there and using it, mq-deadline will never dispatch > >> >> more than one write per zone at any time. This is to avoid write > >> >> reordering. So multi queue or not, for any zone, there is no possibility > >> >> of having writes reordered. > >> > > >> > oops, I miss the single queue depth point per zone, so ublk won't break > >> > zoned write at all, and I agree order of batch IOs is one problem, but > >> > not hard to solve. > >> > >> The current implementation _does_ break zoned write because it reverses > >> batched writes. But if it is an easy fix, that is cool :) > > > > Please look at Damien's comment: > > > >>> That lock is already there and using it, mq-deadline will never dispatch > >>> more than one write per zone at any time. This is to avoid write > >>> reordering. So multi queue or not, for any zone, there is no possibility > >>> of having writes reordered. > > > > For zoned write, mq-deadline is used to limit at most one inflight write > > for each zone. > > > > So can you explain a bit how the current implementation breaks zoned > > write? > > Like Damien wrote in another email, mq-deadline will only impose > ordering for requests submitted in batch. The flow we have is the > following: > > - Userspace sends requests to ublk gendisk > - Requests go through block layer and is _not_ reordered when using > mq-deadline. They may be split. > - Requests hit ublk_drv and ublk_drv will reverse order of _all_ > batched up requests (including split requests). For ublk-zone, ublk driver needs to be exposed as zoned device by calling disk_set_zoned() finally, which definitely isn't supported now, so mq-deadline at most sends one write IO for each zone after ublk-zone is supported, see blk_req_can_dispatch_to_zone(). > - ublk_drv sends request to ublksrv in _reverse_ order. > - ublksrv sends requests _not_ batched up to target device. > - Requests that enter mq-deadline at the same time are reordered in LBA > order, that is all good. > - Requests that enter the kernel in different batches are not reordered > in LBA order and end up missing the write pointer. This is bad. Again, please read Damien's comment: >> That lock is already there and using it, mq-deadline will never dispatch >> more than one write per zone at any time. Anytime, there is at most one write IO for each zone, how can the single write IO be re-order? Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-18 12:47 ` Ming Lei @ 2022-11-19 0:24 ` Damien Le Moal 2022-11-19 7:36 ` Andreas Hindborg ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Damien Le Moal @ 2022-11-19 0:24 UTC (permalink / raw) To: Ming Lei, Andreas Hindborg; +Cc: Jens Axboe, linux-block On 11/18/22 21:47, Ming Lei wrote: > On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote: >> >> Ming Lei <ming.lei@redhat.com> writes: >> >>> CAUTION: This email originated from outside of Western Digital. Do not click on >>> links or open attachments unless you recognize the sender and know that the >>> content is safe. >>> >>> >>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: >>>> >>>> Ming Lei <ming.lei@redhat.com> writes: >>>> >>>>> CAUTION: This email originated from outside of Western Digital. Do not click on >>>>> links or open attachments unless you recognize the sender and know that the >>>>> content is safe. >>>>> >>>>> >>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: >>>>>> On 11/18/22 13:12, Ming Lei wrote: >>>>>> [...] >>>>>>>>> You can only assign it to zoned write request, but you still have to check >>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in >>>>>>>>> each zone simply? >>>>>>>> >>>>>>>> We would need to know the zone map, which is not otherwise required. >>>>>>>> Then we would need to track the write pointer for each open zone for >>>>>>>> each queue, so that we can stall writes that are not issued at the write >>>>>>>> pointer. This is in effect all zones, because we cannot track when zones >>>>>>>> are implicitly closed. Then, if different queues are issuing writes to >>>>>>> >>>>>>> Can you explain "implicitly closed" state a bit? >>>>>>> >>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the >>>>>>> following words are mentioned about closed state: >>>>>>> >>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the >>>>>>> closed state using the CLOSE ZONE command.``` >>>>>> >>>>>> When a write is issued to an empty or closed zone, the drive will >>>>>> automatically transition the zone into the implicit open state. This is >>>>>> called implicit open because the host did not (explicitly) issue an open >>>>>> zone command. >>>>>> >>>>>> When there are too many implicitly open zones, the drive may choose to >>>>>> close one of the implicitly opened zone to implicitly open the zone that >>>>>> is a target for a write command. >>>>>> >>>>>> Simple in a nutshell. This is done so that the drive can work with a >>>>>> limited set of resources needed to handle open zones, that is, zones that >>>>>> are being written. There are some more nasty details to all this with >>>>>> limits on the number of open zones and active zones that a zoned drive may >>>>>> have. >>>>> >>>>> OK, thanks for the clarification about implicitly closed, but I >>>>> understand this close can't change the zone's write pointer. >>>> >>>> You are right, it does not matter if the zone is implicitly closed, I >>>> was mistaken. But we still have to track the write pointer of every zone >>>> in open or active state, otherwise we cannot know if a write that arrive >>>> to a zone with no outstanding IO is actually at the write pointer, or >>>> whether we need to hold it. >>>> >>>>> >>>>>> >>>>>>> >>>>>>> zone info can be cached in the mapping(hash table)(zone sector is the key, and zone >>>>>>> info is the value), which can be implemented as one LRU style. If any zone >>>>>>> info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for >>>>>>> obtaining the zone info. >>>>>>> >>>>>>>> the same zone, we need to sync across queues. Userspace may have >>>>>>>> synchronization in place to issue writes with multiple threads while >>>>>>>> still hitting the write pointer. >>>>>>> >>>>>>> You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() >>>>>>> in order, no matter MQ or SQ. >>>>>>> >>>>>>> Yes, it could be issue from multiple queues for ublksrv, which doesn't sync >>>>>>> among multiple queues. >>>>>>> >>>>>>> But per-zone re-order still can solve the issue, just need one lock >>>>>>> for each zone to cover the MQ re-order. >>>>>> >>>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>>> more than one write per zone at any time. This is to avoid write >>>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>>> of having writes reordered. >>>>> >>>>> oops, I miss the single queue depth point per zone, so ublk won't break >>>>> zoned write at all, and I agree order of batch IOs is one problem, but >>>>> not hard to solve. >>>> >>>> The current implementation _does_ break zoned write because it reverses >>>> batched writes. But if it is an easy fix, that is cool :) >>> >>> Please look at Damien's comment: >>> >>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>> more than one write per zone at any time. This is to avoid write >>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>> of having writes reordered. >>> >>> For zoned write, mq-deadline is used to limit at most one inflight write >>> for each zone. >>> >>> So can you explain a bit how the current implementation breaks zoned >>> write? >> >> Like Damien wrote in another email, mq-deadline will only impose >> ordering for requests submitted in batch. The flow we have is the >> following: >> >> - Userspace sends requests to ublk gendisk >> - Requests go through block layer and is _not_ reordered when using >> mq-deadline. They may be split. >> - Requests hit ublk_drv and ublk_drv will reverse order of _all_ >> batched up requests (including split requests). > > For ublk-zone, ublk driver needs to be exposed as zoned device by > calling disk_set_zoned() finally, which definitely isn't supported now, > so mq-deadline at most sends one write IO for each zone after ublk-zone > is supported, see blk_req_can_dispatch_to_zone(). > >> - ublk_drv sends request to ublksrv in _reverse_ order. >> - ublksrv sends requests _not_ batched up to target device. >> - Requests that enter mq-deadline at the same time are reordered in LBA >> order, that is all good. >> - Requests that enter the kernel in different batches are not reordered >> in LBA order and end up missing the write pointer. This is bad. > > Again, please read Damien's comment: > >>> That lock is already there and using it, mq-deadline will never dispatch >>> more than one write per zone at any time. > > Anytime, there is at most one write IO for each zone, how can the single > write IO be re-order? If the user issues writes one at a time out of order (not aligned to the write pointer), mq-deadline will not help at all. The zone write locking will still limit write dispatching to one per zone, but the writes will fail. mq-deadline will reorder write commands in the correct lba order only if: - the commands are inserted as a batch (more than on request passed to ->insert_requests) - commands are inserted individually when the target zone is locked (a write is already being executed) This has been the semantic from the start: the block layer has no guarantees about the correct ordering of writes to zoned drive. What is guaranteed is that (1) if the user issues writes in order AND (2) mq-deadline is used, then writes will be dispatched in the same order to the device. I have not looked at the details of ublk, but from the thread, I think (1) is not done and (2) is missing-ish as the ublk device is not marked as zoned. > > > Thanks, > Ming > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-19 0:24 ` Damien Le Moal @ 2022-11-19 7:36 ` Andreas Hindborg 2022-11-21 10:15 ` Andreas Hindborg 2022-11-20 14:37 ` Ming Lei 2022-11-21 8:03 ` Christoph Hellwig 2 siblings, 1 reply; 27+ messages in thread From: Andreas Hindborg @ 2022-11-19 7:36 UTC (permalink / raw) To: Damien Le Moal; +Cc: Ming Lei, Andreas Hindborg, Jens Axboe, linux-block Damien Le Moal <damien.lemoal@opensource.wdc.com> writes: > On 11/18/22 21:47, Ming Lei wrote: >> On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote: >>> >>> Ming Lei <ming.lei@redhat.com> writes: >>> >>>> CAUTION: This email originated from outside of Western Digital. Do not click on >>>> links or open attachments unless you recognize the sender and know that the >>>> content is safe. >>>> >>>> >>>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: >>>>> >>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>> >>>>>> CAUTION: This email originated from outside of Western Digital. Do not click on >>>>>> links or open attachments unless you recognize the sender and know that the >>>>>> content is safe. >>>>>> >>>>>> >>>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: >>>>>>> On 11/18/22 13:12, Ming Lei wrote: >>>>>>> [...] >>>>>>>>>> You can only assign it to zoned write request, but you still have to check >>>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in >>>>>>>>>> each zone simply? >>>>>>>>> >>>>>>>>> We would need to know the zone map, which is not otherwise required. >>>>>>>>> Then we would need to track the write pointer for each open zone for >>>>>>>>> each queue, so that we can stall writes that are not issued at the write >>>>>>>>> pointer. This is in effect all zones, because we cannot track when zones >>>>>>>>> are implicitly closed. Then, if different queues are issuing writes to >>>>>>>> >>>>>>>> Can you explain "implicitly closed" state a bit? >>>>>>>> >>>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the >>>>>>>> following words are mentioned about closed state: >>>>>>>> >>>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the >>>>>>>> closed state using the CLOSE ZONE command.``` >>>>>>> >>>>>>> When a write is issued to an empty or closed zone, the drive will >>>>>>> automatically transition the zone into the implicit open state. This is >>>>>>> called implicit open because the host did not (explicitly) issue an open >>>>>>> zone command. >>>>>>> >>>>>>> When there are too many implicitly open zones, the drive may choose to >>>>>>> close one of the implicitly opened zone to implicitly open the zone that >>>>>>> is a target for a write command. >>>>>>> >>>>>>> Simple in a nutshell. This is done so that the drive can work with a >>>>>>> limited set of resources needed to handle open zones, that is, zones that >>>>>>> are being written. There are some more nasty details to all this with >>>>>>> limits on the number of open zones and active zones that a zoned drive may >>>>>>> have. >>>>>> >>>>>> OK, thanks for the clarification about implicitly closed, but I >>>>>> understand this close can't change the zone's write pointer. >>>>> >>>>> You are right, it does not matter if the zone is implicitly closed, I >>>>> was mistaken. But we still have to track the write pointer of every zone >>>>> in open or active state, otherwise we cannot know if a write that arrive >>>>> to a zone with no outstanding IO is actually at the write pointer, or >>>>> whether we need to hold it. >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>>>>> zone info can be cached in the mapping(hash table)(zone sector is the key, and zone >>>>>>>> info is the value), which can be implemented as one LRU style. If any zone >>>>>>>> info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for >>>>>>>> obtaining the zone info. >>>>>>>> >>>>>>>>> the same zone, we need to sync across queues. Userspace may have >>>>>>>>> synchronization in place to issue writes with multiple threads while >>>>>>>>> still hitting the write pointer. >>>>>>>> >>>>>>>> You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() >>>>>>>> in order, no matter MQ or SQ. >>>>>>>> >>>>>>>> Yes, it could be issue from multiple queues for ublksrv, which doesn't sync >>>>>>>> among multiple queues. >>>>>>>> >>>>>>>> But per-zone re-order still can solve the issue, just need one lock >>>>>>>> for each zone to cover the MQ re-order. >>>>>>> >>>>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>>>> more than one write per zone at any time. This is to avoid write >>>>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>>>> of having writes reordered. >>>>>> >>>>>> oops, I miss the single queue depth point per zone, so ublk won't break >>>>>> zoned write at all, and I agree order of batch IOs is one problem, but >>>>>> not hard to solve. >>>>> >>>>> The current implementation _does_ break zoned write because it reverses >>>>> batched writes. But if it is an easy fix, that is cool :) >>>> >>>> Please look at Damien's comment: >>>> >>>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>>> more than one write per zone at any time. This is to avoid write >>>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>>> of having writes reordered. >>>> >>>> For zoned write, mq-deadline is used to limit at most one inflight write >>>> for each zone. >>>> >>>> So can you explain a bit how the current implementation breaks zoned >>>> write? >>> >>> Like Damien wrote in another email, mq-deadline will only impose >>> ordering for requests submitted in batch. The flow we have is the >>> following: >>> >>> - Userspace sends requests to ublk gendisk >>> - Requests go through block layer and is _not_ reordered when using >>> mq-deadline. They may be split. >>> - Requests hit ublk_drv and ublk_drv will reverse order of _all_ >>> batched up requests (including split requests). >> >> For ublk-zone, ublk driver needs to be exposed as zoned device by >> calling disk_set_zoned() finally, which definitely isn't supported now, >> so mq-deadline at most sends one write IO for each zone after ublk-zone >> is supported, see blk_req_can_dispatch_to_zone(). >> >>> - ublk_drv sends request to ublksrv in _reverse_ order. >>> - ublksrv sends requests _not_ batched up to target device. >>> - Requests that enter mq-deadline at the same time are reordered in LBA >>> order, that is all good. >>> - Requests that enter the kernel in different batches are not reordered >>> in LBA order and end up missing the write pointer. This is bad. >> >> Again, please read Damien's comment: >> >>>> That lock is already there and using it, mq-deadline will never dispatch >>>> more than one write per zone at any time. >> >> Anytime, there is at most one write IO for each zone, how can the single >> write IO be re-order? > > If the user issues writes one at a time out of order (not aligned to the > write pointer), mq-deadline will not help at all. The zone write locking > will still limit write dispatching to one per zone, but the writes will fail. > > mq-deadline will reorder write commands in the correct lba order only if: > - the commands are inserted as a batch (more than on request passed to > ->insert_requests) > - commands are inserted individually when the target zone is locked (a > write is already being executed) > > This has been the semantic from the start: the block layer has no > guarantees about the correct ordering of writes to zoned drive. What is > guaranteed is that (1) if the user issues writes in order AND (2) > mq-deadline is used, then writes will be dispatched in the same order to > the device. > > I have not looked at the details of ublk, but from the thread, I think (1) > is not done and (2) is missing-ish as the ublk device is not marked as zoned. I have a patch in the works for adding zoned storage support to ublk. It sets up the ublk device as a zoned device. It is very much work in progress, but it lives here [1] for now. I am pretty sure that I saw large writes to zoned ublk device being split and issued to the device (same zone) with multiple outstanding requests at the same time. I'll verify on Monday and provide a test case if that is the case. Might be I configured the ublk device wrong? I set it up as host managed zoned and set up zone size, max active, max open. Best regards, Andreas [1] https://github.com/metaspace/linux/tree/ublk-zoned ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-19 7:36 ` Andreas Hindborg @ 2022-11-21 10:15 ` Andreas Hindborg 0 siblings, 0 replies; 27+ messages in thread From: Andreas Hindborg @ 2022-11-21 10:15 UTC (permalink / raw) To: Andreas Hindborg; +Cc: Damien Le Moal, Ming Lei, Jens Axboe, linux-block Andreas Hindborg <andreas.hindborg@wdc.com> writes: > Damien Le Moal <damien.lemoal@opensource.wdc.com> writes: > >> On 11/18/22 21:47, Ming Lei wrote: >>> Anytime, there is at most one write IO for each zone, how can the single >>> write IO be re-order? >> >> If the user issues writes one at a time out of order (not aligned to the >> write pointer), mq-deadline will not help at all. The zone write locking >> will still limit write dispatching to one per zone, but the writes will fail. >> >> mq-deadline will reorder write commands in the correct lba order only if: >> - the commands are inserted as a batch (more than on request passed to >> ->insert_requests) >> - commands are inserted individually when the target zone is locked (a >> write is already being executed) >> >> This has been the semantic from the start: the block layer has no >> guarantees about the correct ordering of writes to zoned drive. What is >> guaranteed is that (1) if the user issues writes in order AND (2) >> mq-deadline is used, then writes will be dispatched in the same order to >> the device. >> >> I have not looked at the details of ublk, but from the thread, I think (1) >> is not done and (2) is missing-ish as the ublk device is not marked as zoned. > > I have a patch in the works for adding zoned storage support to ublk. It > sets up the ublk device as a zoned device. It is very much work in > progress, but it lives here [1] for now. > > I am pretty sure that I saw large writes to zoned ublk device being > split and issued to the device (same zone) with multiple outstanding > requests at the same time. I'll verify on Monday and provide a test case > if that is the case. Might be I configured the ublk device wrong? I set > it up as host managed zoned and set up zone size, max active, max open. Turns out I was not setting up the zone bitmaps. I was missing a call to blk_revalidate_disk_zones() and that took out the zone lock in the deadline scheduler. With that in place, everything works as it should. Thanks for the help everyone! Best regards, Andreas ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-19 0:24 ` Damien Le Moal 2022-11-19 7:36 ` Andreas Hindborg @ 2022-11-20 14:37 ` Ming Lei 2022-11-21 1:25 ` Damien Le Moal 2022-11-21 8:03 ` Christoph Hellwig 2 siblings, 1 reply; 27+ messages in thread From: Ming Lei @ 2022-11-20 14:37 UTC (permalink / raw) To: Damien Le Moal; +Cc: Andreas Hindborg, Jens Axboe, linux-block, ming.lei On Sat, Nov 19, 2022 at 09:24:01AM +0900, Damien Le Moal wrote: > On 11/18/22 21:47, Ming Lei wrote: > > On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote: > >> > >> Ming Lei <ming.lei@redhat.com> writes: > >> > >>> CAUTION: This email originated from outside of Western Digital. Do not click on > >>> links or open attachments unless you recognize the sender and know that the > >>> content is safe. > >>> > >>> > >>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: > >>>> > >>>> Ming Lei <ming.lei@redhat.com> writes: > >>>> > >>>>> CAUTION: This email originated from outside of Western Digital. Do not click on > >>>>> links or open attachments unless you recognize the sender and know that the > >>>>> content is safe. > >>>>> > >>>>> > >>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: > >>>>>> On 11/18/22 13:12, Ming Lei wrote: > >>>>>> [...] > >>>>>>>>> You can only assign it to zoned write request, but you still have to check > >>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in > >>>>>>>>> each zone simply? > >>>>>>>> > >>>>>>>> We would need to know the zone map, which is not otherwise required. > >>>>>>>> Then we would need to track the write pointer for each open zone for > >>>>>>>> each queue, so that we can stall writes that are not issued at the write > >>>>>>>> pointer. This is in effect all zones, because we cannot track when zones > >>>>>>>> are implicitly closed. Then, if different queues are issuing writes to > >>>>>>> > >>>>>>> Can you explain "implicitly closed" state a bit? > >>>>>>> > >>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the > >>>>>>> following words are mentioned about closed state: > >>>>>>> > >>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the > >>>>>>> closed state using the CLOSE ZONE command.``` > >>>>>> > >>>>>> When a write is issued to an empty or closed zone, the drive will > >>>>>> automatically transition the zone into the implicit open state. This is > >>>>>> called implicit open because the host did not (explicitly) issue an open > >>>>>> zone command. > >>>>>> > >>>>>> When there are too many implicitly open zones, the drive may choose to > >>>>>> close one of the implicitly opened zone to implicitly open the zone that > >>>>>> is a target for a write command. > >>>>>> > >>>>>> Simple in a nutshell. This is done so that the drive can work with a > >>>>>> limited set of resources needed to handle open zones, that is, zones that > >>>>>> are being written. There are some more nasty details to all this with > >>>>>> limits on the number of open zones and active zones that a zoned drive may > >>>>>> have. > >>>>> > >>>>> OK, thanks for the clarification about implicitly closed, but I > >>>>> understand this close can't change the zone's write pointer. > >>>> > >>>> You are right, it does not matter if the zone is implicitly closed, I > >>>> was mistaken. But we still have to track the write pointer of every zone > >>>> in open or active state, otherwise we cannot know if a write that arrive > >>>> to a zone with no outstanding IO is actually at the write pointer, or > >>>> whether we need to hold it. > >>>> > >>>>> > >>>>>> > >>>>>>> > >>>>>>> zone info can be cached in the mapping(hash table)(zone sector is the key, and zone > >>>>>>> info is the value), which can be implemented as one LRU style. If any zone > >>>>>>> info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for > >>>>>>> obtaining the zone info. > >>>>>>> > >>>>>>>> the same zone, we need to sync across queues. Userspace may have > >>>>>>>> synchronization in place to issue writes with multiple threads while > >>>>>>>> still hitting the write pointer. > >>>>>>> > >>>>>>> You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() > >>>>>>> in order, no matter MQ or SQ. > >>>>>>> > >>>>>>> Yes, it could be issue from multiple queues for ublksrv, which doesn't sync > >>>>>>> among multiple queues. > >>>>>>> > >>>>>>> But per-zone re-order still can solve the issue, just need one lock > >>>>>>> for each zone to cover the MQ re-order. > >>>>>> > >>>>>> That lock is already there and using it, mq-deadline will never dispatch > >>>>>> more than one write per zone at any time. This is to avoid write > >>>>>> reordering. So multi queue or not, for any zone, there is no possibility > >>>>>> of having writes reordered. > >>>>> > >>>>> oops, I miss the single queue depth point per zone, so ublk won't break > >>>>> zoned write at all, and I agree order of batch IOs is one problem, but > >>>>> not hard to solve. > >>>> > >>>> The current implementation _does_ break zoned write because it reverses > >>>> batched writes. But if it is an easy fix, that is cool :) > >>> > >>> Please look at Damien's comment: > >>> > >>>>> That lock is already there and using it, mq-deadline will never dispatch > >>>>> more than one write per zone at any time. This is to avoid write > >>>>> reordering. So multi queue or not, for any zone, there is no possibility > >>>>> of having writes reordered. > >>> > >>> For zoned write, mq-deadline is used to limit at most one inflight write > >>> for each zone. > >>> > >>> So can you explain a bit how the current implementation breaks zoned > >>> write? > >> > >> Like Damien wrote in another email, mq-deadline will only impose > >> ordering for requests submitted in batch. The flow we have is the > >> following: > >> > >> - Userspace sends requests to ublk gendisk > >> - Requests go through block layer and is _not_ reordered when using > >> mq-deadline. They may be split. > >> - Requests hit ublk_drv and ublk_drv will reverse order of _all_ > >> batched up requests (including split requests). > > > > For ublk-zone, ublk driver needs to be exposed as zoned device by > > calling disk_set_zoned() finally, which definitely isn't supported now, > > so mq-deadline at most sends one write IO for each zone after ublk-zone > > is supported, see blk_req_can_dispatch_to_zone(). > > > >> - ublk_drv sends request to ublksrv in _reverse_ order. > >> - ublksrv sends requests _not_ batched up to target device. > >> - Requests that enter mq-deadline at the same time are reordered in LBA > >> order, that is all good. > >> - Requests that enter the kernel in different batches are not reordered > >> in LBA order and end up missing the write pointer. This is bad. > > > > Again, please read Damien's comment: > > > >>> That lock is already there and using it, mq-deadline will never dispatch > >>> more than one write per zone at any time. > > > > Anytime, there is at most one write IO for each zone, how can the single > > write IO be re-order? > > If the user issues writes one at a time out of order (not aligned to the > write pointer), mq-deadline will not help at all. The zone write locking > will still limit write dispatching to one per zone, but the writes will fail. > > mq-deadline will reorder write commands in the correct lba order only if: > - the commands are inserted as a batch (more than on request passed to > ->insert_requests) > - commands are inserted individually when the target zone is locked (a > write is already being executed) > > This has been the semantic from the start: the block layer has no > guarantees about the correct ordering of writes to zoned drive. What is > guaranteed is that (1) if the user issues writes in order AND (2) > mq-deadline is used, then writes will be dispatched in the same order to > the device. > > I have not looked at the details of ublk, but from the thread, I think (1) > is not done and (2) is missing-ish as the ublk device is not marked as zoned. (2) is supposed to be done for support ublk-zone, at least in the Andreas's PR, which just implements one ublk-loop over one real backing zoned disk. mq-deadline dispatches one write IO for each zone on /dev/ublkbN( supposed to be exposed as zoned, not done yet), and this write IO will be forwarded to ublksrv, so ublksrv can only see one such write IO for each zone, which will be re-issued to the backing real zoned disk, so there can't be the write io reorder issue, cause the write batch just has one write IO for each zone. [1] https://github.com/ming1/ubdsrv/pull/28/files Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-20 14:37 ` Ming Lei @ 2022-11-21 1:25 ` Damien Le Moal 0 siblings, 0 replies; 27+ messages in thread From: Damien Le Moal @ 2022-11-21 1:25 UTC (permalink / raw) To: Ming Lei; +Cc: Andreas Hindborg, Jens Axboe, linux-block On 11/20/22 23:37, Ming Lei wrote: > On Sat, Nov 19, 2022 at 09:24:01AM +0900, Damien Le Moal wrote: >> On 11/18/22 21:47, Ming Lei wrote: >>> On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote: >>>> >>>> Ming Lei <ming.lei@redhat.com> writes: >>>> >>>>> CAUTION: This email originated from outside of Western Digital. Do not click on >>>>> links or open attachments unless you recognize the sender and know that the >>>>> content is safe. >>>>> >>>>> >>>>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote: >>>>>> >>>>>> Ming Lei <ming.lei@redhat.com> writes: >>>>>> >>>>>>> CAUTION: This email originated from outside of Western Digital. Do not click on >>>>>>> links or open attachments unless you recognize the sender and know that the >>>>>>> content is safe. >>>>>>> >>>>>>> >>>>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote: >>>>>>>> On 11/18/22 13:12, Ming Lei wrote: >>>>>>>> [...] >>>>>>>>>>> You can only assign it to zoned write request, but you still have to check >>>>>>>>>>> the sequence inside each zone, right? Then why not just check LBAs in >>>>>>>>>>> each zone simply? >>>>>>>>>> >>>>>>>>>> We would need to know the zone map, which is not otherwise required. >>>>>>>>>> Then we would need to track the write pointer for each open zone for >>>>>>>>>> each queue, so that we can stall writes that are not issued at the write >>>>>>>>>> pointer. This is in effect all zones, because we cannot track when zones >>>>>>>>>> are implicitly closed. Then, if different queues are issuing writes to >>>>>>>>> >>>>>>>>> Can you explain "implicitly closed" state a bit? >>>>>>>>> >>>>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the >>>>>>>>> following words are mentioned about closed state: >>>>>>>>> >>>>>>>>> ```Conversely, implicitly or explicitly opened zoned can be transitioned to the >>>>>>>>> closed state using the CLOSE ZONE command.``` >>>>>>>> >>>>>>>> When a write is issued to an empty or closed zone, the drive will >>>>>>>> automatically transition the zone into the implicit open state. This is >>>>>>>> called implicit open because the host did not (explicitly) issue an open >>>>>>>> zone command. >>>>>>>> >>>>>>>> When there are too many implicitly open zones, the drive may choose to >>>>>>>> close one of the implicitly opened zone to implicitly open the zone that >>>>>>>> is a target for a write command. >>>>>>>> >>>>>>>> Simple in a nutshell. This is done so that the drive can work with a >>>>>>>> limited set of resources needed to handle open zones, that is, zones that >>>>>>>> are being written. There are some more nasty details to all this with >>>>>>>> limits on the number of open zones and active zones that a zoned drive may >>>>>>>> have. >>>>>>> >>>>>>> OK, thanks for the clarification about implicitly closed, but I >>>>>>> understand this close can't change the zone's write pointer. >>>>>> >>>>>> You are right, it does not matter if the zone is implicitly closed, I >>>>>> was mistaken. But we still have to track the write pointer of every zone >>>>>> in open or active state, otherwise we cannot know if a write that arrive >>>>>> to a zone with no outstanding IO is actually at the write pointer, or >>>>>> whether we need to hold it. >>>>>> >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> zone info can be cached in the mapping(hash table)(zone sector is the key, and zone >>>>>>>>> info is the value), which can be implemented as one LRU style. If any zone >>>>>>>>> info isn't hit in the mapping table, ioctl(BLKREPORTZONE) can be called for >>>>>>>>> obtaining the zone info. >>>>>>>>> >>>>>>>>>> the same zone, we need to sync across queues. Userspace may have >>>>>>>>>> synchronization in place to issue writes with multiple threads while >>>>>>>>>> still hitting the write pointer. >>>>>>>>> >>>>>>>>> You can trust mq-dealine, which guaranteed that write IO is sent to ->queue_rq() >>>>>>>>> in order, no matter MQ or SQ. >>>>>>>>> >>>>>>>>> Yes, it could be issue from multiple queues for ublksrv, which doesn't sync >>>>>>>>> among multiple queues. >>>>>>>>> >>>>>>>>> But per-zone re-order still can solve the issue, just need one lock >>>>>>>>> for each zone to cover the MQ re-order. >>>>>>>> >>>>>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>>>>> more than one write per zone at any time. This is to avoid write >>>>>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>>>>> of having writes reordered. >>>>>>> >>>>>>> oops, I miss the single queue depth point per zone, so ublk won't break >>>>>>> zoned write at all, and I agree order of batch IOs is one problem, but >>>>>>> not hard to solve. >>>>>> >>>>>> The current implementation _does_ break zoned write because it reverses >>>>>> batched writes. But if it is an easy fix, that is cool :) >>>>> >>>>> Please look at Damien's comment: >>>>> >>>>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>>>> more than one write per zone at any time. This is to avoid write >>>>>>> reordering. So multi queue or not, for any zone, there is no possibility >>>>>>> of having writes reordered. >>>>> >>>>> For zoned write, mq-deadline is used to limit at most one inflight write >>>>> for each zone. >>>>> >>>>> So can you explain a bit how the current implementation breaks zoned >>>>> write? >>>> >>>> Like Damien wrote in another email, mq-deadline will only impose >>>> ordering for requests submitted in batch. The flow we have is the >>>> following: >>>> >>>> - Userspace sends requests to ublk gendisk >>>> - Requests go through block layer and is _not_ reordered when using >>>> mq-deadline. They may be split. >>>> - Requests hit ublk_drv and ublk_drv will reverse order of _all_ >>>> batched up requests (including split requests). >>> >>> For ublk-zone, ublk driver needs to be exposed as zoned device by >>> calling disk_set_zoned() finally, which definitely isn't supported now, >>> so mq-deadline at most sends one write IO for each zone after ublk-zone >>> is supported, see blk_req_can_dispatch_to_zone(). >>> >>>> - ublk_drv sends request to ublksrv in _reverse_ order. >>>> - ublksrv sends requests _not_ batched up to target device. >>>> - Requests that enter mq-deadline at the same time are reordered in LBA >>>> order, that is all good. >>>> - Requests that enter the kernel in different batches are not reordered >>>> in LBA order and end up missing the write pointer. This is bad. >>> >>> Again, please read Damien's comment: >>> >>>>> That lock is already there and using it, mq-deadline will never dispatch >>>>> more than one write per zone at any time. >>> >>> Anytime, there is at most one write IO for each zone, how can the single >>> write IO be re-order? >> >> If the user issues writes one at a time out of order (not aligned to the >> write pointer), mq-deadline will not help at all. The zone write locking >> will still limit write dispatching to one per zone, but the writes will fail. >> >> mq-deadline will reorder write commands in the correct lba order only if: >> - the commands are inserted as a batch (more than on request passed to >> ->insert_requests) >> - commands are inserted individually when the target zone is locked (a >> write is already being executed) >> >> This has been the semantic from the start: the block layer has no >> guarantees about the correct ordering of writes to zoned drive. What is >> guaranteed is that (1) if the user issues writes in order AND (2) >> mq-deadline is used, then writes will be dispatched in the same order to >> the device. >> >> I have not looked at the details of ublk, but from the thread, I think (1) >> is not done and (2) is missing-ish as the ublk device is not marked as zoned. > > (2) is supposed to be done for support ublk-zone, at least in the > Andreas's PR, which just implements one ublk-loop over one real backing > zoned disk. > > mq-deadline dispatches one write IO for each zone on /dev/ublkbN( > supposed to be exposed as zoned, not done yet), and this write IO will > be forwarded to ublksrv, so ublksrv can only see one such write IO for > each zone, which will be re-issued to the backing real zoned disk, so > there can't be the write io reorder issue, cause the write batch just > has one write IO for each zone. Good point ! > > > [1] https://github.com/ming1/ubdsrv/pull/28/files > > Thanks, > Ming > -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-19 0:24 ` Damien Le Moal 2022-11-19 7:36 ` Andreas Hindborg 2022-11-20 14:37 ` Ming Lei @ 2022-11-21 8:03 ` Christoph Hellwig 2022-11-21 8:13 ` Ming Lei 2 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2022-11-21 8:03 UTC (permalink / raw) To: Damien Le Moal; +Cc: Ming Lei, Andreas Hindborg, Jens Axboe, linux-block I've been wading through this thread a bit, and while I can't follow everything due some pretty bad reply trimming I'm really really confused by this whole discussion. Every single storage system, be that hard drives, SSDs, file systems or what else much preferres sequential write. Any code that turns perfectly sequential writes into reverse sequential writes is a massive performance bug, zoned or not. So totally independent of zone device support in ublk this needs to be fixed ASAP. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-21 8:03 ` Christoph Hellwig @ 2022-11-21 8:13 ` Ming Lei 0 siblings, 0 replies; 27+ messages in thread From: Ming Lei @ 2022-11-21 8:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Damien Le Moal, Andreas Hindborg, Jens Axboe, linux-block, ming.lei On Mon, Nov 21, 2022 at 12:03:02AM -0800, Christoph Hellwig wrote: > I've been wading through this thread a bit, and while I can't follow > everything due some pretty bad reply trimming I'm really really > confused by this whole discussion. > > Every single storage system, be that hard drives, SSDs, file systems > or what else much preferres sequential write. Any code that turns > perfectly sequential writes into reverse sequential writes is a massive > performance bug, zoned or not. > > So totally independent of zone device support in ublk this needs to be > fixed ASAP. I will send one fix soon. BTW, blk-mq has such similar issue, which is reported for while, but not addressed yet. https://lore.kernel.org/linux-nvme/CAHex0coMNvFa1TPzK+5mZHsiie4d1Jd0Z8ejcZk1Vi1_4F7eRg@mail.gmail.com/T/#mbcb9b52ff61aed6fdd1b6630c6e62e55a4ed898f Thanks, Ming ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Reordering of ublk IO requests 2022-11-17 9:07 ` Andreas Hindborg 2022-11-17 11:47 ` Ming Lei @ 2022-11-17 13:00 ` Damien Le Moal 1 sibling, 0 replies; 27+ messages in thread From: Damien Le Moal @ 2022-11-17 13:00 UTC (permalink / raw) To: Andreas Hindborg, Ming Lei; +Cc: Jens Axboe, linux-block On 11/17/22 18:07, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > >> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote: >>> >>> Ming Lei <ming.lei@redhat.com> writes: >>> >>>> Hi Andreas, >>>> >>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote: >>>>> >>>>> Hi Ming, >>>>> >>>>> I have a question regarding ublk. For context, I am working on adding >>>>> zoned storage support to ublk, and you have been very kind to help me on >>>>> Github [1]. >>>>> >>>>> I have a problem with ordering of requests after they are issued to the >>>>> ublk driver. Basically ublk will reverse the ordering of requests that are >>>>> batched. The behavior can be observed with the following fio workload: >>>>> >>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1 >>>>>> --size=4m --filename=/dev/ublkb0 >>>>> >>>>> For a loopback ublk target I get the following from blktrace: >>>>> >>>>>> 259,2 0 3469 286.337681303 724 D R 0 + 1024 [fio] >>>>>> 259,2 0 3470 286.337691313 724 D R 1024 + 1024 [fio] >>>>>> 259,2 0 3471 286.337694423 724 D R 2048 + 1024 [fio] >>>>>> 259,2 0 3472 286.337696583 724 D R 3072 + 1024 [fio] >>>>>> 259,2 0 3473 286.337698433 724 D R 4096 + 1024 [fio] >>>>>> 259,2 0 3474 286.337700213 724 D R 5120 + 1024 [fio] >>>>>> 259,2 0 3475 286.337702723 724 D R 6144 + 1024 [fio] >>>>>> 259,2 0 3476 286.337704323 724 D R 7168 + 1024 [fio] >>>>>> 259,1 0 1794 286.337794934 390 D R 6144 + 2048 [ublk] >>>>>> 259,1 0 1795 286.337805504 390 D R 4096 + 2048 [ublk] >>>>>> 259,1 0 1796 286.337816274 390 D R 2048 + 2048 [ublk] >>>>>> 259,1 0 1797 286.337821744 390 D R 0 + 2048 [ublk] >>>>> >>>>> And enabling debug prints in ublk shows that the reversal happens when >>>>> ublk defers work to the io_uring context: >>>>> >>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0 >>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024 >>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048 >>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072 >>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096 >>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120 >>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144 >>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024 >>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0 >>>>> >>>>> The problem seems to be that the method used to defer work to the >>>>> io_uring context, task_work_add(), is using a stack to queue the >>>>> callbacks. >>>> >>>> Is the observation done on zoned ublk or plain ublk-loop? >>> >>> I collected this trace on my fork, but since the behavior comes from >>> task_work.c using a stack to queue work, it would be present on upstream >>> ublk as well. For completeness, I have verified that this is the case. >>> >>>> If it is plain ublk-loop, the reverse order can be started from >>>> blk_add_rq_to_plug(), but it won't happen for zoned write request >>>> which isn't queued to plug list. >>> >>> I forgot to mention that I set the deadline scheduler for both ublkb0 >>> and the loopback target. No reordering should happen in mq with the >>> deadline scheduler, as far as I understand. >> >> I meant you need to setup one zoned ublk-loop for observing write request >> order, block layer never guarantees request order for other devices. >> >>> >>>> >>>> Otherwise, can you observe zoned write req reorder from ublksrv side? >>>> >>> >>> Yes, the reverse order is observable in ublk server. Reordering happens >>> in ublk kernel driver. >>> >>>>> >>>>> As you probably are aware, writes to zoned storage must >>>>> happen at the write pointer, so when order is lost there is a problem. >>>>> But I would assume that this behavior is also undesirable in other >>>>> circumstances. >>>>> >>>>> I am not sure what is the best approach to change this behavor. Maybe >>>>> having a separate queue for the requests and then only scheduling work >>>>> if a worker is not already processing the queue? >>>>> >>>>> If you like, I can try to implement a fix? >>>> >>>> Yeah, I think zoned write requests could be forwarded to ublk server out of order. >>> >>> In reverse order for requests that are issued without a context switch, >>> as far as I can tell. >>> >>>> >>>> Is it possible for re-order the writes in ublksrv side? I guess it is >>>> be doable: >>>> >>>> 1) zoned write request is sent to ublk_queue_rq() in order always >>>> >>>> 2) when ublksrv zone target/backend code gets zoned write request in each >>>> zone, it can wait until the next expected write comes, then handle the >>>> write and advance write pointer, then repeat the whole process. >>>> >>>> 3) because of 1), the next expected zoned write req is guaranteed to >>>> come without much delay, and the per-zone queue won't be long. >>> >>> I think it is not feasible to have per zone data structures. Instead, I >> >> If one mapping data structure is used for ordering per-zone write >> request, it could be pretty easy, such as C++'s map or hash table, and it >> won't take much memory given the max in-flight IOs are very limited and >> the zone mapping entry can be reused among different zone, and quite easy >> to implement. >> >> Also most of times, the per-zone ordering can be completed in current >> batch(requests completed from single io_uring_enter()), then the extra >> cost could be close to zero, you can simply run the per-zone ordering in >> ->handle_io_background() callback, when all requests could come for each >> zone most of times. >> >>> considered adding a sequence number to each request, and then queue up >>> requests if there is a gap in the numbering. >> >> IMO, that won't be doable, given you don't know when the sequence will be over. > > We would not need to know when the sequence is over. If in ubdsrv we see > request number 1,2,4, we could hold 4 until 3 shows up. When 3 shows up > we go ahead and submit all requests from 3 and up, until there is > another gap. If ublk_drv is setting the sequence numbers, > cancelled/aborted requests would not be an issue. Do not go there. That will never work reliably. We tried that already at the scheduler level to try to simplify file system support for zoned drives. Even waiting 10s of seconds would sometime not be enough to see all the gaps filled. > I think this would be less overhead than having per zone data structure. > > But I still think it is an unnecessary hack. We could just fix the driver. > >> >>> >>> But really, the issue should be resolved by changing the way >>> ublk_queue_rq() is sending work to uring context with task_add_work(). >> >> Not sure it can be solved easily given llist is implemented in this way. > > If we queue requests on a separate queue, we are fine. I can make a > patch suggestion. > >> >>> Fixing in ublksrv is a bit of a hack and will have performance penalty. >> >> As I mentioned, ordering zoned write request in each zone just takes >> a little memory(mapping, or hashing) and the max size of the mapping >> table is queue depth, and basically zero cpu cost, not see extra >> performance penalty is involved. > > I could implement all three solutions. 1) fix the dirver, 2) per zone > structure and 3) sequence numbers. Then I benchmark them and we will > know what works. It's a lot of work though. scrap (3). It will not work. >>> Also, it can not be good for any storage device to have sequential >>> requests delivered in reverse order? Fixing this would benefit all targets. >> >> Only zoned target has strict ordering requirement which does take cost, block >> layer never guarantees request order. As I mentioned, blk_add_rq_to_plug() >> can re-order requests in reverse order too. > > When mq-deadline scheduler is set for a queue, requests are not > reordered, right? They are not, they are ordered in LBA and arrival time order. But that's it. If the user issues the requests out of order and the current QD is lower than the max, the requests will immediately go out of mq-deadline to dispatch out of order too. What mq-deadline does for zones is guarantee that writes will be sent out one at a time, starting with the first request issued by the user. If that one is out of order, then no luck. That is the user's fault. Subsequent user request will be blocked until the one previously issued completes. So these requests may be reordered (in LBA order), thus hiding potential bug from the user side issuing requests out of order. mq-deadline does not even attempt to be intelligent about reordering and filling gaps in write sequences. As mentioned above, any attempt to do that can always fail with some corner case workload. This is not reliable. > I am no block layer expert, so I cannot argue about the implementation. > But I think that both spinning rust and flash would benefit from having > sequential requests delivered in order? Would it not hurt performance to > reverse order for sequential requests all the time? I have a hard time > understanding why the block layer would do this by default. It is not in purpose. The block layer tries to process things in order. But because of the locking model and potential requeuing of request from the device driver, reordering can happen. Even some HW are happy to silently screw up request ordering (e.g. any AHCI adapter will). > One thing is to offer no guarantees, but to _always_ reverse the > ordering of sequential requests seem a little counter productive to me. That I agree. There are no guarantees and never will be any. But processing requests in the reverse order of their issuing order is a bug. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2022-11-21 10:17 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-11-16 15:00 Reordering of ublk IO requests Andreas Hindborg 2022-11-17 2:18 ` Ming Lei 2022-11-17 8:05 ` Andreas Hindborg 2022-11-17 8:52 ` Ming Lei 2022-11-17 9:07 ` Andreas Hindborg 2022-11-17 11:47 ` Ming Lei 2022-11-17 11:59 ` Andreas Hindborg 2022-11-17 13:11 ` Damien Le Moal 2022-11-17 13:31 ` Andreas Hindborg 2022-11-18 1:51 ` Damien Le Moal 2022-11-18 9:29 ` Andreas Hindborg 2022-11-18 4:12 ` Ming Lei 2022-11-18 4:35 ` Damien Le Moal 2022-11-18 6:07 ` Ming Lei 2022-11-18 9:41 ` Andreas Hindborg 2022-11-18 11:28 ` Ming Lei 2022-11-18 11:49 ` Andreas Hindborg 2022-11-18 12:46 ` Andreas Hindborg 2022-11-18 12:47 ` Ming Lei 2022-11-19 0:24 ` Damien Le Moal 2022-11-19 7:36 ` Andreas Hindborg 2022-11-21 10:15 ` Andreas Hindborg 2022-11-20 14:37 ` Ming Lei 2022-11-21 1:25 ` Damien Le Moal 2022-11-21 8:03 ` Christoph Hellwig 2022-11-21 8:13 ` Ming Lei 2022-11-17 13:00 ` Damien Le Moal
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.