linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* block: only use __set_current_state() for polled IO
@ 2019-01-02 17:55 Jens Axboe
  2019-01-02 18:07 ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2019-01-02 17:55 UTC (permalink / raw)
  To: linux-block; +Cc: Christoph Hellwig, Linus Torvalds

A previous commit deemed it safe to use __set_current_state() for IRQ
driven O_DIRECT, but that's isn't necessarily the case. Be safer and
only apply that optimization to polled IO, where we know the the task is
going to find the completions itself.

Fixes: 849a370016a5 ("block: avoid ordered task state change for polled IO")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e1886cc7048f..b5fba2922504 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -193,6 +193,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
+	bool is_poll;
 	struct bio bio;
 	ssize_t ret;
 	blk_qc_t qc;
@@ -232,18 +233,21 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		bio.bi_opf = dio_bio_write_op(iocb);
 		task_io_account_write(ret);
 	}
-	if (iocb->ki_flags & IOCB_HIPRI)
+	is_poll = (iocb->ki_flags & IOCB_HIPRI) != 0;
+	if (is_poll)
 		bio.bi_opf |= REQ_HIPRI;
 
 	qc = submit_bio(&bio);
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		if (is_poll)
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+		else
+			set_current_state(TASK_UNINTERRUPTIBLE);
 
 		if (!READ_ONCE(bio.bi_private))
 			break;
 
-		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc, true))
+		if (!is_poll || !blk_poll(bdev_get_queue(bdev), qc, true))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
@@ -426,13 +430,15 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		return -EIOCBQUEUED;
 
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		if (is_poll)
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+		else
+			set_current_state(TASK_UNINTERRUPTIBLE);
 
 		if (!READ_ONCE(dio->waiter))
 			break;
 
-		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(bdev), qc, true))
+		if (!is_poll || !blk_poll(bdev_get_queue(bdev), qc, true))
 			io_schedule();
 	}
 	__set_current_state(TASK_RUNNING);
diff --git a/fs/iomap.c b/fs/iomap.c
index 9a5bf1e8925b..30c5b7d9aca9 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1790,6 +1790,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos, start = pos;
 	loff_t end = iocb->ki_pos + count - 1, ret = 0;
+	bool is_poll = (iocb->ki_flags & IOCB_HIPRI) != 0;
 	unsigned int flags = IOMAP_DIRECT;
 	struct blk_plug plug;
 	struct iomap_dio *dio;
@@ -1908,13 +1909,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			return -EIOCBQUEUED;
 
 		for (;;) {
-			__set_current_state(TASK_UNINTERRUPTIBLE);
+			if (is_poll)
+				__set_current_state(TASK_UNINTERRUPTIBLE);
+			else
+				set_current_state(TASK_UNINTERRUPTIBLE);
 
 			if (!READ_ONCE(dio->submit.waiter))
 				break;
 
-			if (!(iocb->ki_flags & IOCB_HIPRI) ||
-			    !dio->submit.last_queue ||
+			if (!is_poll || !dio->submit.last_queue ||
 			    !blk_poll(dio->submit.last_queue,
 					 dio->submit.cookie, true))
 				io_schedule();

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 17:55 block: only use __set_current_state() for polled IO Jens Axboe
@ 2019-01-02 18:07 ` Linus Torvalds
  2019-01-02 18:10   ` Jens Axboe
  2019-01-02 18:11   ` Linus Torvalds
  0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2019-01-02 18:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

On Wed, Jan 2, 2019 at 9:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> A previous commit deemed it safe to use __set_current_state() for IRQ
> driven O_DIRECT, but that's isn't necessarily the case. Be safer and
> only apply that optimization to polled IO, where we know the the task is
> going to find the completions itself.

No.

We're just reverting 849a370016a5.

There is absolutely no point in trying to avoid a memory barrier that
is maybe 15 CPU cycles.

Stop this craziness. The optimization is garbage. If you want to save
15 cycles, get rid of code, don't add new code in an area where the
block maintainers have already shown that they can't get it right.

               Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 18:07 ` Linus Torvalds
@ 2019-01-02 18:10   ` Jens Axboe
  2019-01-02 18:11   ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-01-02 18:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, Christoph Hellwig

On 1/2/19 11:07 AM, Linus Torvalds wrote:
> On Wed, Jan 2, 2019 at 9:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> A previous commit deemed it safe to use __set_current_state() for IRQ
>> driven O_DIRECT, but that's isn't necessarily the case. Be safer and
>> only apply that optimization to polled IO, where we know the the task is
>> going to find the completions itself.
> 
> No.
> 
> We're just reverting 849a370016a5.
> 
> There is absolutely no point in trying to avoid a memory barrier that
> is maybe 15 CPU cycles.
> 
> Stop this craziness. The optimization is garbage. If you want to save
> 15 cycles, get rid of code, don't add new code in an area where the
> block maintainers have already shown that they can't get it right.

Fine, we'll just revert the damn thing then.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 18:07 ` Linus Torvalds
  2019-01-02 18:10   ` Jens Axboe
@ 2019-01-02 18:11   ` Linus Torvalds
  2019-01-02 18:14     ` Jens Axboe
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2019-01-02 18:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

On Wed, Jan 2, 2019 at 10:07 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Stop this craziness. The optimization is garbage. If you want to save
> 15 cycles, get rid of code, don't add new code in an area where the
> block maintainers have already shown that they can't get it right.

Side note: that argument is further strengthened by a very simple
observation: if the code isn't going to sleep, then the whole state
setting is entirely and utterly pointless.

So code like this:

+                       if (is_poll)
+                               __set_current_state(TASK_UNINTERRUPTIBLE);

is *FUNDAMENTAL* garbage.

Seriously, stop playing stupid games. You're doing things wrong.

I'm doing to revert that buggy commit, and I will not pull any crazy
code that tries to redo this kind of completely broken code.

                 Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 18:11   ` Linus Torvalds
@ 2019-01-02 18:14     ` Jens Axboe
  2019-01-02 18:55       ` Linus Torvalds
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2019-01-02 18:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, Christoph Hellwig

On 1/2/19 11:11 AM, Linus Torvalds wrote:
> On Wed, Jan 2, 2019 at 10:07 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Stop this craziness. The optimization is garbage. If you want to save
>> 15 cycles, get rid of code, don't add new code in an area where the
>> block maintainers have already shown that they can't get it right.
> 
> Side note: that argument is further strengthened by a very simple
> observation: if the code isn't going to sleep, then the whole state
> setting is entirely and utterly pointless.
> 
> So code like this:
> 
> +                       if (is_poll)
> +                               __set_current_state(TASK_UNINTERRUPTIBLE);
> 
> is *FUNDAMENTAL* garbage.
> 
> Seriously, stop playing stupid games. You're doing things wrong.
> 
> I'm doing to revert that buggy commit, and I will not pull any crazy
> code that tries to redo this kind of completely broken code.

Note that the blk-mq hunk is fine, since that's ONLY used for polling.
We should get rid of the state setting there completely though, which is
something we can do now that polling is strictly a non-IRQ type of IO.

The swap part may fall back to non-polled, so better safer than sorry on
that one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 18:14     ` Jens Axboe
@ 2019-01-02 18:55       ` Linus Torvalds
  2019-01-02 18:59         ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2019-01-02 18:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 927 bytes --]

On Wed, Jan 2, 2019 at 10:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Note that the blk-mq hunk is fine, since that's ONLY used for polling.

Actually, it's fine for another reason too: using
"__set_current_state()" is always safe when the state you're setting
things to is TASK_RUNNING. Even if there's a concurrent wakeup, that's
only going to set things running anyway, so there's no race.

But yes, if something is truly just polling, then it shouldn't touch
the task state at all.

That said, some of the "polling" code looks mis-named. For example,
the blk_poll() call in mm/page_io.c looks like it's polling, but it
can actually sleep (ie blk_mq_poll_hybrid() gets called even when
"spin" is true).

But the sleeping case looks like it handles the task state thing on
its own, so all those polling codepaths should probably be inspected.

In the meantime, the attached is what I committed.

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3148 bytes --]

commit 1ac5cd4978794bd060d448acc0305e9fc996ba92
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jan 2 10:46:03 2019 -0800

    block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE)
    
    This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO").  It was wrongly claiming that the ordering
    wasn't necessary.  The memory barrier _is_ necessary.
    
    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier.  Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.
    
    Note that sometimes the memory barrier can be elsewhere.  For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).
    
    But none of those cases were true here.
    
    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.
    
    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).
    
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 450be88cffef..c546cdce77e6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -237,11 +237,9 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 
 	qc = submit_bio(&bio);
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(bio.bi_private))
 			break;
-
 		if (!(iocb->ki_flags & IOCB_HIPRI) ||
 		    !blk_poll(bdev_get_queue(bdev), qc, true))
 			io_schedule();
@@ -426,8 +424,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		return -EIOCBQUEUED;
 
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(dio->waiter))
 			break;
 
diff --git a/fs/iomap.c b/fs/iomap.c
index 3a0cd557b4cf..a3088fae567b 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1921,8 +1921,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			return -EIOCBQUEUED;
 
 		for (;;) {
-			__set_current_state(TASK_UNINTERRUPTIBLE);
-
+			set_current_state(TASK_UNINTERRUPTIBLE);
 			if (!READ_ONCE(dio->submit.waiter))
 				break;
 
diff --git a/mm/page_io.c b/mm/page_io.c
index 3475733b1926..d975fa3f02aa 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -405,8 +405,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	bio_get(bio);
 	qc = submit_bio(bio);
 	while (synchronous) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		if (!READ_ONCE(bio->bi_private))
 			break;
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: block: only use __set_current_state() for polled IO
  2019-01-02 18:55       ` Linus Torvalds
@ 2019-01-02 18:59         ` Jens Axboe
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2019-01-02 18:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-block, Christoph Hellwig

On 1/2/19 11:55 AM, Linus Torvalds wrote:
> On Wed, Jan 2, 2019 at 10:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Note that the blk-mq hunk is fine, since that's ONLY used for polling.
> 
> Actually, it's fine for another reason too: using
> "__set_current_state()" is always safe when the state you're setting
> things to is TASK_RUNNING. Even if there's a concurrent wakeup, that's
> only going to set things running anyway, so there's no race.

Right, that is of course correct, guess I glanced too quickly at that
one for the revert.

> But yes, if something is truly just polling, then it shouldn't touch
> the task state at all.

Exactly, and I agree this was going down the wrong path. I'll work
on getting rid of the state setting for polled IO, that makes a lot
more sense than trying to fiddle with __set_current_state() when
we aren't going to be sleeping at all.

> That said, some of the "polling" code looks mis-named. For example,
> the blk_poll() call in mm/page_io.c looks like it's polling, but it
> can actually sleep (ie blk_mq_poll_hybrid() gets called even when
> "spin" is true).

That's another case that needs a bit of cleanup, the hybrid polling
only really makes sense for sync polling. FWIW, the 'spin == true'
doesn't control if we're going to be sleeping or not, it's the
caller saying "keep going until you find a completion we care about".

> But the sleeping case looks like it handles the task state thing on
> its own, so all those polling codepaths should probably be inspected.
> 
> In the meantime, the attached is what I committed.

Looks good to me, matches my manual revert.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-01-02 18:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-02 17:55 block: only use __set_current_state() for polled IO Jens Axboe
2019-01-02 18:07 ` Linus Torvalds
2019-01-02 18:10   ` Jens Axboe
2019-01-02 18:11   ` Linus Torvalds
2019-01-02 18:14     ` Jens Axboe
2019-01-02 18:55       ` Linus Torvalds
2019-01-02 18:59         ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).