* BUG: Hung task timeouts in for-4.10/dio
@ 2016-11-08 18:16 Logan Gunthorpe
  2016-11-08 18:21 ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Logan Gunthorpe @ 2016-11-08 18:16 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

Hi guys,

We were looking at testing the new IO polling improvements and we built
a kernel from the 'for-4.10/dio' (64ead7d) branch in linux-block.
However this branch seems to cause hung tasks when booted. Most
noticeably, dhclient seems to always hang as it tries to read from its
leases file, and that means networking does not work on the computers we
tested on. Other tasks seemed to hang occasionally and randomly.

We tested on two machines with radically different hardware but both
running Debian Jessie.  (One is a dual-socket server system with the
root FS on an HDD and the other is an off the shelf commodity
motherboard with root on an SSD.)

We performed a bisect to find the culprit commit to be:

[b685d3d65ac791406e0dfd8779cc9b3707fea5a3] block: treat REQ_FUA and
REQ_PREFLUSH as synchronous

I've attached a bisect log.

Thanks,

Logan

[-- Attachment #2: block-bisect.log --]
[-- Type: text/x-log, Size: 1346 bytes --]

git bisect start
# good: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1
git bisect good 1001354ca34179f3db924eb66672442a173147dc
# bad: [64ead7d24b34ed40e85577e9be8bd203835e57c4] blk-mq: make the polling code adaptive
git bisect bad 64ead7d24b34ed40e85577e9be8bd203835e57c4
# bad: [70fd76140a6cb63262bd47b68d57b42e889c10ee] block,fs: use REQ_* flags directly
git bisect bad 70fd76140a6cb63262bd47b68d57b42e889c10ee
# good: [2552e3f878c2b43b41d7728a328821d8220c28da] blk-mq: get rid of confusing blk_map_ctx structure
git bisect good 2552e3f878c2b43b41d7728a328821d8220c28da
# good: [aa39ebd404423e62f74cfd3e27e9ffe7e38b2a25] cfq-iosched: use op_is_sync instead of opencoding it
git bisect good aa39ebd404423e62f74cfd3e27e9ffe7e38b2a25
# good: [67f055c798c72c49ee0c844eae0cd6e9c83b1b16] btrfs: use op_is_sync to check for synchronous requests
git bisect good 67f055c798c72c49ee0c844eae0cd6e9c83b1b16
# bad: [b685d3d65ac791406e0dfd8779cc9b3707fea5a3] block: treat REQ_FUA and REQ_PREFLUSH as synchronous
git bisect bad b685d3d65ac791406e0dfd8779cc9b3707fea5a3
# good: [6f6b29171a192e84b666c816e49d2175afbbb09f] block: don't use REQ_SYNC in the READ_SYNC definition
git bisect good 6f6b29171a192e84b666c816e49d2175afbbb09f
# first bad commit: [b685d3d65ac791406e0dfd8779cc9b3707fea5a3] block: treat REQ_FUA and REQ_PREFLUSH as synchronous


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 18:16 BUG: Hung task timeouts in for-4.10/dio Logan Gunthorpe
@ 2016-11-08 18:21 ` Jens Axboe
  2016-11-08 18:59   ` Logan Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-08 18:21 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

On 11/08/2016 11:16 AM, Logan Gunthorpe wrote:
> Hi guys,
>
> We were looking at testing the new IO polling improvements and we built
> a kernel from the 'for-4.10/dio' (64ead7d) branch in linux-block.
> However this branch seems to cause hung tasks when booted. Most
> noticeably, dhclient seems to always hang as it tries to read from its
> leases file, and that means networking does not work on the computers we
> tested on. Other tasks seemed to hang occasionally and randomly.
>
> We tested on two machines with radically different hardware but both
> running Debian Jessie.  (One is a dual-socket server system with the
> root FS on an HDD and the other is an off the shelf commodity
> motherboard with root on an SSD.)
>
> We performed a bisect to find the culprit commit to be:
>
> [b685d3d65ac791406e0dfd8779cc9b3707fea5a3] block: treat REQ_FUA and
> REQ_PREFLUSH as synchronous
>
> I've attached a bisect log.

I don't think that's right. The version you ran has a bug in the stats
code. Please update to the current for-4.10/dio branch (82a78cd682bf)
and I think you'll have more luck.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 18:21 ` Jens Axboe
@ 2016-11-08 18:59   ` Logan Gunthorpe
  2016-11-08 19:01     ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Logan Gunthorpe @ 2016-11-08 18:59 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

Hi Jens,

Thanks for the quick reply. I just built 82a78cd and I'm seeing the same
problem as reported.

Logan

On 08/11/16 11:21 AM, Jens Axboe wrote:
> On 11/08/2016 11:16 AM, Logan Gunthorpe wrote:
>> Hi guys,
>>
>> We were looking at testing the new IO polling improvements and we built
>> a kernel from the 'for-4.10/dio' (64ead7d) branch in linux-block.
>> However this branch seems to cause hung tasks when booted. Most
>> noticeably, dhclient seems to always hang as it tries to read from its
>> leases file, and that means networking does not work on the computers we
>> tested on. Other tasks seemed to hang occasionally and randomly.
>>
>> We tested on two machines with radically different hardware but both
>> running Debian Jessie.  (One is a dual-socket server system with the
>> root FS on an HDD and the other is an off the shelf commodity
>> motherboard with root on an SSD.)
>>
>> We performed a bisect to find the culprit commit to be:
>>
>> [b685d3d65ac791406e0dfd8779cc9b3707fea5a3] block: treat REQ_FUA and
>> REQ_PREFLUSH as synchronous
>>
>> I've attached a bisect log.
> 
> I don't think that's right. The version you ran has a bug in the stats
> code. Please update to the current for-4.10/dio branch (82a78cd682bf)
> and I think you'll have more luck.
> 


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 18:59   ` Logan Gunthorpe
@ 2016-11-08 19:01     ` Jens Axboe
  2016-11-08 19:03       ` Logan Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-08 19:01 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

On 11/08/2016 11:59 AM, Logan Gunthorpe wrote:
> Hi Jens,
>
> Thanks for the quick reply. I just built 82a78cd and I'm seeing the same
> problem as reported.

Hmm, very odd. Does it work if you run that branch and revert the commit 
you bisected as troublesome, b685d3d65ac7?

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 19:01     ` Jens Axboe
@ 2016-11-08 19:03       ` Logan Gunthorpe
  2016-11-08 19:19         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Logan Gunthorpe @ 2016-11-08 19:03 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

Hey,

I haven't checked 82a78cd, but when I tried reverting the commit in
yesterday's version there were conflicts, as a subsequent patch removed
the defines that the specific patch operated on.

Logan


On 08/11/16 12:01 PM, Jens Axboe wrote:
> On 11/08/2016 11:59 AM, Logan Gunthorpe wrote:
>> Hi Jens,
>>
>> Thanks for the quick reply. I just built 82a78cd and I'm seeing the same
>> problem as reported.
> 
> Hmm, very odd. Does it work if you run that branch and revert the commit
> you bisected as troublesome, b685d3d65ac7?
> 


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 19:03       ` Logan Gunthorpe
@ 2016-11-08 19:19         ` Jens Axboe
  2016-11-08 19:47           ` Logan Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-08 19:19 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

On 11/08/2016 12:03 PM, Logan Gunthorpe wrote:
> Hey,
>
> I haven't checked 82a78cd, but when I tried reverting the commit in
> yesterday's version there were conflicts, as a subsequent patch removed
> the defines that the specific patch operated on.

Can you try and boot for-4.10/block instead?

-- 
Jens Axboe



* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 19:19         ` Jens Axboe
@ 2016-11-08 19:47           ` Logan Gunthorpe
  2016-11-08 19:50             ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Logan Gunthorpe @ 2016-11-08 19:47 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

Hey,

On 08/11/16 12:19 PM, Jens Axboe wrote:
> Can you try and boot for-4.10/block instead?

Yup. I'm seeing the same issue with that branch too. (b57d74a)

Thanks,

Logan


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 19:47           ` Logan Gunthorpe
@ 2016-11-08 19:50             ` Jens Axboe
       [not found]               ` <84693d82-b8dc-42b3-03da-055def7499a0@deltatee.com>
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-08 19:50 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Stephen Bates

On 11/08/2016 12:47 PM, Logan Gunthorpe wrote:
> Hey,
>
> On 08/11/16 12:19 PM, Jens Axboe wrote:
>> Can you try and boot for-4.10/block instead?
>
> Yup. I'm seeing the same issue with that branch too. (b57d74a)

OK, let's get some more info on this setup then, so we can get to the
bottom of that. Can you send a dmesg from a working boot? What file
systems are you using? Etc.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
       [not found]               ` <84693d82-b8dc-42b3-03da-055def7499a0@deltatee.com>
@ 2016-11-08 20:21                 ` Jens Axboe
  2016-11-08 21:08                   ` Mike Snitzer
  2016-11-09  0:50                   ` Damien Le Moal
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2016-11-08 20:21 UTC (permalink / raw)
  To: Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Mike Snitzer

On 11/08/2016 12:55 PM, Logan Gunthorpe wrote:
> Hey,
>
> I've attached the output of dmesg from a working boot and the output of
> mount.
>
> Pretty much all the file systems are ext4. We have some experimental
> nvme devices in this system which I did try removing to eliminate that
> possibility.
>
> Let me know if you need anything else.

You're using dm, that might be related. Mike, have you tried booting
for-4.10/block and checking if dm works fine?

-- 
Jens Axboe



* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 20:21                 ` Jens Axboe
@ 2016-11-08 21:08                   ` Mike Snitzer
  2016-11-09  0:50                   ` Damien Le Moal
  1 sibling, 0 replies; 23+ messages in thread
From: Mike Snitzer @ 2016-11-08 21:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Logan Gunthorpe, Christoph Hellwig, linux-block

On Tue, Nov 08 2016 at  3:21pm -0500,
Jens Axboe <axboe@fb.com> wrote:

> On 11/08/2016 12:55 PM, Logan Gunthorpe wrote:
> >Hey,
> >
> >I've attached the output of dmesg from a working boot and the output of
> >mount.
> >
> >Pretty much all the file systems are ext4. We have some experimental
> >nvme devices in this system which I did try removing to eliminate that
> >possibility.
> >
> >Let me know if you need anything else.

I'm not seeing Logan's above message (with dmesg) on linux-block
archives.  So I cannot yet appreciate which DM targets are in play.

> You're using dm, that might be related. Mike, have you tried booting
> for-4.10/block and checking if dm works fine?

I haven't tried for-4.10/block yet.  But I should be able to try it this
week.


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-08 20:21                 ` Jens Axboe
  2016-11-08 21:08                   ` Mike Snitzer
@ 2016-11-09  0:50                   ` Damien Le Moal
  2016-11-09  1:09                     ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Damien Le Moal @ 2016-11-09  0:50 UTC (permalink / raw)
  To: Jens Axboe, Logan Gunthorpe, Christoph Hellwig; +Cc: linux-block, Mike Snitzer


Jens,

On 11/9/16 05:21, Jens Axboe wrote:
> On 11/08/2016 12:55 PM, Logan Gunthorpe wrote:
>> Hey,
>>
>> I've attached the output of dmesg from a working boot and the output of
>> mount.
>>
>> Pretty much all the file systems are ext4. We have some experimental
>> nvme devices in this system which I did try removing to eliminate that
>> possibility.
>>
>> Let me know if you need anything else.
> 
> You're using dm, that might be related. Mike, have you tried booting
> for-4.10/block and checking if dm works fine?

Using yesterday's tree, I experienced similar problems with
for-4.10/block without using dm (using ext4 on top of SSDs): random
tasks hung, starting from boot, with the machine eventually completely
freezing.

I did not dig into the problem a lot. I just looked at task stack traces
(echo t > /proc/sysrq-trigger) and noticed that hung tasks are waiting
for requests. Ex:

[   55.356418] plymouthd       D ffffffff81671758     0   353      1 0x00000000
[   55.356419]  ffff8807fbf1ec00 0000000000000000 ffff8807fba6d500 ffff8807fba3b600
[   55.356420]  ffff88081fb97900 ffff8807f04079a8 ffffffff81671758 000000000000158f
[   55.356421]  0000000000000000 ffff8807f3373800 ffff8807fba3b600 ffff88081fb97900
[   55.356421] Call Trace:
[   55.356421]  [<ffffffff81671758>] ? __schedule+0x178/0x650
[   55.356422]  [<ffffffff81671c70>] schedule+0x40/0x90
[   55.356423]  [<ffffffff816749d1>] schedule_timeout+0x2b1/0x3e0
[   55.356424]  [<ffffffff8115419d>] ? mempool_alloc_slab+0x1d/0x30
[   55.356425]  [<ffffffff810e0971>] ? ktime_get+0x41/0xb0
[   55.356426]  [<ffffffff81671574>] io_schedule_timeout+0xa4/0x110
[   55.356427]  [<ffffffff8130ee2b>] get_request+0x3fb/0x7d0
[   55.356428]  [<ffffffff8120fd83>] ? __find_get_block+0xf3/0x180
[   55.356429]  [<ffffffff810be260>] ? wait_woken+0x90/0x90
[   55.356431]  [<ffffffff813117cb>] blk_queue_bio+0xfb/0x3c0
[   55.356432]  [<ffffffff8130fb90>] generic_make_request+0xd0/0x180
[   55.356433]  [<ffffffff8130fcac>] submit_bio+0x6c/0x130
[   55.356436]  [<ffffffff81270f08>] ext4_io_submit+0x38/0x50
[   55.356437]  [<ffffffff8126c241>] ext4_writepages+0x561/0xdb0
[   55.356439]  [<ffffffff811601e1>] do_writepages+0x21/0x30
[   55.356440]  [<ffffffff811520aa>] __filemap_fdatawrite_range+0xaa/0xf0
[   55.356440]  [<ffffffff811524df>] ? __generic_file_write_iter+0x14f/0x1d0
[   55.356441]  [<ffffffff8115213c>] filemap_flush+0x1c/0x20
[   55.356442]  [<ffffffff812698bc>] ext4_alloc_da_blocks+0x2c/0x80
[   55.356443]  [<ffffffff81262268>] ext4_release_file+0x78/0xc0
[   55.356446]  [<ffffffff811db2a9>] __fput+0xb9/0x200
[   55.356447]  [<ffffffff811db42e>] ____fput+0xe/0x10
[   55.356449]  [<ffffffff81097bf5>] task_work_run+0x85/0xb0
[   55.356450]  [<ffffffff810016a7>] exit_to_usermode_loop+0x97/0xa0
[   55.356451]  [<ffffffff810019e3>] syscall_return_slowpath+0x53/0x60
[   55.356452]  [<ffffffff8167605f>] entry_SYSCALL_64_fastpath+0x92/0x94

I needed the ZBC code so I detached the head back to 5f2808f and
everything then worked fine. I will try to bisect.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  0:50                   ` Damien Le Moal
@ 2016-11-09  1:09                     ` Christoph Hellwig
  2016-11-09  1:25                       ` Damien Le Moal
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-11-09  1:09 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Jens Axboe, Logan Gunthorpe, Christoph Hellwig, linux-block,
	Mike Snitzer

Ok, sounds like I'm really the one to blame.  I'll see if I can
find a reproducer.  Damien, are you using device mapper on that
system?


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  1:09                     ` Christoph Hellwig
@ 2016-11-09  1:25                       ` Damien Le Moal
  2016-11-09  1:28                         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Damien Le Moal @ 2016-11-09  1:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Logan Gunthorpe, linux-block, Mike Snitzer


On 11/9/16 10:09, Christoph Hellwig wrote:
> Ok, sounds like I'm really the one to blame.  I'll see if I can
> find a reproducer.  Damien, are you using device mapper on that
> system?

No LVM/md/dm used on boot. Mount is direct to the block device (SSDs
with ext4). The devices are simple SSDs, so no polling involved.

The hangs suspiciously look like they are either background write or
flush. So I was wondering if it is indeed related to FLUSH/FUA as Logan
suggested or the background write stuff, rather than the direct-IO
optimization & polling.

Will try again/bisect to see if I can get more info.

Cheers.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  1:25                       ` Damien Le Moal
@ 2016-11-09  1:28                         ` Jens Axboe
  2016-11-09  2:02                           ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-09  1:28 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Logan Gunthorpe, linux-block, Mike Snitzer

On 11/08/2016 06:25 PM, Damien Le Moal wrote:
>
> On 11/9/16 10:09, Christoph Hellwig wrote:
>> Ok, sounds like I'm really the one to blame.  I'll see if I can
>> find a reproducer.  Damien, are you using device mapper on that
>> system?
>
> No LVM/md/dm used on boot. Mount is direct to the block device (SSDs
> with ext4). The devices are simple SSDs, so no polling involved.
>
> The hangs suspiciously look like they are either background write or
> flush. So I was wondering if it is indeed related to FLUSH/FUA as Logan
> suggested or the background write stuff, rather than the direct-IO
> optimization & polling.

The background write stuff is not in either of those branches, plus the
backtrace would have looked different. Yours is showing us waiting for a
request. I don't think it's the direct-io or polling code, it looks like
a generic issue.

> Will try again/bisect to see if I can get more info.

Maybe try and revert the one that Logan pointed his finger at, if that
is doable.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  1:28                         ` Jens Axboe
@ 2016-11-09  2:02                           ` Jens Axboe
  2016-11-09  2:05                             ` Christoph Hellwig
  2016-11-09  2:17                             ` Damien Le Moal
  0 siblings, 2 replies; 23+ messages in thread
From: Jens Axboe @ 2016-11-09  2:02 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Logan Gunthorpe, linux-block, Mike Snitzer

On 11/08/2016 06:28 PM, Jens Axboe wrote:
> On 11/08/2016 06:25 PM, Damien Le Moal wrote:
>>
>> On 11/9/16 10:09, Christoph Hellwig wrote:
>>> Ok, sounds like I'm really the one to blame.  I'll see if I can
>>> find a reproducer.  Damien, are you using device mapper on that
>>> system?
>>
>> No LVM/md/dm used on boot. Mount is direct to the block device (SSDs
>> with ext4). The devices are simple SSDs, so no polling involved.
>>
>> The hangs suspiciously look like they are either background write or
>> flush. So I was wondering if it is indeed related to FLUSH/FUA as Logan
>> suggested or the background write stuff, rather than the direct-IO
>> optimization & polling.
>
> The background write stuff is not in either of those branches, plus the
> backtrace would have looked different. Yours is showing us waiting for a
> request. I don't think it's the direct-io or polling code, it looks like
> a generic issue.
>
>> Will try again/bisect to see if I can get more info.
>
> Maybe try and revert the one that Logan pointed his finger at, if that
> is doable.

It smells like an accounting error. One thing that I don't like with the
current scheme is the implicit knowledge that certain flags imply sync
as well. If we clear any of those flags, then we screw up accounting at
the end.

Does this make a difference?


diff --git a/block/blk-flush.c b/block/blk-flush.c
index c486b7aa62ee..d70983e28115 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -395,6 +395,8 @@ void blk_insert_flush(struct request *rq)
  	if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
  		rq->cmd_flags &= ~REQ_FUA;

+	rq->cmd_flags |= REQ_SYNC;
+
  	/*
  	 * An empty flush handed down from a stacking driver may
  	 * translate into nothing if the underlying device does not

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:02                           ` Jens Axboe
@ 2016-11-09  2:05                             ` Christoph Hellwig
  2016-11-09  2:11                               ` Jens Axboe
  2016-11-09  2:17                             ` Damien Le Moal
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-11-09  2:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Damien Le Moal, Christoph Hellwig, Logan Gunthorpe, linux-block,
	Mike Snitzer

On Tue, Nov 08, 2016 at 07:02:52PM -0700, Jens Axboe wrote:
> It smells like an accounting error. One thing that I don't like with the
> current scheme is the implicit knowledge that certain flags imply sync
> as well. If we clear any of those flags, then we screw up accounting at
> the end.
>
> Does this make a difference?

That looks interesting.  In the meantime I reproduced a similar
hang, but only half-way through an xfstests run with a non-mq device.
I'll see how far I can narrow it down and will give your patch a try as
well.


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:05                             ` Christoph Hellwig
@ 2016-11-09  2:11                               ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2016-11-09  2:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Logan Gunthorpe, linux-block, Mike Snitzer

On 11/08/2016 07:05 PM, Christoph Hellwig wrote:
> On Tue, Nov 08, 2016 at 07:02:52PM -0700, Jens Axboe wrote:
>> It smells like an accounting error. One thing that I don't like with the
>> current scheme is the implicit knowledge that certain flags imply sync
>> as well. If we clear any of those flags, then we screw up accounting at
>> the end.
>>
>> Does this make a difference?
>
> That looks interesting.  In the meantime I reproduced a similar
> hang, but only half-way through an xfstests run with a non-mq device.
> I'll see how far I can narrow it down and will give your patch a try as
> well.

It'd only trigger on non-mq, and the symptoms (and the bisect) point at
this being an accounting issue. Damien/Logan, it would be great if you
could try the debug patch I sent.

My non-mq drive on my test box has write through caching, which probably
explains why I haven't seen the issue.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:02                           ` Jens Axboe
  2016-11-09  2:05                             ` Christoph Hellwig
@ 2016-11-09  2:17                             ` Damien Le Moal
  2016-11-09  2:38                               ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Damien Le Moal @ 2016-11-09  2:17 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: Logan Gunthorpe, linux-block, Mike Snitzer


Jens,

On 11/9/16 11:02, Jens Axboe wrote:
> It smells like an accounting error. One thing that I don't like with the
> current scheme is the implicit knowledge that certain flags imply sync
> as well. If we clear any of those flags, then we screw up accounting at
> the end.
> 
> Does this make a difference?
> 
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index c486b7aa62ee..d70983e28115 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -395,6 +395,8 @@ void blk_insert_flush(struct request *rq)
>   	if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
>   		rq->cmd_flags &= ~REQ_FUA;
> 
> +	rq->cmd_flags |= REQ_SYNC;
> +
>   	/*
>   	 * An empty flush handed down from a stacking driver may
>   	 * translate into nothing if the underlying device does not

That worked. Still seeing hangs/failures at boot with the latest
for-4.10/block (networkmanager failing, no login shell, etc) but with
the above patch, I get a clean boot and login with network working.
Once booted, simple tests on ext4 and xfs do not show any problem, but
that was only very light testing.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:17                             ` Damien Le Moal
@ 2016-11-09  2:38                               ` Jens Axboe
  2016-11-09  2:40                                 ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-09  2:38 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Logan Gunthorpe, linux-block, Mike Snitzer

On 11/08/2016 07:17 PM, Damien Le Moal wrote:
>
> Jens,
>
> On 11/9/16 11:02, Jens Axboe wrote:
>> It smells like an accounting error. One thing that I don't like with the
>> current scheme is the implicit knowledge that certain flags imply sync
>> as well. If we clear any of those flags, then we screw up accounting at
>> the end.
>>
>> Does this make a difference?
>>
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index c486b7aa62ee..d70983e28115 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -395,6 +395,8 @@ void blk_insert_flush(struct request *rq)
>>   	if (!(fflags & (1UL << QUEUE_FLAG_FUA)))
>>   		rq->cmd_flags &= ~REQ_FUA;
>>
>> +	rq->cmd_flags |= REQ_SYNC;
>> +
>>   	/*
>>   	 * An empty flush handed down from a stacking driver may
>>   	 * translate into nothing if the underlying device does not
>
> That worked. Still seeing hangs/failures at boot with the latest
> for-4.10/block (networkmanager failing, no login shell, etc) but with
> the above patch, I get a clean boot and login with network working.
> Once booted, simple tests on ext4 and xfs do not show any problem, but
> that was only very light testing.

Great, thanks for testing! So that validates the theory. I'll get
something committed with a comment. It's not the prettiest patch, but
it'll do for now.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:38                               ` Jens Axboe
@ 2016-11-09  2:40                                 ` Christoph Hellwig
  2016-11-09  2:45                                   ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2016-11-09  2:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Damien Le Moal, Christoph Hellwig, Logan Gunthorpe, linux-block,
	Mike Snitzer

On Tue, Nov 08, 2016 at 07:38:29PM -0700, Jens Axboe wrote:
> Great, thanks for testing! So that validates the theory. I'll get
> something committed with a comment. It's not the prettiest patch, but
> it'll do for now.

Should we instead revert the change (logically, not with an actual
git revert, as that would break) and require all users to set
REQ_SYNC explicitly?


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:40                                 ` Christoph Hellwig
@ 2016-11-09  2:45                                   ` Jens Axboe
  2016-11-09  2:55                                     ` Damien Le Moal
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2016-11-09  2:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Logan Gunthorpe, linux-block, Mike Snitzer

On 11/08/2016 07:40 PM, Christoph Hellwig wrote:
> On Tue, Nov 08, 2016 at 07:38:29PM -0700, Jens Axboe wrote:
>> Great, thanks for testing! So that validates the theory. I'll get
>> something committed with a comment. It's not the prettiest patch, but
>> it'll do for now.
>
> Should we instead revert the change (logically, not with an actual
> git revert, as that would break) and require all users to set
> REQ_SYNC explicitly?

I just committed the work-around. But yes, let's have a logical revert
and require that REQ_SYNC be set for REQ_FUA|REQ_PREFLUSH to avoid it
being this fragile.

-- 
Jens Axboe


* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:45                                   ` Jens Axboe
@ 2016-11-09  2:55                                     ` Damien Le Moal
  2016-11-09 17:03                                       ` Logan Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Damien Le Moal @ 2016-11-09  2:55 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig; +Cc: Logan Gunthorpe, linux-block, Mike Snitzer


Jens,

On 11/9/16 11:45, Jens Axboe wrote:
> I just committed the work-around. But yes, let's have a logical revert
> and require that REQ_SYNC be set for REQ_FUA|REQ_PREFLUSH to avoid it
> being this fragile.

Great! Thank you.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
Damien.LeMoal@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com

* Re: BUG: Hung task timeouts in for-4.10/dio
  2016-11-09  2:55                                     ` Damien Le Moal
@ 2016-11-09 17:03                                       ` Logan Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Logan Gunthorpe @ 2016-11-09 17:03 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe, Christoph Hellwig
  Cc: linux-block, Mike Snitzer, Stephen Bates

Hey,

I just tested with the latest for-4.10/block branch and it looks like it
fixed our problem.

Thanks!

Logan

On 08/11/16 07:55 PM, Damien Le Moal wrote:
> 
> Jens,
> 
> On 11/9/16 11:45, Jens Axboe wrote:
>> I just committed the work-around. But yes, let's have a logical revert
>> and require that REQ_SYNC be set for REQ_FUA|REQ_PREFLUSH to avoid it
>> being this fragile.
> 
> Great! Thank you.
> 
