* disk-io lockup in 4.14.13 kernel
@ 2018-02-22 10:58 Jaco Kroon
  2018-02-22 16:46 ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-02-22 10:58 UTC (permalink / raw)
  To: linux-block; +Cc: Pieter Kruger

Hi,

We've been seeing sporadic IO lockups on recent kernels.

Currently installed on the server is 4.14.13.

Previously we ran 4.0.9, due to various problems from 4.1 onward with
RAID, netfilter etc ... with 4.13 being the first usable kernel again
that doesn't require additional patching.  I'm not sure whether the
current situation is RAID related or an actual disk problem; I doubt
it's the filesystem (ext4 in this case).

As far as I can tell the issue has been coming along since the
introduction of 4.1.

We've got LVM in use, with dm-5 representing vg=lvm, lv=home, mounted on
/home (quite a big one).  Two PVs back that, each an mdadm-based RAID6.
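
For reference, the layering can be confirmed with something like the
following (device names as described above; a sketch, not verified output):

pvs                          # the two RAID6-backed PVs
lvs -o +devices lvm          # lvm/home and the PVs behind it
cat /proc/mdstat             # mdadm array state
mdadm --detail /dev/md124    # per-array detail (md device name assumed)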

The clear symptom that we see every time is this snippet from iostat -dmx 1:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sds               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdq               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdr               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdp               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md127             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md125             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md126             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md124             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Normally the sd* devices carry approximately the same load as the md*
and dm-* devices (in terms of number of requests, not utilization %),
so outstanding requests on dm-* are mirrored by outstanding requests on
md* and sd*.  That is not the case above, which leads me to believe
that dm-5 received a request but for some reason never completed it
(ie, the kernel believes there is still an outstanding request).
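
A minimal sysfs cross-check of the same thing, assuming the device names
above (the inflight attribute reports in-flight read/write request counts):

cd /sys/block
for d in dm-5 md12? sd?; do
    printf '%-6s %s\n' "$d" "$(cat $d/inflight)"
done

A non-zero count on dm-5 with zeros on all the md* and sd* devices below
it would match what iostat shows.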

The file-system on /home is non-responsive at this point for any
non-cached data: an ls on some folders will work, whereas on others
(the far majority) the ls process goes into uninterruptible wait and
stays there.

/proc/mdstat looks perfectly healthy as far as I can tell.  No failed
drives or the like.

Capture of everything I thought to capture available at
http://downloads.uls.co.za/lockup/, specifically:

dmesg output
dmsetup -v info
dmsetup -v table
/proc/mdstat
ps axf
uptime
iostat -dmx 1 60

System is currently in "broken" state, and I can leave it there for
approximately the next four hours to gather additional information, at
which point I'll have to trigger a hard reset (system fails to shut down
cleanly).

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-02-22 10:58 disk-io lockup in 4.14.13 kernel Jaco Kroon
@ 2018-02-22 16:46 ` Bart Van Assche
  2018-02-23  9:58   ` Jaco Kroon
       [not found]   ` <257ceeb7-f466-d13d-8818-829759eda587@uls.co.za>
  0 siblings, 2 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-02-22 16:46 UTC (permalink / raw)
  To: Jaco Kroon, linux-block; +Cc: Pieter Kruger

On 02/22/18 02:58, Jaco Kroon wrote:
> We've been seeing sporadic IO lockups on recent kernels.

Are you using the legacy I/O stack or blk-mq? If you are not yet using 
blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is 
reproducible with blk-mq, can you share the output of the following command:

(cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)

Thanks,

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-02-22 16:46 ` Bart Van Assche
@ 2018-02-23  9:58   ` Jaco Kroon
  2018-02-23 16:03     ` Bart Van Assche
       [not found]   ` <257ceeb7-f466-d13d-8818-829759eda587@uls.co.za>
  1 sibling, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-02-23  9:58 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: Pieter Kruger

Hi Bart,

Thank you for your response.

On 22/02/2018 18:46, Bart Van Assche wrote:
> On 02/22/18 02:58, Jaco Kroon wrote:
>> We've been seeing sporadic IO lockups on recent kernels.
>
> Are you using the legacy I/O stack or blk-mq? If you are not yet using
> blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
> reproducible with blk-mq, can you share the output of the following
> command:
crowsnest ~ # zgrep MQ /proc/config.gz

CONFIG_SCSI_MQ_DEFAULT=y
# CONFIG_DM_MQ_DEFAULT is not set

... oi, so that's a very valid question.

blk-mq is thus off by default, I've now enabled it on the "live" system
with "echo 1 > /sys/module/dm_mod/parameters/use_blk_mq".

I've also modified the kernel config to set CONFIG_DM_MQ_DEFAULT (I know
I can just set this on cmdline too).
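
For reference, as I understand it scsi-mq has its own toggle, and both can
be forced on the kernel command line (the runtime switches only affect
devices set up after the change):

cat /sys/module/scsi_mod/parameters/use_blk_mq   # should be Y given CONFIG_SCSI_MQ_DEFAULT=y
# kernel command line equivalent:
#   scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1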

The only immediately visible effect is that I now seem to get >300MB/s
more consistently (and >400MB/s more frequently) off the array in terms
of read speed, where normally I'd expect a consistent 200MB/s with
spikes just over 300MB/s and very infrequently over 400MB/s.  This is a
very simple spot check with iotop over approximately a minute.

I am seeing I/O errors in dmesg from time to time; to me this hints that
the problem may be related to some error path.

Just so we're clear, we're seeing this happen approximately once a
month, so if switching on dm_mod.use_blk_mq solves it then I won't
really know beyond a shadow of a doubt, and the only way of "knowing" is
if we can get to an uptime of three months or so ...

>
> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
I don't have a /sys/kernel/debug folder - I've enabled CONFIG_DEBUG_FS
and BLK_DEBUG_FS, will reboot at the first opportunity.  As a general
rule - is there additional overhead to having debugfs enabled?  Any
other risks that I should be aware of?  In essence, are there any
disadvantages to just enabling DEBUG_FS as a general rule?  I did note
that a few extra DEBUG options pop up for other modules ... so my gut is
towards leaving this disabled as a general rule and enabling when needed.

>
> Thanks,
Thank you!

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-02-23  9:58   ` Jaco Kroon
@ 2018-02-23 16:03     ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-02-23 16:03 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Fri, 2018-02-23 at 11:58 +0200, Jaco Kroon wrote:
> On 22/02/2018 18:46, Bart Van Assche wrote:
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> 
> I don't have a /sys/kernel/debug folder - I've enabled CONFIG_DEBUG_FS
> and BLK_DEBUG_FS, will reboot at the first opportunity.  As a general
> rule - is there additional overhead to having debugfs enabled?  Any
> other risks that I should be aware of?  In essence, are there any
> disadvantages to just enabling DEBUG_FS as a general rule?  I did note
> that a few extra DEBUG options pop up for other modules ... so my gut is
> towards leaving this disabled as a general rule and enabling when needed.

Hello Jaco,

The only disadvantages of enabling debugfs that I know of are:
- The additional memory required by debugfs (probably not that much).
- A security risk if not all users who have an account on the system are fully
  trusted. See e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=681418.

Enabling debugfs doesn't cause any runtime overhead in the hot path of the
block layer if no software accesses the debugfs attributes.

Bart.


* Re: disk-io lockup in 4.14.13 kernel
       [not found]   ` <257ceeb7-f466-d13d-8818-829759eda587@uls.co.za>
@ 2018-03-11  3:08     ` Bart Van Assche
  2018-03-11  4:33       ` Jaco Kroon
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-03-11  3:08 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Sat, 2018-03-10 at 22:56 +0200, Jaco Kroon wrote:
> On 22/02/2018 18:46, Bart Van Assche wrote:
> > On 02/22/18 02:58, Jaco Kroon wrote:
> > > We've been seeing sporadic IO lockups on recent kernels.
> > 
> > Are you using the legacy I/O stack or blk-mq? If you are not yet using
> > blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
> > reproducible with blk-mq, can you share the output of the following
> > command:
> > 
> > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> 
> Looks like the lockups are far more frequent with everything on mq.
> Just to confirm:
> 
> CONFIG_SCSI_MQ_DEFAULT=y
> CONFIG_DM_MQ_DEFAULT=y
> 
> Please find attached the output from the requested.
> 
> http://downloads.uls.co.za/lockup/lockup-20180310-223036/ contains
> additional stuff, surrounding that.

Thanks, that helps. In block_debug.txt I see that only for /dev/sdm a
request got stuck:

$ grep 'busy=[^0]' block_debug.txt
./sdm/hctx0/tags:busy=9

But I can't see in the output that has been shared which I/O scheduler has
been configured nor which SCSI LLD is involved. Can you please also share
that information, e.g. by providing the output of the following commands:

cat /sys/block/sdm/queue/scheduler
find /sys -name sdm # provides the PCI ID
lspci

Thanks,

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-11  3:08     ` Bart Van Assche
@ 2018-03-11  4:33       ` Jaco Kroon
  2018-03-11  5:00         ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-03-11  4:33 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,

On 11/03/2018 05:08, Bart Van Assche wrote:
> On Sat, 2018-03-10 at 22:56 +0200, Jaco Kroon wrote:
>> On 22/02/2018 18:46, Bart Van Assche wrote:
>>> On 02/22/18 02:58, Jaco Kroon wrote:
>>>> We've been seeing sporadic IO lockups on recent kernels.
>>> Are you using the legacy I/O stack or blk-mq? If you are not yet using
>>> blk-mq, can you switch to blk-mq + scsi-mq + dm-mq? If the lockup is
>>> reproducible with blk-mq, can you share the output of the following
>>> command:
>>>
>>> (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
>> Looks like the lockups are far more frequent with everything on mq. 
>> Just to confirm:
>>
>> CONFIG_SCSI_MQ_DEFAULT=y
>> CONFIG_DM_MQ_DEFAULT=y
>>
>>
>> Please find attached the output from the requested.
>>
>> http://downloads.uls.co.za/lockup/lockup-20180310-223036/ contains
>> additional stuff, surrounding that.
> Thanks, that helps. In block_debug.txt I see that only for /dev/sdm a
> request got stuck:
>
> $ grep 'busy=[^0]' block_debug.txt  
> ./sdm/hctx0/tags:busy=9
>
> But I can't see in the output that has been shared which I/O scheduler has
> been configured nor which SCSI LLD is involved. Can you please also share
> that information, e.g. by providing the output of the following commands:
Had to reboot, but I trust the information should still be valid.  Could
be scheduler related now that you mention it.
> cat /sys/block/sdm/queue/scheduler
crowsnest ~ # cat /sys/block/sdm/queue/scheduler
[mq-deadline] kyber bfq none

We used to get better performance (on average) from deadline than the
others, but I no longer see the elevator and other options that used to
look familiar.  Having said that, I just cross-referenced with a few
other systems; they all use deadline too and I'm not seeing similar
behavior there.  The most notable difference is that those systems use
raid1 rather than raid6.  Could just be coincidence.
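
In case it helps to rule the scheduler out, switching per device at
runtime should be as simple as the following (untested on this box):

cat /sys/block/sdm/queue/scheduler            # bracketed entry is active
echo kyber > /sys/block/sdm/queue/scheduler   # or bfq/none; takes effect immediately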
> find /sys -name sdm # provides the PCI ID
crowsnest ~ # find /sys -name sdm
/sys/kernel/debug/block/sdm
/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/expander-0:0/port-0:0:0/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdm
/sys/class/block/sdm
/sys/block/sdm

> lspci
crowsnest ~ # lspci
00:00.0 Host bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7
DMI2 (rev 02)
00:01.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 1 (rev 02)
00:02.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 2 (rev 02)
00:02.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 2 (rev 02)
00:03.0 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:03.2 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:03.3 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI
Express Root Port 3 (rev 02)
00:04.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 0 (rev 02)
00:04.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 1 (rev 02)
00:04.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 2 (rev 02)
00:04.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 3 (rev 02)
00:04.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 4 (rev 02)
00:04.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 5 (rev 02)
00:04.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 6 (rev 02)
00:04.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DMA Channel 7 (rev 02)
00:05.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Address Map, VTd_Misc, System Management (rev 02)
00:05.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Hot Plug (rev 02)
00:05.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 RAS, Control Status and Global Errors (rev 02)
00:05.4 PIC: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 I/O APIC
(rev 02)
00:11.0 Unassigned class [ff00]: Intel Corporation C610/X99 series
chipset SPSR (rev 05)
00:11.4 SATA controller: Intel Corporation C610/X99 series chipset sSATA
Controller [AHCI mode] (rev 05)
00:14.0 USB controller: Intel Corporation C610/X99 series chipset USB
xHCI Host Controller (rev 05)
00:16.0 Communication controller: Intel Corporation C610/X99 series
chipset MEI Controller #1 (rev 05)
00:16.1 Communication controller: Intel Corporation C610/X99 series
chipset MEI Controller #2 (rev 05)
00:1a.0 USB controller: Intel Corporation C610/X99 series chipset USB
Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation C610/X99 series chipset PCI
Express Root Port #1 (rev d5)
00:1c.6 PCI bridge: Intel Corporation C610/X99 series chipset PCI
Express Root Port #7 (rev d5)
00:1d.0 USB controller: Intel Corporation C610/X99 series chipset USB
Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation C610/X99 series chipset LPC
Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation C610/X99 series chipset
6-Port SATA Controller [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation C610/X99 series chipset SMBus
Controller (rev 05)
00:1f.6 Signal processing controller: Intel Corporation C610/X99 series
chipset Thermal Subsystem (rev 05)
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic
SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
06:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
06:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
06:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
08:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge
(rev 03)
09:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
Graphics Family (rev 30)
ff:0b.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0b.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0b.2 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 R3 QPI Link 0 & 1 Monitoring (rev 02)
ff:0c.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0c.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0c.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0c.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0c.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0c.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Unicast Registers (rev 02)
ff:0f.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Buffered Ring Agent (rev 02)
ff:0f.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Buffered Ring Agent (rev 02)
ff:0f.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 System Address Decoder & Broadcast Registers (rev 02)
ff:0f.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 System Address Decoder & Broadcast Registers (rev 02)
ff:0f.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 System Address Decoder & Broadcast Registers (rev 02)
ff:10.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 PCIe Ring Interface (rev 02)
ff:10.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 PCIe Ring Interface (rev 02)
ff:10.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Scratchpad & Semaphore Registers (rev 02)
ff:10.6 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 Scratchpad & Semaphore Registers (rev 02)
ff:10.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Scratchpad & Semaphore Registers (rev 02)
ff:12.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Home Agent 0 (rev 02)
ff:12.1 Performance counters: Intel Corporation Xeon E7 v3/Xeon E5
v3/Core i7 Home Agent 0 (rev 02)
ff:13.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Target Address, Thermal & RAS
Registers (rev 02)
ff:13.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Target Address, Thermal & RAS
Registers (rev 02)
ff:13.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel Target Address Decoder (rev 02)
ff:13.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel Target Address Decoder (rev 02)
ff:13.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel Target Address Decoder (rev 02)
ff:13.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO Channel 0/1 Broadcast (rev 02)
ff:13.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO Global Broadcast (rev 02)
ff:14.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 0 Thermal Control (rev 02)
ff:14.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 1 Thermal Control (rev 02)
ff:14.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 0 ERROR Registers (rev 02)
ff:14.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 1 ERROR Registers (rev 02)
ff:14.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:14.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 0 & 1 (rev 02)
ff:15.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 2 Thermal Control (rev 02)
ff:15.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 3 Thermal Control (rev 02)
ff:15.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 2 ERROR Registers (rev 02)
ff:15.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 0 Channel 3 ERROR Registers (rev 02)
ff:16.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 1 Target Address, Thermal & RAS
Registers (rev 02)
ff:16.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO Channel 2/3 Broadcast (rev 02)
ff:16.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO Global Broadcast (rev 02)
ff:17.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Integrated Memory Controller 1 Channel 0 Thermal Control (rev 02)
ff:17.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.5 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.6 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:17.7 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 DDRIO (VMSE) 2 & 3 (rev 02)
ff:1e.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Power Control Unit (rev 02)
ff:1e.1 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Power Control Unit (rev 02)
ff:1e.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Power Control Unit (rev 02)
ff:1e.3 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Power Control Unit (rev 02)
ff:1e.4 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 Power Control Unit (rev 02)
ff:1f.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 VCU (rev 02)
ff:1f.2 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core
i7 VCU (rev 02)

And also hdparm -I /dev/sdm for what it's worth.


/dev/sdm:

ATA device, with non-removable media
        Model Number:       ST4000NM0033-9ZM170                    
        Serial Number:      Z1Z9B7HL
        Firmware Revision:  SN04   
        Transport:          Serial, SATA Rev 3.0
Standards:
        Supported: 9 8 7 6 5
        Likely used: 9
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:    16514064
        LBA    user addressable sectors:   268435455
        LBA48  user addressable sectors:  7814037168
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:     3815447 MBytes
        device size with M = 1000*1000:     4000787 MBytes (4000 GB)
        cache/buffer size  = unknown
        Form Factor: 3.5 inch
        Nominal Media Rotation Rate: 7200
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = ?
        Recommended acoustic management value: 254, current value: 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    IDLE_IMMEDIATE with UNLOAD
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
                unknown 119[6]
           *    unknown 119[7]
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Gen3 signaling speed (6.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    Idle-Unload when NCQ is active
           *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
                unknown 78[7]
           *    SMART Command Transport (SCT) feature set
           *    SCT Write Same (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[7]
                unknown 206[12] (vendor specific)
                unknown 206[14] (vendor specific)
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        464min for SECURITY ERASE UNIT. 464min for ENHANCED SECURITY
ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50086dc88c3
        NAA             : 5
        IEEE OUI        : 000c50
        Unique ID       : 086dc88c3
Checksum: correct

Please let me know if there is anything else I can help with.

Thank you.

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-11  4:33       ` Jaco Kroon
@ 2018-03-11  5:00         ` Bart Van Assche
  2018-03-13  9:30           ` Jaco Kroon
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-03-11  5:00 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Sun, 2018-03-11 at 06:33 +0200, Jaco Kroon wrote:
> crowsnest ~ # find /sys -name sdm
> /sys/kernel/debug/block/sdm
> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/expander-0:0/port-0:0:0/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdm
> /sys/class/block/sdm
> /sys/block/sdm
> 
> > lspci
> 
> crowsnest ~ # lspci
> [ ... ]
> 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic
> SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
> [ ... ]

Hi Jaco,

Recently a bug fix for the mq-deadline scheduler was posted but I don't
think that that patch will change the behavior on your setup since you are
not using ZBC disks. See also "mq-deadline: Make sure to always unlock
zones" (https://marc.info/?l=linux-block&m=151983933714492).

Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
want to contact the authors of this driver and Cc the linux-scsi mailing
list. Sorry but I'm not familiar with the mpt3sas driver myself.

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-11  5:00         ` Bart Van Assche
@ 2018-03-13  9:30           ` Jaco Kroon
  2018-03-13 14:10             ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-03-13  9:30 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,

On 11/03/2018 07:00, Bart Van Assche wrote:
> On Sun, 2018-03-11 at 06:33 +0200, Jaco Kroon wrote:
>> crowsnest ~ # find /sys -name sdm
>> /sys/kernel/debug/block/sdm
>> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/port-0:0/expander-0:0/port-0:0:0/expander-0:1/port-0:1:0/end_device-0:1:0/target0:0:13/0:0:13:0/block/sdm
>> /sys/class/block/sdm
>> /sys/block/sdm
>>
>>> lspci
>> crowsnest ~ # lspci
>> [ ... ]
>> 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic
>> SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
>> [ ... ]
> Hi Jaco,
>
> Recently a bug fix for the mq-deadline scheduler was posted but I don't
> think that that patch will change the behavior on your setup since you are
> not using ZBC disks. See also "mq-deadline: Make sure to always unlock
> zones" (https://marc.info/?l=linux-block&m=151983933714492).
From that link:

In case of a failed write request (all retries failed) and when using
libata, the SCSI error handler calls scsi_finish_command(). In the
case of blk-mq this means that scsi_mq_done() does not get called,
that blk_mq_complete_request() does not get called and also that the
mq-deadline .completed_request() method is not called. This results in
the target zone of the failed write request being left in a locked
state, preventing that any new write requests are issued to the same
zone.

Why do you say that this won't make a difference? To me it sounds like
it could very well be related. You're talking about "ZBC" disks - I'm
going to assume ZBC stands for Zoned Block Commands, and reading up on
it I get really confused.
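
For what it's worth, a quick way to confirm whether a drive is zoned at
all (assuming 4.14 exposes the attribute, which I believe it does):

cat /sys/block/sdm/queue/zoned    # "none" on a conventional (non-ZBC) drive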

Either way, the source version onto which the patch applies is not
4.14.13 code (the patch references line 756 and the source in 4.14.13
only has 679 lines of code).  I also can't find any kind of locking that
I can imagine causing a problem, unless there are problems inside
__dd_dispatch_request, blk_mq_sched_try_merge or dd_insert_request (none
of which contains any loops that I can see at a quick glance, at least
down to elv_merge; from there it gets more complicated).

>
> Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
> want to contact the authors of this driver and Cc the linux-scsi mailing
> list. Sorry but I'm not familiar with the mpt3sas driver myself.
You did see correctly, all drives are behind the MPT SAS.  If there is
in fact a problem with the driver (or the controller itself for that
matter) it would explain things.  It would also explain why we don't see
this problem on other hosts.

I'll contact them as well.

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-13  9:30           ` Jaco Kroon
@ 2018-03-13 14:10             ` Bart Van Assche
  2018-03-13 14:59               ` Jaco Kroon
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-03-13 14:10 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Tue, 2018-03-13 at 11:30 +0200, Jaco Kroon wrote:
> On 11/03/2018 07:00, Bart Van Assche wrote:
> > Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
> > want to contact the authors of this driver and Cc the linux-scsi mailing
> > list. Sorry but I'm not familiar with the mpt3sas driver myself.
> 
> You did see correctly, all drives are behind the MPT SAS.  If there is
> in fact a problem with the driver (or the controller itself for that
> matter) it would explain things.  It would also explain why we don't see
> this problem on other hosts.

You may want to have a look at the following report, something I ran into
myself yesterday: https://marc.info/?l=linux-scsi&m=152087904024146.

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-13 14:10             ` Bart Van Assche
@ 2018-03-13 14:59               ` Jaco Kroon
  2018-03-13 15:06                 ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-03-13 14:59 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,


On 13/03/2018 16:10, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 11:30 +0200, Jaco Kroon wrote:
>> On 11/03/2018 07:00, Bart Van Assche wrote:
>>> Did I see correctly that /dev/sdm is behind a MPT SAS controller? You may
>>> want to contact the authors of this driver and Cc the linux-scsi mailing
>>> list. Sorry but I'm not familiar with the mpt3sas driver myself.
>> You did see correctly, all drives are behind the MPT SAS.  If there is
>> in fact a problem with the driver (or the controller itself for that
>> matter) it would explain things.  It would also explain why we don't see
>> this problem on other hosts.
> You may want to have a look at the following report, something I ran into
> myself yesterday: https://marc.info/?l=linux-scsi&m=152087904024146.
I quickly checked my dmesg logs and I'm not seeing that particular
message, could be that newer kernels only started warning about it?

I'm not seeing any references to the
scsih_get_enclosure_logicalid_chassis_slot() function in the code here,
so there is obviously a newer driver in newer kernels.  The scsih_get_*
calls I do see are in mpt3sas_scsih.c, and I'm not seeing (direct) calls
to _config_request at all.  At least there is activity on the driver,
which is always a good sign that someone is attending to problems.
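
For reference, checking which driver version the running kernel carries
should be possible with something like (untested here):

cat /sys/module/mpt3sas/version      # works for built-in and modular driver
modinfo mpt3sas | grep -i version    # if built as a module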

I'll queue an upgrade to 4.16 in the meantime, for when it's released in
a couple of days/weeks.

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-13 14:59               ` Jaco Kroon
@ 2018-03-13 15:06                 ` Bart Van Assche
       [not found]                   ` <b9f2421d-6350-1136-86ea-bdb70a59b6d9@uls.co.za>
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-03-13 15:06 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Tue, 2018-03-13 at 16:59 +0200, Jaco Kroon wrote:
> I quickly checked my dmesg logs and I'm not seeing that particular
> message, could be that newer kernels only started warning about it?

Hello Jaco,

That message only appears if CONFIG_DEBUG_ATOMIC_SLEEP (sleep inside atomic)
is enabled in the kernel config. The kernel configuration options I enable to
test kernel code can be found below. Please note that some of these options
slow down the kernel significantly, so these options should probably not be
enabled on a production system.

CONFIG_BLK_DEBUG_FS
CONFIG_DEBUG_ATOMIC_SLEEP
CONFIG_DEBUG_BOOT_PARAMS
CONFIG_DEBUG_BUGVERBOSE
CONFIG_DEBUG_FS
CONFIG_DEBUG_INFO
CONFIG_DEBUG_INFO_DWARF4
CONFIG_DEBUG_INFO_REDUCED
CONFIG_DEBUG_KERNEL
CONFIG_DEBUG_KMEMLEAK
CONFIG_DEBUG_LIST
CONFIG_DEBUG_LOCK_ALLOC
CONFIG_DEBUG_MUTEXES
CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_RCU_HEAD
CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_PER_CPU_MAPS
CONFIG_DEBUG_PI_LIST
CONFIG_DEBUG_PREEMPT
CONFIG_DEBUG_PREEMPT_VOLUNTARY
CONFIG_DEBUG_SG
CONFIG_DEBUG_SPINLOCK
CONFIG_DEBUG_STACKOVERFLOW
CONFIG_DEBUG_STACK_USAGE
CONFIG_DETECT_HUNG_TASK
CONFIG_DYNAMIC_DEBUG
CONFIG_HARDLOCKUP_DETECTOR
CONFIG_KASAN
CONFIG_MAGIC_SYSRQ
CONFIG_PREEMPT
CONFIG_PROVE_LOCKING
CONFIG_PROVE_RCU
CONFIG_SCHED_DEBUG
CONFIG_SLUB
CONFIG_SLUB_DEBUG_ON
CONFIG_WQ_WATCHDOG

Bart.


* Re: disk-io lockup in 4.14.13 kernel
       [not found]                   ` <b9f2421d-6350-1136-86ea-bdb70a59b6d9@uls.co.za>
@ 2018-03-13 17:24                     ` Bart Van Assche
  2018-03-13 17:44                       ` Jaco Kroon
  2018-03-24 21:38                       ` Jaco Kroon
  0 siblings, 2 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-03-13 17:24 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Tue, 2018-03-13 at 19:16 +0200, Jaco Kroon wrote:
> The server in question is the destination of  numerous rsync/ssh cases
> (used primarily for backups) and is not intended as a real-time system.
> I'm happy to enable the options below that you would indicate would be
> helpful in pinpointing the problem (assuming we're not looking at a 8x
> more CPU required type of degrading as I've recently seen with asterisk
> lock debugging enabled). I've marked in bold below what I assume would
> be helpful.  If you don't mind confirming for me I'll enable and
> schedule a reboot.

Hello Jaco,

My recommendation is to wait until the mpt3sas maintainers post a fix
for what I reported yesterday on the linux-scsi mailing list. Enabling
CONFIG_DEBUG_ATOMIC_SLEEP has namely a very annoying consequence for the
mpt3sas driver: the first process that hits the "sleep in atomic context"
bug gets killed. I don't think that you want this kind of behavior on a
production setup.

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-13 17:24                     ` Bart Van Assche
@ 2018-03-13 17:44                       ` Jaco Kroon
  2018-03-24 21:38                       ` Jaco Kroon
  1 sibling, 0 replies; 17+ messages in thread
From: Jaco Kroon @ 2018-03-13 17:44 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,

On 13/03/2018 19:24, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 19:16 +0200, Jaco Kroon wrote:
>> The server in question is the destination of  numerous rsync/ssh cases
>> (used primarily for backups) and is not intended as a real-time system.
>> I'm happy to enable the options below that you would indicate would be
>> helpful in pinpointing the problem (assuming we're not looking at a 8x
>> more CPU required type of degrading as I've recently seen with asterisk
>> lock debugging enabled). I've marked in bold below what I assume would
>> be helpful.  If you don't mind confirming for me I'll enable and
>> schedule a reboot.
> Hello Jaco,
>
> My recommendation is to wait until the mpt3sas maintainers post a fix
> for what I reported yesterday on the linux-scsi mailing list. Enabling
> CONFIG_DEBUG_ATOMIC_SLEEP has namely a very annoying consequence for the
> mpt3sas driver: the first process that hits the "sleep in atomic context"
> bug gets killed. I don't think that you want this kind of behavior on a
> production setup.
>
Would you mind adding myself as CC to that thread please?

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-13 17:24                     ` Bart Van Assche
  2018-03-13 17:44                       ` Jaco Kroon
@ 2018-03-24 21:38                       ` Jaco Kroon
  2018-03-26 22:56                         ` Bart Van Assche
  1 sibling, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-03-24 21:38 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,

Does the following go with your theory:

[452545.945561] sysrq: SysRq : Show backtrace of all active CPUs
[452545.946182] NMI backtrace for cpu 5
[452545.946185] CPU: 5 PID: 31921 Comm: bash Tainted: G          I    
4.14.13-uls #2
[452545.946186] Hardware name: Supermicro
SSG-5048R-E1CR36L/X10SRH-CLN4F, BIOS T20140520103247 05/20/2014
[452545.946187] Call Trace:
[452545.946196]  dump_stack+0x46/0x5a
[452545.946200]  nmi_cpu_backtrace+0xb3/0xc0
[452545.946205]  ? irq_force_complete_move+0xd0/0xd0
[452545.946208]  nmi_trigger_cpumask_backtrace+0x8f/0xc0
[452545.946212]  __handle_sysrq+0xec/0x140
[452545.946216]  write_sysrq_trigger+0x26/0x30
[452545.946219]  proc_reg_write+0x38/0x60
[452545.946222]  __vfs_write+0x1e/0x130
[452545.946225]  vfs_write+0xab/0x190
[452545.946228]  SyS_write+0x3d/0xa0
[452545.946233]  entry_SYSCALL_64_fastpath+0x13/0x6c
[452545.946236] RIP: 0033:0x7f6b85db52d0
[452545.946238] RSP: 002b:00007fff6f9479e8 EFLAGS: 00000246
[452545.946241] Sending NMI from CPU 5 to CPUs 0-4:
[452545.946272] NMI backtrace for cpu 0 skipped: idling at pc
0xffffffff8162b0a0
[452545.946275] NMI backtrace for cpu 3 skipped: idling at pc
0xffffffff8162b0a0
[452545.946279] NMI backtrace for cpu 4 skipped: idling at pc
0xffffffff8162b0a0
[452545.946283] NMI backtrace for cpu 2 skipped: idling at pc
0xffffffff8162b0a0
[452545.946287] NMI backtrace for cpu 1 skipped: idling at pc
0xffffffff8162b0a0

I'm not sure how to link that address back to some function or
something, and had to reboot, so not sure if that can be done still.
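If it can still be done, I gather the usual approaches are along these
lines (untested; needs a matching vmlinux with debug info, or
/proc/kallsyms from the same running kernel with kptr_restrict=0):

awk '$1 <= "ffffffff8162b0a0"' /proc/kallsyms | tail -n1   # nearest symbol at or below the PC
addr2line -fe vmlinux 0xffffffff8162b0a0                   # function and file:line from vmlinux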

Kind Regards,
Jaco

On 13/03/2018 19:24, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 19:16 +0200, Jaco Kroon wrote:
>> The server in question is the destination of  numerous rsync/ssh cases
>> (used primarily for backups) and is not intended as a real-time system.
>> I'm happy to enable the options below that you would indicate would be
>> helpful in pinpointing the problem (assuming we're not looking at a 8x
>> more CPU required type of degrading as I've recently seen with asterisk
>> lock debugging enabled). I've marked in bold below what I assume would
>> be helpful.  If you don't mind confirming for me I'll enable and
>> schedule a reboot.
> Hello Jaco,
>
> My recommendation is to wait until the mpt3sas maintainers post a fix
> for what I reported yesterday on the linux-scsi mailing list. Enabling
> CONFIG_DEBUG_ATOMIC_SLEEP has namely a very annoying consequence for the
> mpt3sas driver: the first process that hits the "sleep in atomic context"
> bug gets killed. I don't think that you want this kind of behavior on a
> production setup.
>
> Bart.
>
>
>
>


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-24 21:38                       ` Jaco Kroon
@ 2018-03-26 22:56                         ` Bart Van Assche
  2018-03-27  8:59                           ` Jaco Kroon
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-03-26 22:56 UTC (permalink / raw)
  To: linux-block, jaco; +Cc: pieterk

On Sat, 2018-03-24 at 23:38 +0200, Jaco Kroon wrote:
> Does the following go with your theory:
> 
> [452545.945561] sysrq: SysRq : Show backtrace of all active CPUs
> [452545.946182] NMI backtrace for cpu 5
> [452545.946185] CPU: 5 PID: 31921 Comm: bash Tainted: G          I
> 4.14.13-uls #2
> [452545.946186] Hardware name: Supermicro
> SSG-5048R-E1CR36L/X10SRH-CLN4F, BIOS T20140520103247 05/20/2014
> [452545.946187] Call Trace:
> [452545.946196]  dump_stack+0x46/0x5a
> [452545.946200]  nmi_cpu_backtrace+0xb3/0xc0
> [452545.946205]  ? irq_force_complete_move+0xd0/0xd0
> [452545.946208]  nmi_trigger_cpumask_backtrace+0x8f/0xc0
> [452545.946212]  __handle_sysrq+0xec/0x140
> [452545.946216]  write_sysrq_trigger+0x26/0x30
> [452545.946219]  proc_reg_write+0x38/0x60
> [452545.946222]  __vfs_write+0x1e/0x130
> [452545.946225]  vfs_write+0xab/0x190
> [452545.946228]  SyS_write+0x3d/0xa0
> [452545.946233]  entry_SYSCALL_64_fastpath+0x13/0x6c
> [452545.946236] RIP: 0033:0x7f6b85db52d0
> [452545.946238] RSP: 002b:00007fff6f9479e8 EFLAGS: 00000246
> [452545.946241] Sending NMI from CPU 5 to CPUs 0-4:
> [452545.946272] NMI backtrace for cpu 0 skipped: idling at pc
> 0xffffffff8162b0a0
> [452545.946275] NMI backtrace for cpu 3 skipped: idling at pc
> 0xffffffff8162b0a0
> [452545.946279] NMI backtrace for cpu 4 skipped: idling at pc
> 0xffffffff8162b0a0
> [452545.946283] NMI backtrace for cpu 2 skipped: idling at pc
> 0xffffffff8162b0a0
> [452545.946287] NMI backtrace for cpu 1 skipped: idling at pc
> 0xffffffff8162b0a0
> 
> I'm not sure how to link that address back to some function or
> something, and had to reboot, so not sure if that can be done still.

Hello Jaco,

The above call trace means that SysRq-l was triggered, either via the keyboard
or through procfs. I don't think that there is any information in the above
that reveals the root cause of why a reboot was necessary.

What I do myself to identify the root cause of weird kernel behavior is to
rebuild the kernel with a bunch of debugging options enabled and that I try to
repeat the trigger that caused the weird behavior. If this causes the kernel
debugging code to produce additional output that output can be very helpful for
identifying what is going on. This approach does not always work however.

Bart.


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-26 22:56                         ` Bart Van Assche
@ 2018-03-27  8:59                           ` Jaco Kroon
  2018-03-27 15:36                             ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Jaco Kroon @ 2018-03-27  8:59 UTC (permalink / raw)
  To: Bart Van Assche, linux-block; +Cc: pieterk

Hi Bart,
> The above call trace means that SysRq-l was triggered, either via the keyboard
> or through procfs. I don't think that there is any information in the above
> that reveals the root cause of why a reboot was necessary.
I triggered it hoping to get a stack trace of the deadlocking process,
to find where the lock that ends up blocking is being taken, but then I
realized that you mentioned sleeping, which may mean there is no stack
trace because no process is actually running?  (I'm not that intimately
familiar with kernel internals; I mostly do user-space development, so
the kernel is an interesting environment for me, but not really part of
my day-to-day activities.)
>
> What I do myself to identify the root cause of weird kernel behavior is to
> rebuild the kernel with a bunch of debugging options enabled and that I try to
> repeat the trigger that caused the weird behavior. If this causes the kernel
> debugging code to produce additional output that output can be very helpful for
> identifying what is going on. This approach does not always work however.
Yea, I find debugging is often more an art than a science.  I saw a
patch for the mpt3sas issue but the email was munged and I wasn't able
to apply the patch from the email - I've emailed w.r.t. that but I've
not received a response.  Have you managed to get anything more sensible
from them?

Kind Regards,
Jaco


* Re: disk-io lockup in 4.14.13 kernel
  2018-03-27  8:59                           ` Jaco Kroon
@ 2018-03-27 15:36                             ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-03-27 15:36 UTC (permalink / raw)
  To: Jaco Kroon, linux-block; +Cc: pieterk

On 03/27/18 01:59, Jaco Kroon wrote:
> I triggered it hoping to get a stack trace of the process which is
> deadlocking finding where the lock is being taken that ends up blocking,
> but I then realized that you mentioned sleeping, which may end up not
> having a stack trace because there is no process actually running?  (I'm
> not that intimately familiar with kernel internals, mostly doing
> user-space development, so kernel is an interesting environment for me
> only, but not really part of my day-to-day activities.

Usually I proceed as follows to obtain more information about a lockup:
* Trigger SysRq-w (shows tasks that are in uninterruptible wait state).
* If the output that appears does not provide enough information, 
trigger SysRq-t (shows all tasks).
* If it seems like a block layer request got stuck, also analyze the 
output of the following command:

find /sys/kernel/debug/block -type f |
grep -vE \
'/(poll_stat|write_hints|completed|merged|dispatched|run|queued)' | 
xargs grep -a .

> I saw a
> patch for the mpt3sas issue but the email was munged and I wasn't able
> to apply the patch from the email.

That's weird. Saving that patch ("Save as mbox") and applying it with 
"git am" works fine for me. See also 
https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg72175.html.

Because I was too busy during the past two weeks I have not yet had the
time to have a close look at that patch. But I hope to find some time 
soon to have a closer look.

Bart.


end of thread

Thread overview: 17+ messages
-- links below jump to the message on this page --
2018-02-22 10:58 disk-io lockup in 4.14.13 kernel Jaco Kroon
2018-02-22 16:46 ` Bart Van Assche
2018-02-23  9:58   ` Jaco Kroon
2018-02-23 16:03     ` Bart Van Assche
     [not found]   ` <257ceeb7-f466-d13d-8818-829759eda587@uls.co.za>
2018-03-11  3:08     ` Bart Van Assche
2018-03-11  4:33       ` Jaco Kroon
2018-03-11  5:00         ` Bart Van Assche
2018-03-13  9:30           ` Jaco Kroon
2018-03-13 14:10             ` Bart Van Assche
2018-03-13 14:59               ` Jaco Kroon
2018-03-13 15:06                 ` Bart Van Assche
     [not found]                   ` <b9f2421d-6350-1136-86ea-bdb70a59b6d9@uls.co.za>
2018-03-13 17:24                     ` Bart Van Assche
2018-03-13 17:44                       ` Jaco Kroon
2018-03-24 21:38                       ` Jaco Kroon
2018-03-26 22:56                         ` Bart Van Assche
2018-03-27  8:59                           ` Jaco Kroon
2018-03-27 15:36                             ` Bart Van Assche
