linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] block: ratelimite pr_err on IO path
@ 2018-04-12  9:21 Jack Wang
  2018-04-12 21:20 ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Jack Wang @ 2018-04-12  9:21 UTC
  To: axboe; +Cc: linux-block, linux-kernel, elliott, Jack Wang

From: Jack Wang <jinpu.wang@profitbricks.com>

This avoids the soft lockup below:
[ 2328.328429] Call Trace:
[ 2328.328433]  vprintk_emit+0x229/0x2e0
[ 2328.328436]  ? t10_pi_type3_verify_ip+0x20/0x20
[ 2328.328437]  printk+0x52/0x6e
[ 2328.328439]  t10_pi_verify+0x9e/0xf0
[ 2328.328441]  bio_integrity_process+0x12e/0x220
[ 2328.328442]  ? t10_pi_type1_verify_crc+0x20/0x20
[ 2328.328443]  bio_integrity_verify_fn+0xde/0x140
[ 2328.328447]  process_one_work+0x13f/0x370
[ 2328.328449]  worker_thread+0x62/0x3d0
[ 2328.328450]  ? rescuer_thread+0x2f0/0x2f0
[ 2328.328452]  kthread+0x116/0x150
[ 2328.328454]  ? __kthread_parkme+0x70/0x70
[ 2328.328457]  ret_from_fork+0x35/0x40

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
v2: keep the message on a single line, as Robert and the coding style suggest

 block/t10-pi.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..6faf8c1 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -84,10 +84,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 
 			if (be32_to_cpu(pi->ref_tag) !=
 			    lower_32_bits(iter->seed)) {
-				pr_err("%s: ref tag error at location %llu " \
-				       "(rcvd %u)\n", iter->disk_name,
-				       (unsigned long long)
-				       iter->seed, be32_to_cpu(pi->ref_tag));
+				pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
+						   iter->disk_name,
+						   (unsigned long long)
+						   iter->seed,
+						   be32_to_cpu(pi->ref_tag));
 				return BLK_STS_PROTECTION;
 			}
 			break;
@@ -101,10 +102,11 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter,
 		csum = fn(iter->data_buf, iter->interval);
 
 		if (pi->guard_tag != csum) {
-			pr_err("%s: guard tag error at sector %llu " \
-			       "(rcvd %04x, want %04x)\n", iter->disk_name,
-			       (unsigned long long)iter->seed,
-			       be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
+			pr_err_ratelimited("%s: guard tag error at sector %llu (rcvd %04x, want %04x)\n",
+					   iter->disk_name,
+					   (unsigned long long)iter->seed,
+					   be16_to_cpu(pi->guard_tag),
+					   be16_to_cpu(csum));
 			return BLK_STS_PROTECTION;
 		}
 
-- 
2.7.4
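
For readers unfamiliar with the mechanism the patch relies on: pr_err_ratelimited() keeps a per-call-site state and, with the default settings as far as I know, lets through at most 10 messages per 5-second interval (DEFAULT_RATELIMIT_BURST and DEFAULT_RATELIMIT_INTERVAL), suppressing the rest. The following is only a minimal userspace sketch of that interval/burst idea. The struct and function names are hypothetical and this is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; not the kernel's struct ratelimit_state. */
struct ratelimit {
	uint64_t interval;    /* window length in ticks */
	unsigned int burst;   /* messages allowed per window */
	uint64_t begin;       /* tick at which the current window started */
	unsigned int printed; /* messages let through in this window */
	unsigned int missed;  /* messages suppressed in this window */
	bool started;         /* false until the first call */
};

/* Returns true if a message may be emitted at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	assert(rl);

	/* Open a fresh window on first use or once the old one has expired. */
	if (!rl->started || now - rl->begin >= rl->interval) {
		rl->started = true;
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	/* Over budget for this window: the message would be dropped. */
	rl->missed++;
	return false;
}
```

With interval = 5 and burst = 10, a storm of 100 calls at the same tick lets 10 through and drops 90. Those dropped messages are the lost records that rate limiting trades for avoiding the lockup.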


* Re: [PATCH v2] block: ratelimite pr_err on IO path
  2018-04-12  9:21 [PATCH v2] block: ratelimite pr_err on IO path Jack Wang
@ 2018-04-12 21:20 ` Martin K. Petersen
  2018-04-13  8:37   ` Jinpu Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2018-04-12 21:20 UTC
  To: Jack Wang; +Cc: axboe, linux-block, linux-kernel, elliott


Jack,

> +				pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",

I'm a bit concerned about dropping records of potential data loss.

Also, what are you doing that compels all these to be logged? This
should be a very rare occurrence.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v2] block: ratelimite pr_err on IO path
  2018-04-12 21:20 ` Martin K. Petersen
@ 2018-04-13  8:37   ` Jinpu Wang
  2018-04-13 16:59     ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Jinpu Wang @ 2018-04-13  8:37 UTC
  To: Martin K. Petersen
  Cc: Jens Axboe, linux-block, linux-kernel, Elliott,
	Robert (Persistent Memory)

On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> Jack,
>
>> +                             pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
>
> I'm a bit concerned about dropping records of potential data loss.
>
> Also, what are you doing that compels all these to be logged? This
> should be a very rare occurrence.
>
> --
> Martin K. Petersen      Oracle Linux Engineering

Hi Martin,

Thanks for asking. We updated the mpt3sas driver, which enables DIX
support (prot_mask=0x7f). All disks are SATA SSDs without DIF support.
After a reboot, the kernel reports IO errors from all the drives behind
the HBA, seemingly for almost every read IO, which makes the system
unusable:
[   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[   13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[   13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[   13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[   13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[   13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[   13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[   13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[   13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[   13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[   13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[   13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)
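
To make the log above easier to read, here is a minimal sketch of the comparison that produces it, modeled on the check visible in the patch (be32_to_cpu(pi->ref_tag) != lower_32_bits(iter->seed)). The helper names are hypothetical, and the byte-order conversion is left out on the assumption that the tag is already in CPU order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit value, in the spirit of lower_32_bits(). */
static uint32_t low32(uint64_t v)
{
	return (uint32_t)(v & 0xffffffffu);
}

/*
 * Type 1 style check: the received reference tag must equal the low
 * 32 bits of the expected location (the "seed" in the patch).
 */
static bool ref_tag_ok(uint32_t rcvd_ref_tag, uint64_t seed)
{
	return rcvd_ref_tag == low32(seed);
}
```

For example, ref_tag_ok(143196159, 937702912) is false, which matches the "ref tag error at location 937702912 (rcvd 143196159)" lines. In hex the two received values are 0x0888ffff and 0x1088ffff, and they recur unchanged across different locations.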

This happens with kernel versions 4.15 and 4.14.28. I scanned the
upstream commits but haven't found anything relevant.
With 4.4.112 there are no such errors.
Disabling DIX support (prot_mask=0x7) in mpt3sas fixes the problem.
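
The difference between the two prot_mask values can be read off bit by bit. The sketch below assumes the bit layout of enum scsi_host_prot_capabilities in include/scsi/scsi_host.h (DIF types 1 to 3 in bits 0 to 2, DIX types 0 to 3 in bits 3 to 6). The shortened names are hypothetical and used only for illustration.

```c
#include <stdint.h>

/* Bit layout assumed to mirror enum scsi_host_prot_capabilities. */
enum prot_cap {
	DIF_TYPE1 = 1 << 0,
	DIF_TYPE2 = 1 << 1,
	DIF_TYPE3 = 1 << 2,
	DIX_TYPE0 = 1 << 3,
	DIX_TYPE1 = 1 << 4,
	DIX_TYPE2 = 1 << 5,
	DIX_TYPE3 = 1 << 6,
};

#define DIF_MASK (DIF_TYPE1 | DIF_TYPE2 | DIF_TYPE3)
#define DIX_MASK (DIX_TYPE0 | DIX_TYPE1 | DIX_TYPE2 | DIX_TYPE3)

/* DIF capabilities requested by a prot_mask value. */
static unsigned int dif_bits(unsigned int prot_mask)
{
	return prot_mask & DIF_MASK;
}

/* DIX capabilities requested by a prot_mask value. */
static unsigned int dix_bits(unsigned int prot_mask)
{
	return prot_mask & DIX_MASK;
}
```

Under that assumption, 0x7f requests every DIF and DIX type, while 0x7 keeps the three DIF types and clears all DIX bits.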

Regards,
-- 
Jack Wang
Linux Kernel Developer            ProfitBricks GmbH


* Re: [PATCH v2] block: ratelimite pr_err on IO path
  2018-04-13  8:37   ` Jinpu Wang
@ 2018-04-13 16:59     ` Martin K. Petersen
  2018-04-16  8:16       ` Jinpu Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2018-04-13 16:59 UTC
  To: Jinpu Wang
  Cc: Martin K. Petersen, Jens Axboe, linux-block, linux-kernel,
	Elliott, Robert (Persistent Memory),
	Sreekanth Reddy, Suganath Prabu Subramani, Chaitra P B,
	linux-scsi


Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking, we updated mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, kernel reports the IO errors from all the drives behind
> HBA, seems for almost every read IO, which turns the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

        sg_readcap -l /dev/sda
        sg_inq -x /dev/sda
        sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v2] block: ratelimite pr_err on IO path
  2018-04-13 16:59     ` Martin K. Petersen
@ 2018-04-16  8:16       ` Jinpu Wang
  2018-04-16  9:06         ` Sreekanth Reddy
  0 siblings, 1 reply; 6+ messages in thread
From: Jinpu Wang @ 2018-04-16  8:16 UTC
  To: Martin K. Petersen
  Cc: Jens Axboe, linux-block, linux-kernel, Elliott,
	Robert (Persistent Memory),
	Sreekanth Reddy, Suganath Prabu Subramani, Chaitra P B,
	linux-scsi

On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
> Jinpu,
>
> [CC:ed the mpt3sas maintainers]
>
> The ratelimit patch is just an attempt to treat the symptom, not the
> cause.
Agreed. It would be great if we could fix the root cause.
>
>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>> After reboot, kernel reports the IO errors from all the drives behind
>> HBA, seems for almost every read IO, which turns the system unusable:
>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>
> That sounds like a bug in the mpt3sas driver or firmware. I guess the
> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
> the PI on the drive side. But that doesn't seem to be a particularly
> useful mode of operation.
>
> Jinpu: Which firmware are you running? Also, please send us the output
> of:
>
>         sg_readcap -l /dev/sda
>         sg_inq -x /dev/sda
>         sg_vpd /dev/sda
>
Disks are INTEL SSDSC2BX48, directly attached to HBA.
LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
mpt3sas_cm2: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)

jwang@x:~$ sudo sg_vpd /dev/sdz
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Unit serial number [sn]
  Device identification [di]
  Mode page policy [mpp]
  ATA information (SAT) [ai]
  Block limits (SBC) [bl]
  Block device characteristics (SBC) [bdc]
  Logical block provisioning (SBC) [lbpv]
jwang@x:~$ sudo sg_inq -x /dev/sdz
VPD INQUIRY: extended INQUIRY data page
    inquiry: field in cdb illegal (page not supported)
jwang@x:~$ sudo sg_readcap -l /dev/sdz
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last logical block address=937703087 (0x37e436af), Number of logical blocks=937703088
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned logical block address=0
Hence:
   Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB
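
The sizes reported above follow directly from the READ CAPACITY fields. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Total capacity: the block count is last LBA + 1. */
static uint64_t device_bytes(uint64_t last_lba, uint32_t block_len)
{
	return (last_lba + 1) * (uint64_t)block_len;
}

/* Physical block size: logical block length times 2^exponent. */
static uint32_t physical_block_len(uint32_t block_len, unsigned int exponent)
{
	return block_len << exponent;
}
```

device_bytes(937703087, 512) gives 480103981056 and physical_block_len(512, 3) gives 4096, matching the output. prot_en=0 in the same output shows that the drive reports protection as not enabled.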


> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
> controller?
>
> --
> Martin K. Petersen      Oracle Linux Engineering


Thanks!

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


* Re: [PATCH v2] block: ratelimite pr_err on IO path
  2018-04-16  8:16       ` Jinpu Wang
@ 2018-04-16  9:06         ` Sreekanth Reddy
  0 siblings, 0 replies; 6+ messages in thread
From: Sreekanth Reddy @ 2018-04-16  9:06 UTC
  To: Jinpu Wang
  Cc: Martin K. Petersen, Jens Axboe, linux-block, linux-kernel,
	Elliott, Robert (Persistent Memory),
	Suganath Prabu Subramani, Chaitra P B, linux-scsi

On Mon, Apr 16, 2018 at 1:46 PM, Jinpu Wang <jinpu.wang@profitbricks.com> wrote:
> On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
>>
>> Jinpu,
>>
>> [CC:ed the mpt3sas maintainers]
>>
>> The ratelimit patch is just an attempt to treat the symptom, not the
>> cause.
> Agree. If we can fix the root cause, it will be great.
>>
>>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>>> After reboot, kernel reports the IO errors from all the drives behind
>>> HBA, seems for almost every read IO, which turns the system unusable:
>>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>>
>> That sounds like a bug in the mpt3sas driver or firmware. I guess the
>> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
>> the PI on the drive side. But that doesn't seem to be a particularly
>> useful mode of operation.
>>
>> Jinpu: Which firmware are you running? Also, please send us the output
>> of:
>>
>>         sg_readcap -l /dev/sda
>>         sg_inq -x /dev/sda
>>         sg_vpd /dev/sda
>>
> Disks are INTEL SSDSC2BX48, directly attached to HBA.
> LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
> mpt3sas_cm2: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
>
> jwang@x:~$ sudo sg_vpd /dev/sdz
> Supported VPD pages VPD page:
>   Supported VPD pages [sv]
>   Unit serial number [sn]
>   Device identification [di]
>   Mode page policy [mpp]
>   ATA information (SAT) [ai]
>   Block limits (SBC) [bl]
>   Block device characteristics (SBC) [bdc]
>   Logical block provisioning (SBC) [lbpv]
> jwang@x:~$ sudo sg_inq -x /dev/sdz
> VPD INQUIRY: extended INQUIRY data page
>     inquiry: field in cdb illegal (page not supported)
> jwang@x:~$ sudo sg_readcap -l /dev/sdz
> Read Capacity results:
>    Protection: prot_en=0, p_type=0, p_i_exponent=0
>    Logical block provisioning: lbpme=1, lbprz=1
>    Last logical block address=937703087 (0x37e436af), Number of logical blocks=937703088
>    Logical block length=512 bytes
>    Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
>    Lowest aligned logical block address=0
> Hence:
>    Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB
>
>
>> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
>> controller?

[Sreekanth] The current upstream mpt3sas driver doesn't have DIX
support capabilities; it supports only the DIF feature.


Thanks,
Sreekanth
>>
>> --
>> Martin K. Petersen      Oracle Linux Engineering
>
>
> Thanks!
>
> --
> Jack Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin


end of thread

Thread overview: 6+ messages
2018-04-12  9:21 [PATCH v2] block: ratelimite pr_err on IO path Jack Wang
2018-04-12 21:20 ` Martin K. Petersen
2018-04-13  8:37   ` Jinpu Wang
2018-04-13 16:59     ` Martin K. Petersen
2018-04-16  8:16       ` Jinpu Wang
2018-04-16  9:06         ` Sreekanth Reddy
