* Strange Uncorrectable Section Count Produced By Fio
@ 2017-06-29 17:11 Forrest, Jon
       [not found] ` <CY4PR10MB1477786895ADFF04303497739DD20@CY4PR10MB1477.namprd10.prod.outlook.com>
  0 siblings, 1 reply; 5+ messages in thread
From: Forrest, Jon @ 2017-06-29 17:11 UTC (permalink / raw)
  To: fio

(Oracle X4-2L running CentOS 7.3 with 512GB of RAM and 3 Oracle F80
800GB PCIe Flash Accelerators)

Running the fio job shown below seems to generate the expected
i/o load, but it also caused the following output on the console:

WARNING: Your hard drive is failing
Device: /dev/sdb [SAT], 37009733189632 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdc [SAT], 54275501719552 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdd [SAT], 71987946848256 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sde [SAT], 93179315486720 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdf [SAT], 101245264068608 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdg [SAT], 113082193936384 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdh [SAT], 127135326928896 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdi [SAT], 141721035866112 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdaj [SAT], 162611756793856 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdak [SAT], 178245437751296 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdal [SAT], 189824669581312 Offline uncorrectable sectors
WARNING: Your hard drive is failing
Device: /dev/sdam [SAT], 203104708460544 Offline uncorrectable sectors

These are the 12 drives the fio job is accessing. The system seems
to be running fine with no crashes or hangs.

The number of uncorrectable sectors seems bogus. Also, these messages
showed up soon after the job started. It seems unlikely that
e.g. 203104708460544 sectors would have been accessed during this time.
Also, according to an 'iostat' command, the job is still running so the
drives are presumably still not offline. Finally, the 'ddcli' program
that lets me talk to the flash card says:

Bytes Read                            89961186304
Soft Read Error Rate                  3.657696e-03
Wear Range Delta                      0          (%)
Uncorrectable RAISE Errors            0
Current Temperature                   46         (degree C)
Uncorrectable ECC Errors              0
SATA R-Errors (CRC) Error Count       0

which looks normal.
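
As a rough plausibility check on the reported counts, here is a minimal
Python sketch. It assumes 512-byte logical sectors (not stated above) and
takes the ddcli 'Bytes Read' figure at face value; the names SECTOR_BYTES,
implied_bytes and capacity_ratio are made up for illustration.

```python
# Rough plausibility check for the reported "uncorrectable sector" counts.
SECTOR_BYTES = 512                  # assumption: 512-byte logical sectors
CARD_BYTES = 800 * 10**9            # one 800GB F80 card, per the system description
DDCLI_BYTES_READ = 89961186304      # "Bytes Read" from the ddcli output
LARGEST_REPORTED = 203104708460544  # count reported for /dev/sdam


def implied_bytes(sectors, sector_bytes=SECTOR_BYTES):
    """Bytes that `sectors` would cover at the assumed sector size."""
    return sectors * sector_bytes


def capacity_ratio(sectors, capacity_bytes=CARD_BYTES):
    """How many times over the card capacity the sector count would reach."""
    return implied_bytes(sectors) / capacity_bytes


if __name__ == "__main__":
    print("implied bytes:      ", implied_bytes(LARGEST_REPORTED))
    print("times card capacity:", round(capacity_ratio(LARGEST_REPORTED)))
    print("sectors read so far:", DDCLI_BYTES_READ // SECTOR_BYTES)
```

Under those assumptions the largest count implies roughly 1.04e17 bytes,
more than 100,000 times the capacity of a single card, and far more
sectors than the ddcli figure says have been read.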

Are the uncorrectable sector reports something I should worry about?
Is this something that 'fio' can tickle?

The 'fio' job is:
[global]
bs=8k
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=24h
filesize=6G

[job1]
rw=randread
filename=/dev/sdaj:/dev/sdak:/dev/sdal:/dev/sdam:/dev/sdb:/dev/sdc:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read

Cordially,

-- 
Jon Forrest
Dolby Laboratories, Inc.


* Re: Strange Uncorrectable Section Count Produced By Fio
       [not found] ` <CY4PR10MB1477786895ADFF04303497739DD20@CY4PR10MB1477.namprd10.prod.outlook.com>
@ 2017-06-29 17:39   ` Forrest, Jon
  2017-06-30  2:45     ` Sitsofe Wheeler
  0 siblings, 1 reply; 5+ messages in thread
From: Forrest, Jon @ 2017-06-29 17:39 UTC (permalink / raw)
  To: fio



On 6/29/2017 10:21 AM, Todd Lawall wrote:
> 
> Correct me if I'm wrong everybody, but those look suspiciously like
> errors coming from the kernel/driver rather than FIO.

I agree but see below.

> John, you could confirm this by looking for these messages in the output of the
> 'dmesg' command.   If they're from the driver, then there's further
> questions for the authors of that code that you'd need to ask.

Much to my surprise, 'dmesg' doesn't include those messages.

Also, I received 12 email messages sent to 'root' from 'root' on that
system with the same error messages, one message per disk.

I hadn't thought these messages came from 'fio' directly, but
I don't know where they come from. Whatever is producing them
seems to be confused because the number of errors is incorrect.

Jon Forrest
Dolby Laboratories, Inc.


* Re: Strange Uncorrectable Section Count Produced By Fio
  2017-06-29 17:39   ` Forrest, Jon
@ 2017-06-30  2:45     ` Sitsofe Wheeler
  2017-06-30 16:42       ` Forrest, Jon
  0 siblings, 1 reply; 5+ messages in thread
From: Sitsofe Wheeler @ 2017-06-30  2:45 UTC (permalink / raw)
  To: Forrest, Jon; +Cc: fio

On 29 June 2017 at 18:39, Forrest, Jon <nobozo@gmail.com> wrote:
>
> Running the fio job shown below seems to generate the expected
> i/o load, but it also caused the following output on the console:
>
> [...]
>
> WARNING: Your hard drive is failing
> Device: /dev/sdb [SAT], 37009733189632 Offline uncorrectable sectors
> WARNING: Your hard drive is failing
> Device: /dev/sdc [SAT], 54275501719552 Offline uncorrectable sectors
> WARNING: Your hard drive is failing
> Device: /dev/sdd [SAT], 71987946848256 Offline uncorrectable sectors
>
> On 6/29/2017 10:21 AM, Todd Lawall wrote:
>>
>> Correct me if I'm wrong everybody, but those look suspiciously like
>> errors coming from the kernel/driver rather than FIO.
>
> I agree but see below.
>
>> John, you could confirm this by looking for these messages in the output of the
>> 'dmesg' command.   If they're from the driver, then there's further
>> questions for the authors of that code that you'd need to ask.
>
> Much to my surprise, 'dmesg' doesn't include those messages.
>
> Also, I received 12 email messages sent to 'root' from 'root' on that
> system with the same error messages, one message per disk.
>
> I hadn't thought these messages came from 'fio' directly, but
> I don't know where they come from. Whatever is producing them
> seems to be confused because the number of errors is incorrect.

That output looks like it comes from SMART monitoring of your disks
(i.e. you have a script regularly monitoring for changes in smartctl
output and sending the results to logwatch or emailing you). Further,
it looks like in your case smartctl doesn't know how to correctly
interpret the "Offline uncorrectable sectors" value - perhaps because
your disks are "new" you need a more recent smartctl?
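
To look at the raw attribute value directly rather than through the
emailed warning, something like the following Python sketch could pull it
out of the attribute table that 'smartctl -A' prints. The sample text is
hypothetical (it is not output from these cards, and the attribute name
may differ on them), 198 is the attribute ID conventionally used for
offline uncorrectable sectors, and raw_value is a made-up helper name.

```python
# Hypothetical sample of a 'smartctl -A' attribute table (not real output
# from the devices in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       37009733189632
"""


def raw_value(attr_table, attr_id):
    """Return the raw value of attribute `attr_id` as an int, or None.

    Assumes the ten-column table layout shown in SAMPLE and a raw value
    whose first token is a plain decimal number.
    """
    for line in attr_table.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


if __name__ == "__main__":
    print(raw_value(SAMPLE, 198))
```

Comparing that raw number across a few runs would show whether it really
changes or was only reported once.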

However, I'd be somewhat worried that the "Offline uncorrectable
sectors" are changing at all. It is one of the few values that has a
strong predictive power of a spinning disk being faulty (see
https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
) but since you're on an SSD I don't know if its predictive power is
as good.

-- 
Sitsofe | http://sucs.org/~sits/


* Re: Strange Uncorrectable Section Count Produced By Fio
  2017-06-30  2:45     ` Sitsofe Wheeler
@ 2017-06-30 16:42       ` Forrest, Jon
  2017-07-01  2:41         ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 5+ messages in thread
From: Forrest, Jon @ 2017-06-30 16:42 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: fio

On 6/29/2017 7:45 PM, Sitsofe Wheeler wrote:

First of all, thanks for following up on this.

> That output looks like it comes from SMART monitoring of your disks
> (i.e. you have a script regularly monitoring for changes in smartctl
> output and sending the results to logwatch or emailing you). Further,
> it looks like in your case smartctl doesn't know how to correctly
> interpret the "Offline uncorrectable sectors" value - perhaps because
> your disks are "new" you need a more recent smartctl?

This is a newly installed CentOS 7.3 system so everything is in the
state that RedHat supplies.

> However, I'd be somewhat worried that the "Offline uncorrectable
> sectors" are changing at all. It is one of the few values that has a
> strong predictive power of a spinning disk being faulty (see
> https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
> ) but since you're on an SSD I don't know if its predictive power is
> as good.

If the uncorrectable error count were real, then I'd agree with
you 100%. However, since those values appear only once during the
early stages of a 24 hour fio job and don't appear again later with
higher numbers of errors, and because the ddcli interface to the flash
card shows no errors at all, I'm inclined to ignore them. The fio job
finished fine.

Jon Forrest
Dolby Labs




* RE: Strange Uncorrectable Section Count Produced By Fio
  2017-06-30 16:42       ` Forrest, Jon
@ 2017-07-01  2:41         ` Elliott, Robert (Persistent Memory)
  0 siblings, 0 replies; 5+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-07-01  2:41 UTC (permalink / raw)
  To: Forrest, Jon, fio, Sitsofe Wheeler

> > However, I'd be somewhat worried that the "Offline uncorrectable
> > sectors" are changing at all. It is one of the few values that has a
> > strong predictive power of a spinning disk being faulty (see
> > https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
> > ) but since you're on an SSD I don't know if its predictive power is
> > as good.
> 
> If the uncorrectable error count were real, then I'd agree with
> you 100%. However, since those values appear only once during the
> early stages of a 24 hour fio job and don't appear again later with
> higher numbers of errors, and because the ddcli interface to the flash
> card shows no errors at all, I'm inclined to ignore them. The fio job
> finished fine.

and
> > Device: /dev/sdb [SAT], 37009733189632 Offline uncorrectable sectors
> > WARNING: Your hard drive is failing
> > Device: /dev/sdc [SAT], 54275501719552 Offline uncorrectable sectors
> > WARNING: Your hard drive is failing
> > Device: /dev/sdd [SAT], 71987946848256 Offline uncorrectable sectors

In hex, those are:
21A9_0000_0000
315D_0000_0000
4179_0000_0000

so they're clearly not real count values.  Some program might be
parsing fields incorrectly (e.g., off by 4 bytes).
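
The conversion above can be reproduced with a short Python sketch. It
assumes the raw value is a 48-bit field, which matches the 12 hex digits
shown but is not confirmed for these devices; split_raw48 and grouped_hex
are illustrative names.

```python
# Counts quoted above for /dev/sdb, /dev/sdc and /dev/sdd.
REPORTED = [37009733189632, 54275501719552, 71987946848256]


def split_raw48(value):
    """Split a 48-bit value into (high 16 bits, low 32 bits)."""
    return (value >> 32) & 0xFFFF, value & 0xFFFFFFFF


def grouped_hex(value):
    """Format a 48-bit value as three underscore-separated hex groups."""
    digits = f"{value:012X}"
    return "_".join(digits[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    for count in REPORTED:
        high, low = split_raw48(count)
        print(grouped_hex(count), "high16 =", high, "low32 =", low)
```

Every one of these values has all of its low 32 bits clear, which is what
a real counter misread at the wrong byte offset would look like, and not
what a genuine small error count would look like.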


---
Robert Elliott, HPE Persistent Memory



