* Re: Disk Monitoring
@ 2017-06-28 13:19 Wolfgang Denk
  2017-06-29  9:52 ` Gandalf Corvotempesta
  0 siblings, 1 reply; 23+ messages in thread
From: Wolfgang Denk @ 2017-06-28 13:19 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2945 bytes --]

Dear Gandalf,

In message <CAJH6TXgvrVckHDmh1oiN9mupLrsS2NP3J44bG1_wE9Nnx4=yHQ@mail.gmail.com> you wrote:
> 
> 1) all raid controllers have proactive monitoring features, like
> patrol read, consistency check and (more or less) some SMART
> integration.
> Any counterpart in mdadm?

As Wol already pointed out, you should use  smartctl  to monitor
the state of the disk drives, ideally on a regular basis.  Changes
(increases) of numbers like "Reallocated Sectors", "Current Pending
Sectors" or "Offline Uncorrectable Sectors" are always suspicious.
If they increase just by one, and then stay constant for weeks you
can probably ignore it.  But if you see I/O errors in the system
logs and/or "Reallocated Sectors" increasing every few days then you
should not wait much longer and replace the respective drive.
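
For a quick manual look at just these three counters, a sketch along
the following lines may help.  It assumes ATA drives for which
"smartctl -A" prints the usual ten-column attribute table (attribute
name in column 2, raw value in column 10); the function name
smart_counts is made up for this example.

```sh
#!/bin/sh
# Sketch only: print the raw values of the three attributes named above.
# Reads "smartctl -A" output on stdin.  Assumes the common ATA attribute
# table layout: attribute name in column 2, raw value in column 10.
smart_counts() {
	awk '$2 == "Reallocated_Sector_Ct" ||
	     $2 == "Current_Pending_Sector" ||
	     $2 == "Offline_Uncorrectable"	{ printf "%s=%d\n", $2, $10 }'
}

# Typical use (needs root and smartmontools installed):
#   smartctl -A /dev/sda | smart_counts
```

Drives behind some RAID controllers or USB bridges may not report this
table at all, in which case the function simply prints nothing.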

Attached are two very simple scripts I use for this purpose;
"disk-test" simply runs smartctl on all /dev/sd? devices and parses
the output.  The result is something like this:

$ sudo disk-test
=== /dev/sda : ST1000NM0011 S/N Z1N2RA6E *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdb : ST2000NM0033-9ZM175 S/N Z1X1J1K9 OK
=== /dev/sdc : ST2000NM0033-9ZM175 S/N Z1X1JEF6 OK
=== /dev/sdd : ST2000NM0033-9ZM175 S/N Z1X4XSN9 OK
=== /dev/sde : ST2000NM0033-9ZM175 S/N Z1X4X6G8 OK
=== /dev/sdf : ST2000NM0033-9ZM175 S/N Z1X54EA1 OK
=== /dev/sdg : ST2000NM0033-9ZM175 S/N Z1X5443W OK
=== /dev/sdh : ST2000NM0033-9ZM175 S/N Z1X4XAHQ OK
=== /dev/sdi : ST2000NM0033-9ZM175 S/N Z1X4X6NB OK
=== /dev/sdj : TOSHIBA MK1002TSKB S/N 32E3K0K2F OK
=== /dev/sdk : TOSHIBA MK1002TSKB S/N 32F3K0PRF OK
=== /dev/sdl : TOSHIBA MK1002TSKB S/N 32H3K10CF *** ERRORS ***
        Reallocated Sectors:     1
=== /dev/sdm : TOSHIBA MK1002TSKB S/N 32H3K0ZLF OK
=== /dev/sdn : TOSHIBA MK1002TSKB S/N 32H3K104F OK
=== /dev/sdo : TOSHIBA MK1002TSKB S/N 32H1K31DF OK
=== /dev/sdp : TOSHIBA MK1002TSKB S/N 32F3K0PUF OK
=== /dev/sdq : TOSHIBA MK1002TSKB S/N 32E3K0JZF OK

Here I have two drives with 1 reallocated sector each, which I
consider harmless as the count has stayed constant for several months.
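
Whether a counter "stayed constant" can also be checked mechanically.
The sketch below is only an illustration: it assumes two snapshot
files holding one NAME=VALUE line per counter (a format invented for
this example, as is the function name grown_counters) and prints the
counters whose value has grown between the two snapshots.

```sh
#!/bin/sh
# Sketch only: grown_counters PREV CUR
# PREV and CUR are files with one NAME=VALUE line per counter (a
# hypothetical format chosen for this example).  Prints every counter
# whose value in CUR is larger than the one recorded in PREV.
grown_counters() {
	awk -F= 'NR == FNR { old[$1] = $2; next }	# first file: remember old values
	         ($1 in old) && $2 + 0 > old[$1] + 0 { print $1 ": " old[$1] " -> " $2 }' "$1" "$2"
}

# Typical use:
#   grown_counters sda.previous sda.current
```

Counters that are unchanged, or that appear in only one of the two
files, produce no output.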

The second script "disk-watch" is intended to be run as a cron job
on a regular basis (here usually twice per day).  It will send out
email whenever the state changes (don't forget to adjust the MAIL_TO
setting).  You may also want to clean up the entries in /var/log/diskwatch
every now and then (or better add it to your logrotate
configuration).
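
If you prefer not to involve logrotate, the cleanup can be sketched in
a few lines of shell.  This is an illustration under assumptions, not
part of the attached scripts: the function name prune_diskwatch_logs,
the 90-day retention and the install path of disk-watch in the sample
crontab line are all made up.

```sh
#!/bin/sh
# Sketch only: prune_diskwatch_logs DIR DAYS
# Removes report files in DIR that were last modified more than DAYS
# days ago.  Only regular files are matched, so the "latest" and
# "previous" symlinks themselves are never removed.
prune_diskwatch_logs() {
	find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Typical use, keeping roughly three months of reports:
#   prune_diskwatch_logs /var/log/diskwatch 90
#
# Sample crontab entry to run disk-watch twice per day (the path is an
# assumption; use wherever you installed the script):
#   0 6,18 * * * /usr/local/sbin/disk-watch
```

With disk-watch running twice a day, the files that "latest" and
"previous" point to are always recent, so pruning by age should not
leave dangling links.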

HTH.


Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Yes, it's a technical challenge, and  you  have  to  kind  of  admire
people  who go to the lengths of actually implementing it, but at the
same time you wonder about their IQ...
         --  Linus Torvalds in <5phda5$ml6$1@palladium.transmeta.com>


[-- Attachment #2: disk-test --]
[-- Type: text/plain , Size: 1083 bytes --]

#!/bin/sh

DISKS="$(echo /dev/sd?)"

PATH=$PATH:/sbin:/usr/sbin

for i in ${DISKS}
do
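	# Keep only the identity lines and the watched attributes; the last
	# grep drops attribute lines whose raw value is 0.
	SMARTDATA=$(smartctl -a "$i" | \
	grep -E 'Device Model:|Serial Number:|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|failed|Unknown USB' | \
	grep -v ' -  *0$')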
	LINES=$(echo "${SMARTDATA}" | wc -l)
	HEAD=$(echo "${SMARTDATA}" | \
	       sed -n -e 's/Device Model: //p' \
		      -e 's!Serial Number:!S/N!p')	
	BODY=$(echo "${SMARTDATA}" | \
	       awk '$2 ~ /Reallocated_Sector_Ct/	{ printf "Reallocated Sectors:   %3d\n", $10 }
		    $2 ~ /Current_Pending_Sector/	{ printf "Current Pending Sect:  %3d\n", $10 }
		    $2 ~ /Offline_Uncorrectable/	{ printf "Offline Uncorrectable: %3d\n", $10 }
		    $0 ~ /failed:.*AMCC/		{ printf "Unsupported AMCC/3ware controller\n" }
		    $0 ~ /SMART command failed/		{ printf "Device does not support SMART\n" }
		    $0 ~ /Unknown USB bridge/		{ printf "Unknown USB bridge\n" }
		'
	     )
	if [ $LINES -eq 2 ]
	then
		echo === $i : ${HEAD} OK
	else
		echo === $i : ${HEAD} "*** ERRORS ***"
		echo "${BODY}" | sed -e 's/^/	/'
	fi
done

[-- Attachment #3: disk-watch --]
[-- Type: text/plain , Size: 683 bytes --]

#!/bin/sh

D_TEST=/usr/local/sbin/disk-test
D_LOGDIR=/var/log/diskwatch
MAIL_TO="root"

[ -x ${D_TEST} ] || { echo "ERROR: cannot execute ${D_TEST}" >&2 ; exit 1 ; }

[ -d ${D_LOGDIR} ] || \
	mkdir -p ${D_LOGDIR} || \
		{ echo "ERROR: cannot create ${D_LOGDIR}" >&2 ; exit 1 ; }

cd ${D_LOGDIR} || { echo "ERROR: cannot cd ${D_LOGDIR}" >&2 ; exit 1 ; }

rm -f previous

[ -L latest ] && mv latest previous

NOW=$(date "+%F-%T")

${D_TEST} >${NOW}

ln -s "${NOW}" latest

DIFF=''

[ -r previous ] && DIFF=$(diff -u previous latest)

[ -z "${DIFF}" ] && exit 0

mailx -s "$(hostname): SMART DISK WARNING" ${MAIL_TO} <<+++
Disk status change:
${DIFF}

Recent results:
$(cat latest)
+++

* Disk Monitoring
@ 2017-06-28 10:25 Gandalf Corvotempesta
  2017-06-28 10:45 ` Johannes Truschnigg
  2017-06-28 12:43 ` Wols Lists
  0 siblings, 2 replies; 23+ messages in thread
From: Gandalf Corvotempesta @ 2017-06-28 10:25 UTC (permalink / raw)
  To: linux-raid

Hi to all
I have always used hardware RAID, but for my next server I would like to use mdadm.

Some questions:

1) all raid controllers have proactive monitoring features, like
patrol read, consistency check and (more or less) some SMART
integration.
Any counterpart in mdadm?

2) thanks to these features, RAID controllers are usually able to detect
disk issues before they cause data loss. What about mdadm?

How and when do you replace disks? Based on which parameters? Do you
always wait for a total failure before replacing the disk?

Is mdadm able to warn about possible problems before they happen?

Many times in the past our RAID controllers forced a bad-sector
reallocation during proactive tasks like patrol read. This has saved me
many times. Once I tried not replacing a disk after such a
reallocation (it was a test server), and after some weeks the
disk failed completely.


