[RFC PATCH 0/1] btrfs-progs: scrub: add start/end position for scrub

From: Zygo Blaxell @ 2019-12-02  3:44 UTC
  To: linux-btrfs

This patch has some problems that will be a lot of work to fix, and
before doing any of that I thought I would check to see if anyone else
thinks the idea is sane.

This patch just adds start (-s) and end (-e) position arguments to 'btrfs
scrub start', to enable focusing a scrub on specific areas of a device.
The positions are offsets from the start of the device.

The idea is that if you have a disk with a lot of errors, you do a
loop of:

	1. start a scrub at the beginning of the disk
	2. get some read/uncorrectable errors in dmesg
	3. cancel scrub
	4. fix the errors (delete/replace files)
	5. restart scrub at just before the offset of the first error
	6. repeat from step 2

The last steps use the '-s' option to skip over parts of the disk that
have already been scrubbed.  Each pass starts reading just before the
first detected error in the previous pass to confirm that all references
to the offending data blocks have been removed from the filesystem.
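
As a rough sketch of how the restart position for each pass could be
chosen (Python, for illustration only): take the device offsets of the
errors seen in the previous pass, back up a little from the lowest one,
and pass the result to -s.  The helper names, the 1 MiB margin, the 4 KiB
alignment, and the assumption that -s and -e take plain byte counts are
illustrative guesses, not something the patch defines.

```python
MARGIN = 1024 * 1024   # assumed safety margin before the first error
ALIGN = 4096           # assumed alignment for the start position


def next_scrub_start(error_offsets, margin=MARGIN, align=ALIGN):
    """Return the device offset at which the next pass should start.

    error_offsets: device (physical) byte offsets of the errors reported
    in the previous pass.  With no errors we start from the beginning.
    """
    if not error_offsets:
        return 0
    # Back up from the lowest error offset, but never below offset 0.
    start = max(min(error_offsets) - margin, 0)
    # Round down to the assumed alignment.
    return start - (start % align)


def scrub_command(device, start, end=None):
    """Build (but do not run) the argv for a position-limited scrub.

    Assumes the -s/-e options from this patch accept byte counts.
    """
    argv = ["btrfs", "scrub", "start", "-s", str(start)]
    if end is not None:
        argv += ["-e", str(end)]
    return argv + [device]
```

The command is only built, not executed, so the arithmetic can be checked
before anything touches the disk.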

Without these options, the process looks like this:

	1. start a scrub at the beginning of the disk
	2. get a random sample of read/uncorrectable errors in dmesg
	3. wait for scrub to end
	4. fix the errors (delete/replace files)
	5. repeat from step 1

The current approach needs a full scrub to be repeated many times, because
only a small percentage of a large number of errors will be sampled on
each pass due to dmesg ratelimiting.

It is possible to cancel the scrub, edit /var/lib/btrfs/scrub.status.*,
change the "last_physical" field to the desired start position, and then
resume the scrub to achieve a similar effect to this patch, but that's
somewhat ugly.
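
For reference, the manual workaround amounts to something like the
following Python sketch.  It assumes the status file stores its fields as
"name:value" pairs, with last_physical written as a decimal number; check
the real file before relying on that.  The function name is made up, and
the sketch rewrites every last_physical field it finds, so it only suits
the single-device case.

```python
import re


def set_last_physical(status_text, new_offset):
    """Return status_text with every last_physical value replaced.

    Assumes fields look like "last_physical:<decimal digits>".
    Raises ValueError if no such field is present, so a format
    mismatch is noticed instead of silently ignored.
    """
    new_text, count = re.subn(
        r"\b(last_physical:)\d+",
        lambda m: m.group(1) + str(new_offset),
        status_text,
    )
    if count == 0:
        raise ValueError("no last_physical field found")
    return new_text
```

The scrub must be cancelled before the file is edited and resumed (not
started) afterwards, as described above.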

TODO:

This patch does nothing to correct the "Total bytes to scrub" or
"ETA" fields in various outputs, which are very wrong when the new
-s and -e options are used.  Fixing that will require joining the
device tree with block groups to estimate how many bytes will be
scrubbed.  Alternatively, we could just disable the ETA and "Total
bytes to scrub" fields in the status output when -s or -e is used.
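
To give an idea of the estimate involved, here is a minimal Python sketch
of the range arithmetic only.  It assumes we already have a list of
(offset, length) device extents for the device and that the scrub range is
half-open, [start, end); both are assumptions for illustration.  It counts
allocated bytes, so it is an upper bound: the real fix would still need
the block group join to account for how much of each extent is used.

```python
def bytes_in_range(dev_extents, start=0, end=None):
    """Sum the allocated bytes that fall inside [start, end).

    dev_extents: iterable of (offset, length) pairs in bytes.
    end=None means "to the end of the device".
    """
    total = 0
    for offset, length in dev_extents:
        lo = max(offset, start)
        hi = offset + length
        if end is not None:
            hi = min(hi, end)
        # Only count extents that actually overlap the range.
        if hi > lo:
            total += hi - lo
    return total
```

For example, with extents at (0, 100), (200, 100) and (1000, 50), a scrub
of [50, 250) covers 50 bytes of the first extent and 50 of the second.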
