Scrub resume failure

From: Graham Cobb @ 2019-06-06 14:26 UTC
  To: linux-btrfs

I have a btrfs filesystem which I want to scrub. This is a multi-TB
filesystem and will take well over 24 hours to scrub.

Unfortunately, the scrub turns out to be quite intrusive on the rest of
the system (even when run at very low priority with ionice and nice).
Operations on other disks run excessively slowly, causing timeouts on
important actions like mail delivery (causing bounces).

So, I break it up. I run it for some interval (hours), with the
time-critical services stopped. Then I cancel the scrub and let mail
delivery run for a while. Then I stop mail again and resume the scrub
for another interval, etc.

This works and solves the mail bounce problem.
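
For illustration, one interval of that cycle could be sketched roughly as
below. This is only a sketch under assumptions: the use of systemctl, the
mount point, the service name and the interval length are all hypothetical
and not taken from my actual setup. It relies on "btrfs scrub start" and
"btrfs scrub resume" returning immediately and leaving the scrub running
in the background, which is their default behaviour without -B.

```python
import subprocess
import time


def scrub_interval(mount_point, service, seconds, first_run,
                   run=subprocess.run, sleep=time.sleep):
    """Run one scrub interval with a time-critical service stopped.

    mount_point, service and seconds are hypothetical example parameters.
    run and sleep are injectable so the sequence can be checked without
    touching a real filesystem.
    """
    # The first interval starts a fresh scrub; later ones resume it.
    verb = "start" if first_run else "resume"
    run(["systemctl", "stop", service], check=True)
    try:
        # Returns at once; the scrub continues in the background.
        run(["btrfs", "scrub", verb, mount_point], check=True)
        sleep(seconds)
        # cancel exits non-zero if no scrub is running (for example if it
        # already finished), so do not treat that as fatal here.
        run(["btrfs", "scrub", "cancel", mount_point], check=False)
    finally:
        # Always bring the service back, even if a scrub command failed.
        run(["systemctl", "start", service], check=True)
```

A caller would invoke this once with first_run=True and then repeatedly
with first_run=False, letting mail delivery run between calls.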

However, after a few cancel/resume cycles, the scrub terminates early.
No errors are reported, but one of the resumes will just terminate
immediately, claiming the scrub is done. It isn't. Nowhere near.

The disk being scrubbed is in use during all this. It doesn't get a
heavy load but it is my main backup disk and various backups happen,
some of them involving snapshots being created and deleted.

Glancing at the use of the ioctl in the btrfs-progs code, I assume the
resume is using the last_physical from the last run as the start for the
next. Does that break if the filesystem has changed and that is no
longer a used block or something? If so, I think that makes resume useless.
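
To make that assumption concrete, here is a toy model of what I think
resume does. It is not the btrfs-progs code: the names ScrubProgress and
resume_range are invented for illustration, and the idea that the range
runs from last_physical to the maximum 64-bit offset is my guess from a
quick read, not something I have verified.

```python
from dataclasses import dataclass

U64_MAX = 2**64 - 1  # assumed "scrub to the end of the device" marker


@dataclass
class ScrubProgress:
    # Last physical byte offset reported by the previous (cancelled) run.
    last_physical: int


def resume_range(previous: ScrubProgress) -> tuple[int, int]:
    """Model of the assumed resume logic.

    The next run starts at the offset where the previous run stopped and
    continues to the end of the device.
    """
    return previous.last_physical, U64_MAX
```

If the offset saved by the previous run no longer means what it did when
it was recorded, a model like this has nothing else to fall back on.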

If this is not expected behaviour I will do more work to analyse and
reproduce.

Graham
