linux-btrfs.vger.kernel.org archive mirror
* Scrub resume failure
@ 2019-06-06 14:26 Graham Cobb
  2019-06-07 23:54 ` Graham Cobb
  0 siblings, 1 reply; 2+ messages in thread
From: Graham Cobb @ 2019-06-06 14:26 UTC (permalink / raw)
  To: linux-btrfs

I have a btrfs filesystem which I want to scrub. This is a multi-TB
filesystem and will take well over 24 hours to scrub.

Unfortunately, the scrub turns out to have a heavy impact on the rest of
the system, even when it is run at very low priority with both ionice and
nice. Operations on other disks become excessively slow, causing timeouts
on important actions such as mail delivery (which then bounces).

So, I break it up. I run it for some interval (hours), with the
time-critical services stopped. Then I cancel the scrub and let mail
delivery run for a while. Then I stop mail again and resume the scrub
for another interval, etc.

This works and solves the mail bounce problem.

However, after a few cancel/resume cycles, the scrub terminates. No
errors are reported but one of the resumes will just immediately
terminate claiming the scrub is done. It isn't. Nowhere near.

The disk being scrubbed is in use during all this. It doesn't get a
heavy load but it is my main backup disk and various backups happen,
some of them involving snapshots being created and deleted.

Glancing at the use of the ioctl in the btrfs-progs code, I assume the
resume is using the last_physical from the last run as the start for the
next. Does that break if the filesystem has changed in the meantime, so
that the saved offset no longer refers to a used block? If so, I think
that makes resume useless.

If this is not expected behaviour I will do more work to analyse and
reproduce.

Graham


* Re: Scrub resume failure
  2019-06-06 14:26 Scrub resume failure Graham Cobb
@ 2019-06-07 23:54 ` Graham Cobb
  0 siblings, 0 replies; 2+ messages in thread
From: Graham Cobb @ 2019-06-07 23:54 UTC (permalink / raw)
  To: linux-btrfs

On 06/06/2019 15:26, Graham Cobb wrote:
> However, after a few cancel/resume cycles, the scrub terminates. No
> errors are reported but one of the resumes will just immediately
> terminate claiming the scrub is done. It isn't. Nowhere near.

I believe I have found the problem. It is a bug in the scrub command.

When a scrub completes or is cancelled, the utility updates the saved
statistics for reporting using btrfs scrub status. These statistics
include the last_physical value returned from the ioctl, which is then
used by the resume code to specify the start for the next run.

Most statistics (such as bytes scrubbed, error counts, etc.) are
maintained by adding the values from the current run to the saved
values. However, the last_physical value is a position, not a counter,
so it should not be added: it should replace the saved value. The current
code incorrectly adds it to the saved value, meaning that large parts of
the filesystem are skipped on the next run.

I have created a patch, which I will send in a separate message. As I
have not submitted patches to this list before, I will send it as a
PATCH RFC and would welcome comments.

Graham
