* data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
@ 2020-02-09  0:43 Zygo Blaxell
  2020-02-09  9:00 ` Martin Steigerwald
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Zygo Blaxell @ 2020-02-09  0:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Timothy Pearson

[-- Attachment #1: Type: text/plain, Size: 7042 bytes --]

I reproduced a filesystem rollback similar to one reported back in
November by Timothy Pearson:

	https://www.spinics.net/lists/linux-btrfs/msg94318.html

The timeline is something like this:

1.  System gets loaded, lots of writes, a few processes calling fsync().
Fairly normal stuff.

2.  Just enough new data is written continuously to keep a transaction
from finishing, but not enough to block new writes (i.e. the filesystem
is kept in equilibrium between dirty_background_bytes and dirty_bytes).
Typically an application gets stuck in fsync() here, but the rest of
the system is unaffected.  Here is one:

	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
	==> /proc/31872/task/31872/stack <==
	[<0>] wait_on_page_bit+0x172/0x250
	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
	[<0>] read_tree_block+0x4e/0x70
	[<0>] read_block_for_search.isra.34+0x1aa/0x350
	[<0>] btrfs_search_slot+0x20a/0x940
	[<0>] lookup_extent_data_ref+0x7e/0x210
	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
	[<0>] btrfs_commit_transaction+0x52/0xa70
	[<0>] btrfs_sync_file+0x248/0x490
	[<0>] vfs_fsync_range+0x48/0x80
	[<0>] do_fsync+0x3d/0x70
	[<0>] __x64_sys_fdatasync+0x17/0x20
	[<0>] do_syscall_64+0x5f/0x1f0
	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
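
These stacks come from nothing fancier than polling /proc; a minimal
sketch of such a sampler (pid and interval are placeholders, not the
exact tooling used here):

	#!/bin/sh
	# Sample the kernel stack of one pid (e.g. the process stuck in
	# fsync) with a timestamp.  Needs root to read /proc/<pid>/stack.
	pid="$1"
	while kill -0 "$pid" 2>/dev/null; do
		echo "TIMESTAMP: $(date)"
		echo "==> /proc/$pid/task/$pid/stack <=="
		cat "/proc/$pid/task/$pid/stack"
		sleep 60
	done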

3.  The transid of the subvol roots never increases as long as the balance
of incoming and outgoing data flows in #2 is maintained.  This can go on
for hours or even days on recent kernels (much longer than was possible on
5.0).  In this particular case it was 3 hours; in earlier cases I've caught
it delayed by 14 hours or more and been able to recover by pausing write
workloads long enough for btrfs to keep up.  Here is an excerpt from bees
logs showing this (the filesystem's current transid is the "count" field):

	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100 transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio = 7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }

and 3 hours later the filesystem transid hasn't moved:

	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100 transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio = 6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
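
The stall is also visible without bees, by polling the subvol transid
directly; a rough sketch using the "transid marker" line that btrfs
subvolume find-new prints (path and interval are placeholders):

	#!/bin/sh
	# Print the current transid of a subvol once a minute.  Passing a
	# huge last-gen makes find-new list nothing and just emit its
	# "transid marker was N" line.
	subvol="$1"
	while true; do
		echo "$(date): $(btrfs subvolume find-new "$subvol" 9999999999 | tail -n1)"
		sleep 60
	done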

4.  Most writes continue without incident:  git commits, log files,
bees dedupe, kernel builds, rsync all run normally.  Some things block:
applications that call fsync() or sync() get stuck.  Snapshot creation
blocks.  Maintenance balances, when the maintenance window opens, also
block, and snapshot deletes are then blocked waiting for balance.
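
The fact that snapshot creation blocks makes a cheap canary possible;
a hedged sketch, with paths and timeout as placeholders:

	#!/bin/sh
	# Canary sketch: create (and remove) a throwaway read-only snapshot.
	# If it doesn't finish within the timeout, transaction commits are
	# stuck.
	fs=/mnt/fs
	if timeout 300 btrfs subvolume snapshot -r "$fs" "$fs/.commit-canary"; then
		btrfs subvolume delete "$fs/.commit-canary"
	else
		echo "btrfs commit appears stuck on $fs" >&2
	fi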

In most cases this is as far as we get:  eventually something breaks
the equilibrium from #2, the stalled transaction commit completes,
and everything is normal; however, in this case, we never finish the
transaction.  fsync (the same one!) is still running some hours later:

	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
	==> /proc/31872/task/31872/stack <==
	[<0>] btrfs_tree_lock+0x120/0x260
	[<0>] btrfs_lock_root_node+0x34/0x50
	[<0>] btrfs_search_slot+0x4d5/0x940
	[<0>] lookup_inline_extent_backref+0x164/0x5a0
	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
	[<0>] btrfs_commit_transaction+0x52/0xa70
	[<0>] btrfs_sync_file+0x248/0x490
	[<0>] vfs_fsync_range+0x48/0x80
	[<0>] do_fsync+0x3d/0x70
	[<0>] __x64_sys_fdatasync+0x17/0x20
	[<0>] do_syscall_64+0x5f/0x1f0
	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

We know it's the same fsync call because a supervisor process killed
pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
stack changes continuously the whole time--but it doesn't finish running.

5.  2 hours later, lstat() in the watchdog daemon blocks:  

	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
	==> /proc/4506/task/4506/stack <==
	[<0>] lookup_slow+0x2c/0x60
	[<0>] walk_component+0x1bf/0x330
	[<0>] path_lookupat.isra.44+0x6d/0x220
	[<0>] filename_lookup.part.59+0xa0/0x170
	[<0>] user_path_at_empty+0x3e/0x50
	[<0>] vfs_statx+0x76/0xe0
	[<0>] __do_sys_newlstat+0x3d/0x70
	[<0>] __x64_sys_newlstat+0x16/0x20
	[<0>] do_syscall_64+0x5f/0x1f0
	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Up to that point, a few processes have been blocked for up to 5 hours,
but this is not unusual on a big filesystem given #1.  Usually processes
that read the filesystem (e.g. calling lstat) are not blocked, unless they
try to access a directory being modified by a process that is blocked.
lstat() being blocked is unusual.

6.  60 seconds later, the system reboots due to watchdog timeout.

Upon reboot, the filesystem reverts to its state at the last completed
transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
be intact, but there is no trace of any update to the filesystem after
the transaction 4441796.  The last 'fi usage' logged before the crash
and the first 'fi usage' after show 40GB of data and 25GB of metadata
block groups freed in between.
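
That comparison only needs routine logging of 'fi usage'; a minimal
sketch (paths and interval are placeholders), with the log kept off the
affected filesystem so the last pre-crash sample survives the rollback:

	#!/bin/sh
	# Append a timestamped 'btrfs fi usage' snapshot every 10 minutes.
	# The log must live on a different filesystem or it gets rolled
	# back along with everything else.
	fs=/mnt/fs
	log=/var/log/btrfs-usage.log
	while true; do
		{ date; btrfs filesystem usage -b "$fs"; } >> "$log"
		sleep 600
	done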

I have only observed this on kernel 5.4, but I've seen the commit blocking
behavior twice in two days.  I have not observed it so far on 5.0 and
earlier (since it was released in March 2019).  I don't have data from
kernels in between due to other test-blocking bugs.

TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
I've seen some transactions on 5.0 running for 5-10 minutes, and I
haven't been checking for this specific behavior in earlier testing;
however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
I'd expect somebody would have noticed.

On kernel 5.4 we see fewer processes blocked under heavy write loads,
but the processes that do block are blocked longer.  In particular, our
watchdog daemon, which creates and removes a directory every 30 seconds
to detect lockups, didn't fire.  The system ran for 3 hours before the
watchdog detected a problem in this case, and the previous day the system
ran 14 hours without completing a transaction and the watchdog didn't
fire at all.  We'll add an fsync to the watchdog too, as the logs for
systems running 5.4 are full of processes stuck many hours in fsync.
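
For the curious, the probe amounts to a mkdir/rmdir pair, and the
planned change is the extra fsync; a simplified sketch, not the
production watchdog (which also has to pet the actual watchdog device
whenever the probe succeeds in time):

	#!/bin/sh
	# Simplified watchdog probe: create and remove a directory, then
	# fsync a file on the same filesystem (the planned addition) so the
	# probe actually blocks when commits are stuck.  coreutils sync(1)
	# with a file argument does an fsync.  Paths are placeholders.
	fs=/mnt/fs
	while true; do
		mkdir "$fs/.watchdog-probe" && rmdir "$fs/.watchdog-probe"
		touch "$fs/.watchdog-stamp" && sync "$fs/.watchdog-stamp"
		sleep 30
	done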

Going forward we will add an alert to our server monitoring to verify that
the superblock's generation number field updates at regular intervals
(at least once an hour) and apply a workaround if not.  I'll also add
this case to my test workload and start a bisect to see if this is a
regression in recent kernels.

There is a workaround:  detect when the superblock generation field stops
updating, and pause (freeze or SIGSTOP) all writing processes long enough
for btrfs to catch up and complete the current transaction.
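
A rough sketch of that workaround; the device and the list of writers
to freeze are placeholders, and picking the right set of processes to
stop is the site-specific part:

	#!/bin/sh
	# If the superblock generation hasn't advanced in 15 minutes,
	# SIGSTOP the heavy writers until it moves again.
	dev=/dev/vdb
	writers="rsync bees"
	gen() { btrfs ins dump-super "$dev" | awk '/^generation/ { print $2 }'; }
	last="$(gen)"
	while sleep 900; do
		now="$(gen)"
		if [ "$now" = "$last" ]; then
			for p in $writers; do pkill -STOP -x "$p"; done
			# wait for the stuck transaction to finally commit
			while [ "$(gen)" = "$last" ]; do sleep 60; done
			for p in $writers; do pkill -CONT -x "$p"; done
		fi
		last="$(gen)"
	done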

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09  0:43 data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11 Zygo Blaxell
@ 2020-02-09  9:00 ` Martin Steigerwald
  2020-02-10  4:10   ` Zygo Blaxell
  2020-02-09 17:08 ` Martin Raiber
  2020-02-10  1:49 ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Martin Steigerwald @ 2020-02-09  9:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs, Timothy Pearson

Zygo Blaxell - 09.02.20, 01:43:07 CET:
> Up to that point, a few processes have been blocked for up to 5 hours,
> but this is not unusual on a big filesystem given #1.  Usually
> processes that read the filesystem (e.g. calling lstat) are not
> blocked, unless they try to access a directory being modified by a
> process that is blocked. lstat() being blocked is unusual.

This is really funny, because what you consider not unusual, I'd 
consider a bug or at least a huge limitation.

But in a sense I never really got why processes can be stuck in 
uninterruptible sleep on Linux or Unix *at all*. Such a situation, 
without giving the user at least the ability to end it by saying "I don't 
care about the data that process was going to write, let me kill it 
already", is to me a major limitation that appears to be specific to 
the UNIX architecture, or at least to the way the Linux virtual memory 
manager works.

That written, I may be completely ignorant of something very important 
here, and some may tell me it can't be any other way for this or that 
reason. Currently I still think it can.

And even if uninterruptible sleep can still happen because it is really 
necessary, five hours is at least about five hours minus probably a minute 
or so too long.

Ciao,
-- 
Martin



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09  0:43 data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11 Zygo Blaxell
  2020-02-09  9:00 ` Martin Steigerwald
@ 2020-02-09 17:08 ` Martin Raiber
  2020-02-09 23:11   ` Timothy Pearson
  2020-02-10  5:20   ` Zygo Blaxell
  2020-02-10  1:49 ` Chris Murphy
  2 siblings, 2 replies; 11+ messages in thread
From: Martin Raiber @ 2020-02-09 17:08 UTC (permalink / raw)
  To: Zygo Blaxell, linux-btrfs; +Cc: Timothy Pearson

On 09.02.2020 01:43 Zygo Blaxell wrote:
> I reproduced a filesystem rollback similar to one reported back in
> November by Timothy Pearson:
>
> 	https://www.spinics.net/lists/linux-btrfs/msg94318.html
>
> The timeline is something like this:
>
> 1.  System gets loaded, lots of writes, a few processes calling fsync().
> Fairly normal stuff.
>
> 2.  Just enough new data is written continuously to keep a transaction
> from finishing, but not enough to block new writes (i.e. the filesystem
> is kept in equilibrium between dirty_background_bytes and dirty_bytes).
> Typically an application gets stuck in fsync() here, but the rest of
> the system is unaffected.  Here is one:
>
> 	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
> 	==> /proc/31872/task/31872/stack <==
> 	[<0>] wait_on_page_bit+0x172/0x250
> 	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
> 	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
> 	[<0>] read_tree_block+0x4e/0x70
> 	[<0>] read_block_for_search.isra.34+0x1aa/0x350
> 	[<0>] btrfs_search_slot+0x20a/0x940
> 	[<0>] lookup_extent_data_ref+0x7e/0x210
> 	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> 	[<0>] btrfs_commit_transaction+0x52/0xa70
> 	[<0>] btrfs_sync_file+0x248/0x490
> 	[<0>] vfs_fsync_range+0x48/0x80
> 	[<0>] do_fsync+0x3d/0x70
> 	[<0>] __x64_sys_fdatasync+0x17/0x20
> 	[<0>] do_syscall_64+0x5f/0x1f0
> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> 3.  The transid of the subvol roots never increases as long as the balance
> of incoming and outgoing data flows in #2 is maintained.  This can go on
> for hours or even days on recent kernels (much longer than was possible on
> 5.0). In this particular case it was 3 hours, in earlier cases I've caught
> it delayed by 14 hours or more, and been able to recover by pausing write
> workloads long enough for btrfs to keep up.  Here is an excerpt from bees
> logs showing this (the filesystem's current transid is the "count" field):
>
> 	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100 transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio = 7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }
>
> and 3 hours later the filesystem transid hasn't moved:
>
> 	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100 transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio = 6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
>
> 4.  Most writes continue without incident:  git commits, log files,
> bees dedupe, kernel builds, rsync all run normally.  Some things block:
> applications that call fsync() or sync() get stuck.  Snapshot creation
> blocks.  Maintenance balances, when the maintenance window opens, also
> block, and snapshot deletes are then blocked waiting for balance.
>
> In most cases this is as far as we get:  eventually something breaks
> the equilibrium from #2, the stalled transaction commit completes,
> and everything is normal; however, in this case, we never finish the
> transaction.  fsync (the same one!) is still running some hours later:
>
> 	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
> 	==> /proc/31872/task/31872/stack <==
> 	[<0>] btrfs_tree_lock+0x120/0x260
> 	[<0>] btrfs_lock_root_node+0x34/0x50
> 	[<0>] btrfs_search_slot+0x4d5/0x940
> 	[<0>] lookup_inline_extent_backref+0x164/0x5a0
> 	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> 	[<0>] btrfs_commit_transaction+0x52/0xa70
> 	[<0>] btrfs_sync_file+0x248/0x490
> 	[<0>] vfs_fsync_range+0x48/0x80
> 	[<0>] do_fsync+0x3d/0x70
> 	[<0>] __x64_sys_fdatasync+0x17/0x20
> 	[<0>] do_syscall_64+0x5f/0x1f0
> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> We know it's the same fsync call because a supervisor process killed
> pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
> stack changes continuously the whole time--but it doesn't finish running.
>
> 5.  2 hours later, lstat() in the watchdog daemon blocks:  
>
> 	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
> 	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
> 	==> /proc/4506/task/4506/stack <==
> 	[<0>] lookup_slow+0x2c/0x60
> 	[<0>] walk_component+0x1bf/0x330
> 	[<0>] path_lookupat.isra.44+0x6d/0x220
> 	[<0>] filename_lookup.part.59+0xa0/0x170
> 	[<0>] user_path_at_empty+0x3e/0x50
> 	[<0>] vfs_statx+0x76/0xe0
> 	[<0>] __do_sys_newlstat+0x3d/0x70
> 	[<0>] __x64_sys_newlstat+0x16/0x20
> 	[<0>] do_syscall_64+0x5f/0x1f0
> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Up to that point, a few processes have been blocked for up to 5 hours,
> but this is not unusual on a big filesystem given #1.  Usually processes
> that read the filesystem (e.g. calling lstat) are not blocked, unless they
> try to access a directory being modified by a process that is blocked.
> lstat() being blocked is unusual.
>
> 6.  60 seconds later, the system reboots due to watchdog timeout.
>
> Upon reboot, the filesystem reverts to its state at the last completed
> transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> be intact, but there is no trace of any update to the filesystem after
> the transaction 4441796.  The last 'fi usage' logged before the crash
> and the first 'fi usage' after show 40GB of data and 25GB of metadata
> block groups freed in between.
>
> I have only observed this on kernel 5.4, but I've seen the commit blocking
> behavior twice in two days.  I have not observed it so far on 5.0 and
> earlier (since it was released in March 2019).  I don't have data from
> kernels in between due to other test-blocking bugs.
>
> TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
> I've seen some transactions on 5.0 running for 5-10 minutes, and I
> haven't been checking for this specific behavior in earlier testing;
> however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
> I'd expect somebody would have noticed.
>
> On kernel 5.4 we see fewer processes blocked under heavy write loads,
> but the processes that do block are blocked longer.  In particular, our
> watchdog daemon, which creates and removes a directory every 30 seconds
> to detect lockups, didn't fire.  The system ran for 3 hours before the
> watchdog detected a problem in this case, and the previous day the system
> ran 14 hours without completing a transaction and the watchdog didn't
> fire at all.  We'll add an fsync to the watchdog too, as the logs for
> systems running 5.4 are full of processes stuck many hours in fsync.
>
> Going forward we will add an alert to our server monitoring to verify that
> the superblock's generation number field updates at regular intervals
> (at least once an hour) and apply a workaround if not.  I'll also add
> to my test workload and start a bisect to see if this is a regression in
> recent kernels.
>
> There is a workaround:  detect when the superblock generation field stops
> updating, and pause (freeze or SIGSTOP) all writing processes long enough
> for btrfs to catch up and complete the current transaction.

I have the same problem. Run a dozen threads with high throughput
(simply writing data to a single file sequentially) combined with a
dozen threads doing metadata-heavy workloads and it'll take a long time
for a sync() to finish. The work-around is the same as well: throttle the
application if the sync doesn't finish after a (configurable) amount of
time.
I have seen the same problem with ZFS on FreeBSD, though, so it is by no
means btrfs- or even Linux-specific. My guess is that since most file
systems are constrained/throttled by their journal size, there is no
mechanism for metadata vs. data fairness.
As with most congestion problems, the best solution is to increase
capacity (storage IOPS) so everything runs below max capacity most of
the time, or to throttle at the source to reach the same goal.

https://www.spinics.net/lists/linux-btrfs/msg72909.html was my report
back then; maybe take a look at the number of btrfs_delayed_ref_head
objects in slabtop as well?
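
Something along these lines, assuming the cache shows up under that
name in /proc/slabinfo on your kernel:

	#!/bin/sh
	# Rough check of outstanding delayed ref heads; assumes the slab
	# cache is named btrfs_delayed_ref_head.  Needs root.
	watch -n 10 'grep btrfs_delayed_ref_head /proc/slabinfo'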




^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09 17:08 ` Martin Raiber
@ 2020-02-09 23:11   ` Timothy Pearson
  2020-02-10  4:27     ` Zygo Blaxell
  2020-02-10  5:20   ` Zygo Blaxell
  1 sibling, 1 reply; 11+ messages in thread
From: Timothy Pearson @ 2020-02-09 23:11 UTC (permalink / raw)
  To: Martin Raiber; +Cc: Zygo Blaxell, linux-btrfs



----- Original Message -----
> From: "Martin Raiber" <martin@urbackup.org>
> To: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>, "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Cc: "Timothy Pearson" <tpearson@raptorengineering.com>
> Sent: Sunday, February 9, 2020 11:08:58 AM
> Subject: Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11

> On 09.02.2020 01:43 Zygo Blaxell wrote:
>> I reproduced a filesystem rollback similar to one reported back in
>> November by Timothy Pearson:
>>
>> 	https://www.spinics.net/lists/linux-btrfs/msg94318.html
>>
>> The timeline is something like this:
>>
>> 1.  System gets loaded, lots of writes, a few processes calling fsync().
>> Fairly normal stuff.
>>
>> 2.  Just enough new data is written continuously to keep a transaction
>> from finishing, but not enough to block new writes (i.e. the filesystem
>> is kept in equilibrium between dirty_background_bytes and dirty_bytes).
>> Typically an application gets stuck in fsync() here, but the rest of
>> the system is unaffected.  Here is one:
>>
>> 	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
>> 	==> /proc/31872/task/31872/stack <==
>> 	[<0>] wait_on_page_bit+0x172/0x250
>> 	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
>> 	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
>> 	[<0>] read_tree_block+0x4e/0x70
>> 	[<0>] read_block_for_search.isra.34+0x1aa/0x350
>> 	[<0>] btrfs_search_slot+0x20a/0x940
>> 	[<0>] lookup_extent_data_ref+0x7e/0x210
>> 	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
>> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
>> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
>> 	[<0>] btrfs_commit_transaction+0x52/0xa70
>> 	[<0>] btrfs_sync_file+0x248/0x490
>> 	[<0>] vfs_fsync_range+0x48/0x80
>> 	[<0>] do_fsync+0x3d/0x70
>> 	[<0>] __x64_sys_fdatasync+0x17/0x20
>> 	[<0>] do_syscall_64+0x5f/0x1f0
>> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> 3.  The transid of the subvol roots never increases as long as the balance
>> of incoming and outgoing data flows in #2 is maintained.  This can go on
>> for hours or even days on recent kernels (much longer than was possible on
>> 5.0). In this particular case it was 3 hours, in earlier cases I've caught
>> it delayed by 14 hours or more, and been able to recover by pausing write
>> workloads long enough for btrfs to keep up.  Here is an excerpt from bees
>> logs showing this (the filesystem's current transid is the "count" field):
>>
>> 	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100
>> 	transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio =
>> 	7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }
>>
>> and 3 hours later the filesystem transid hasn't moved:
>>
>> 	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100
>> 	transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio =
>> 	6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
>>
>> 4.  Most writes continue without incident:  git commits, log files,
>> bees dedupe, kernel builds, rsync all run normally.  Some things block:
>> applications that call fsync() or sync() get stuck.  Snapshot creation
>> blocks.  Maintenance balances, when the maintenance window opens, also
>> block, and snapshot deletes are then blocked waiting for balance.
>>
>> In most cases this is as far as we get:  eventually something breaks
>> the equilibrium from #2, the stalled transaction commit completes,
>> and everything is normal; however, in this case, we never finish the
>> transaction.  fsync (the same one!) is still running some hours later:
>>
>> 	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
>> 	==> /proc/31872/task/31872/stack <==
>> 	[<0>] btrfs_tree_lock+0x120/0x260
>> 	[<0>] btrfs_lock_root_node+0x34/0x50
>> 	[<0>] btrfs_search_slot+0x4d5/0x940
>> 	[<0>] lookup_inline_extent_backref+0x164/0x5a0
>> 	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
>> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
>> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
>> 	[<0>] btrfs_commit_transaction+0x52/0xa70
>> 	[<0>] btrfs_sync_file+0x248/0x490
>> 	[<0>] vfs_fsync_range+0x48/0x80
>> 	[<0>] do_fsync+0x3d/0x70
>> 	[<0>] __x64_sys_fdatasync+0x17/0x20
>> 	[<0>] do_syscall_64+0x5f/0x1f0
>> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> We know it's the same fsync call because a supervisor process killed
>> pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
>> stack changes continuously the whole time--but it doesn't finish running.
>>
>> 5.  2 hours later, lstat() in the watchdog daemon blocks:
>>
>> 	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
>> 	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
>> 	==> /proc/4506/task/4506/stack <==
>> 	[<0>] lookup_slow+0x2c/0x60
>> 	[<0>] walk_component+0x1bf/0x330
>> 	[<0>] path_lookupat.isra.44+0x6d/0x220
>> 	[<0>] filename_lookup.part.59+0xa0/0x170
>> 	[<0>] user_path_at_empty+0x3e/0x50
>> 	[<0>] vfs_statx+0x76/0xe0
>> 	[<0>] __do_sys_newlstat+0x3d/0x70
>> 	[<0>] __x64_sys_newlstat+0x16/0x20
>> 	[<0>] do_syscall_64+0x5f/0x1f0
>> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Up to that point, a few processes have been blocked for up to 5 hours,
>> but this is not unusual on a big filesystem given #1.  Usually processes
>> that read the filesystem (e.g. calling lstat) are not blocked, unless they
>> try to access a directory being modified by a process that is blocked.
>> lstat() being blocked is unusual.
>>
>> 6.  60 seconds later, the system reboots due to watchdog timeout.
>>
>> Upon reboot, the filesystem reverts to its state at the last completed
>> transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
>> be intact, but there is no trace of any update to the filesystem after
>> the transaction 4441796.  The last 'fi usage' logged before the crash
>> and the first 'fi usage' after show 40GB of data and 25GB of metadata
>> block groups freed in between.
>>
>> I have only observed this on kernel 5.4, but I've seen the commit blocking
>> behavior twice in two days.  I have not observed it so far on 5.0 and
>> earlier (since it was released in March 2019).  I don't have data from
>> kernels in between due to other test-blocking bugs.
>>
>> TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
>> I've seen some transactions on 5.0 running for 5-10 minutes, and I
>> haven't been checking for this specific behavior in earlier testing;
>> however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
>> I'd expect somebody would have noticed.
>>
>> On kernel 5.4 we see fewer processes blocked under heavy write loads,
>> but the processes that do block are blocked longer.  In particular, our
>> watchdog daemon, which creates and removes a directory every 30 seconds
>> to detect lockups, didn't fire.  The system ran for 3 hours before the
>> watchdog detected a problem in this case, and the previous day the system
>> ran 14 hours without completing a transaction and the watchdog didn't
>> fire at all.  We'll add an fsync to the watchdog too, as the logs for
>> systems running 5.4 are full of processes stuck many hours in fsync.
>>
>> Going forward we will add an alert to our server monitoring to verify that
>> the superblock's generation number field updates at regular intervals
>> (at least once an hour) and apply a workaround if not.  I'll also add
>> to my test workload and start a bisect to see if this is a regression in
>> recent kernels.
>>
>> There is a workaround:  detect when the superblock generation field stops
>> updating, and pause (freeze or SIGSTOP) all writing processes long enough
>> for btrfs to catch up and complete the current transaction.
> 
> I have the same problem. Have a dozen threads with high throughput
> (simply writing data to a single file sequentially) combined with a
> dozen threads doing metadata-heavy workloads and it'll take a long time
> for a sync() to finish. Work-around is the same as well: Throttle the
> application if the sync doesn't finish after a (configurable) amount of
> time.
> I have seen the same problem with ZFS on FreeBSD, though, so it is by no
> means btrfs or even Linux specific. My guess is that since most file
> systems are constrained/throttled by journal (size), there is no
> mechanism for metadata vs. data fairness.
> As for most congestion problems the best solution is to increase
> capacity (storage IOPS) so everything runs below max capacity most of
> the time or to throttle at the source to reach the same goal.
> 
> https://www.spinics.net/lists/linux-btrfs/msg72909.html was my report
> back then, maybe take a look at the number of btrfs_delayed_ref_head in
> slabtop as well?

If I might suggest a stopgap measure, simply firing a warning in the kernel log to the effect of "metadata write bandwidth insufficient, DATA LOSS MAY OCCUR" in this situation would be most helpful.  As it stands right now the largest problem isn't so much the reversion effect itself as the silence until the data loss occurs -- the system isn't sending any messages that anything is potentially wrong, so using BTRFS feels like playing Russian Roulette at the moment.  Downstream apps don't exactly tend to list their maximum sustained IOPS, so trying to balance array IOPS with application IOPS across a large server cluster is not feasible from a pure calculation standpoint; it is normally done semi-empirically, on the assumption that the worst-case effect is an app stall, not data loss over an indeterminate, relatively large period along with potential data corruption.

This same thought applies to ZFS, assuming it is also silently eating data in the situation mentioned.  Very interesting to know it's a general problem with this class of storage.

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09  0:43 data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11 Zygo Blaxell
  2020-02-09  9:00 ` Martin Steigerwald
  2020-02-09 17:08 ` Martin Raiber
@ 2020-02-10  1:49 ` Chris Murphy
  2020-02-10  5:18   ` Zygo Blaxell
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2020-02-10  1:49 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS, Timothy Pearson

On Sat, Feb 8, 2020 at 5:43 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Upon reboot, the filesystem reverts to its state at the last completed
> transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> be intact, but there is no trace of any update to the filesystem after
> the transaction 4441796.  The last 'fi usage' logged before the crash
> and the first 'fi usage' after show 40GB of data and 25GB of metadata
> block groups freed in between.

Is this behavior affected by flushoncommit mount option? i.e. do you
see a difference using flushoncommit vs noflushoncommit? My suspicion
is the problem doesn't happen with noflushoncommit, but then you get
another consequence that maybe your use case can't tolerate?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09  9:00 ` Martin Steigerwald
@ 2020-02-10  4:10   ` Zygo Blaxell
  0 siblings, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2020-02-10  4:10 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Timothy Pearson

[-- Attachment #1: Type: text/plain, Size: 2916 bytes --]

On Sun, Feb 09, 2020 at 10:00:34AM +0100, Martin Steigerwald wrote:
> Zygo Blaxell - 09.02.20, 01:43:07 CET:
> > Up to that point, a few processes have been blocked for up to 5 hours,
> > but this is not unusual on a big filesystem given #1.  Usually
> > processes that read the filesystem (e.g. calling lstat) are not
> > blocked, unless they try to access a directory being modified by a
> > process that is blocked. lstat() being blocked is unusual.
> 
> This is really funny, cause what you consider not being unusual, I'd 
> consider a bug or at least a huge limitation.
> 
> But in a sense I never really got that processed can be stuck in 
> uninterruptible sleep on Linux or Unix *at all*. Such a situation 
> without giving a user at least the ability to end it by saying "I don't 
> care about the data that process is to write, let me remove it already" 
> for me is a major limitation to what appears to be kind of specific to 
> the UNIX architecture or at least the way the Linux virtual memory 
> manager is working.

> That written I may be completely ignorant of something very important 
> here and some may tell me it can't be any other way for this and that 
> reason. Currently I still think it can.

The process in uninterruptible sleep is waiting for the filesystem code to
finish whatever it's doing, so that the in-memory and on-disk structures
end up in a consistent state.  If whatever it's doing is "waiting for a
lock held by some other thread doing an expensive thing", it can block
for a long time.
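
A quick way to see which tasks are stuck that way and what they are
waiting on, using nothing btrfs-specific:

	#!/bin/sh
	# List tasks in uninterruptible sleep (D state) with their kernel
	# wait channel.
	ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'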

We can't simply abort the kernel thread here, which is why it's
an uninterruptible wait (*).  Generic interruption would need to unwind the
kernel stack all the way back to userspace, reverting all changes made
to the filesystem's internal data structures as we go, without tripping
over the need for some other lock in the process, and without introducing
horrible new regressions.

In theory we can interrupt any kernel thread at any time--that happens
naturally whenever there's a BUG() or power failure, for instance--but
the effect on all the other threads that might be running is pretty
painful.

If you add a level of indirection--e.g. run the btrfs code in a VM and
access it via a network or virtio client--then we can interrupt the
client, but the server ends up having to finish whatever operation the
client requested anyway, so the client just gets to immediately hang
waiting for the server on its next call.

> And even if uninterruptible sleep can still happen cause it is really 
> necessary, five hours is at least about five hours minus probably a minute 
> or so too long.

Yes it would be nice if btrfs could avoid overcommitting itself so badly,
but that's a somewhat older and larger-scoped bug.

> Ciao,
> -- 
> Martin
> 
> 

(*) Well, we could, if all the filesystem code were written that way.
Patches welcome!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09 23:11   ` Timothy Pearson
@ 2020-02-10  4:27     ` Zygo Blaxell
  2020-02-10  5:18       ` Timothy Pearson
  0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2020-02-10  4:27 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: Martin Raiber, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 11783 bytes --]

On Sun, Feb 09, 2020 at 05:11:00PM -0600, Timothy Pearson wrote:
> 
> 
> ----- Original Message -----
> > From: "Martin Raiber" <martin@urbackup.org>
> > To: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>, "linux-btrfs" <linux-btrfs@vger.kernel.org>
> > Cc: "Timothy Pearson" <tpearson@raptorengineering.com>
> > Sent: Sunday, February 9, 2020 11:08:58 AM
> > Subject: Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
> 
> > On 09.02.2020 01:43 Zygo Blaxell wrote:
> >> I reproduced a filesystem rollback similar to one reported back in
> >> November by Timothy Pearson:
> >>
> >> 	https://www.spinics.net/lists/linux-btrfs/msg94318.html
> >>
> >> The timeline is something like this:
> >>
> >> 1.  System gets loaded, lots of writes, a few processes calling fsync().
> >> Fairly normal stuff.
> >>
> >> 2.  Just enough new data is written continuously to keep a transaction
> >> from finishing, but not enough to block new writes (i.e. the filesystem
> >> is kept in equilibrium between dirty_background_bytes and dirty_bytes).
> >> Typically an application gets stuck in fsync() here, but the rest of
> >> the system is unaffected.  Here is one:
> >>
> >> 	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
> >> 	==> /proc/31872/task/31872/stack <==
> >> 	[<0>] wait_on_page_bit+0x172/0x250
> >> 	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
> >> 	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
> >> 	[<0>] read_tree_block+0x4e/0x70
> >> 	[<0>] read_block_for_search.isra.34+0x1aa/0x350
> >> 	[<0>] btrfs_search_slot+0x20a/0x940
> >> 	[<0>] lookup_extent_data_ref+0x7e/0x210
> >> 	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
> >> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> >> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> >> 	[<0>] btrfs_commit_transaction+0x52/0xa70
> >> 	[<0>] btrfs_sync_file+0x248/0x490
> >> 	[<0>] vfs_fsync_range+0x48/0x80
> >> 	[<0>] do_fsync+0x3d/0x70
> >> 	[<0>] __x64_sys_fdatasync+0x17/0x20
> >> 	[<0>] do_syscall_64+0x5f/0x1f0
> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> 3.  The transid of the subvol roots never increases as long as the balance
> >> of incoming and outgoing data flows in #2 is maintained.  This can go on
> >> for hours or even days on recent kernels (much longer than was possible on
> >> 5.0). In this particular case it was 3 hours, in earlier cases I've caught
> >> it delayed by 14 hours or more, and been able to recover by pausing write
> >> workloads long enough for btrfs to keep up.  Here is an excerpt from bees
> >> logs showing this (the filesystem's current transid is the "count" field):
> >>
> >> 	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100
> >> 	transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio =
> >> 	7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }
> >>
> >> and 3 hours later the filesystem transid hasn't moved:
> >>
> >> 	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100
> >> 	transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio =
> >> 	6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
> >>
> >> 4.  Most writes continue without incident:  git commits, log files,
> >> bees dedupe, kernel builds, rsync all run normally.  Some things block:
> >> applications that call fsync() or sync() get stuck.  Snapshot creation
> >> blocks.  Maintenance balances, when the maintenance window opens, also
> >> block, and snapshot deletes are then blocked waiting for balance.
> >>
> >> In most cases this is as far as we get:  eventually something breaks
> >> the equilibrium from #2, the stalled transaction commit completes,
> >> and everything is normal; however, in this case, we never finish the
> >> transaction.  fsync (the same one!) is still running some hours later:
> >>
> >> 	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
> >> 	==> /proc/31872/task/31872/stack <==
> >> 	[<0>] btrfs_tree_lock+0x120/0x260
> >> 	[<0>] btrfs_lock_root_node+0x34/0x50
> >> 	[<0>] btrfs_search_slot+0x4d5/0x940
> >> 	[<0>] lookup_inline_extent_backref+0x164/0x5a0
> >> 	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
> >> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> >> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> >> 	[<0>] btrfs_commit_transaction+0x52/0xa70
> >> 	[<0>] btrfs_sync_file+0x248/0x490
> >> 	[<0>] vfs_fsync_range+0x48/0x80
> >> 	[<0>] do_fsync+0x3d/0x70
> >> 	[<0>] __x64_sys_fdatasync+0x17/0x20
> >> 	[<0>] do_syscall_64+0x5f/0x1f0
> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> We know it's the same fsync call because a supervisor process killed
> >> pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
> >> stack changes continuously the whole time--but it doesn't finish running.
> >>
> >> 5.  2 hours later, lstat() in the watchdog daemon blocks:
> >>
> >> 	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
> >> 	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
> >> 	==> /proc/4506/task/4506/stack <==
> >> 	[<0>] lookup_slow+0x2c/0x60
> >> 	[<0>] walk_component+0x1bf/0x330
> >> 	[<0>] path_lookupat.isra.44+0x6d/0x220
> >> 	[<0>] filename_lookup.part.59+0xa0/0x170
> >> 	[<0>] user_path_at_empty+0x3e/0x50
> >> 	[<0>] vfs_statx+0x76/0xe0
> >> 	[<0>] __do_sys_newlstat+0x3d/0x70
> >> 	[<0>] __x64_sys_newlstat+0x16/0x20
> >> 	[<0>] do_syscall_64+0x5f/0x1f0
> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> Up to that point, a few processes have been blocked for up to 5 hours,
> >> but this is not unusual on a big filesystem given #1.  Usually processes
> >> that read the filesystem (e.g. calling lstat) are not blocked, unless they
> >> try to access a directory being modified by a process that is blocked.
> >> lstat() being blocked is unusual.
> >>
> >> 6.  60 seconds later, the system reboots due to watchdog timeout.
> >>
> >> Upon reboot, the filesystem reverts to its state at the last completed
> >> transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> >> be intact, but there is no trace of any update to the filesystem after
> >> the transaction 4441796.  The last 'fi usage' logged before the crash
> >> and the first 'fi usage' after show 40GB of data and 25GB of metadata
> >> block groups freed in between.
> >>
> >> I have only observed this on kernel 5.4, but I've seen the commit blocking
> >> behavior twice in two days.  I have not observed it so far on 5.0 and
> >> earlier (since it was released in March 2019).  I don't have data from
> >> kernels in between due to other test-blocking bugs.
> >>
> >> TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
> >> I've seen some transactions on 5.0 running for 5-10 minutes, and I
> >> haven't been checking for this specific behavior in earlier testing;
> >> however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
> >> I'd expect somebody would have noticed.
> >>
> >> On kernel 5.4 we see fewer processes blocked under heavy write loads,
> >> but the processes that do block are blocked longer.  In particular, our
> >> watchdog daemon, which creates and removes a directory every 30 seconds
> >> to detect lockups, didn't fire.  The system ran for 3 hours before the
> >> watchdog detected a problem in this case, and the previous day the system
> >> ran 14 hours without completing a transaction and the watchdog didn't
> >> fire at all.  We'll add an fsync to the watchdog too, as the logs for
> >> systems running 5.4 are full of processes stuck many hours in fsync.
> >>
> >> Going forward we will add an alert to our server monitoring to verify that
> >> the superblock's generation number field updates at regular intervals
> >> (at least once an hour) and apply a workaround if not.  I'll also add
> >> to my test workload and start a bisect to see if this is a regression in
> >> recent kernels.
> >>
> >> There is a workaround:  detect when the superblock generation field stops
> >> updating, and pause (freeze or SIGSTOP) all writing processes long enough
> >> for btrfs to catch up and complete the current transaction.
> > 
> > I have the same problem. Have a dozen threads with high throughput
> > (simply writing data to a single file sequentially) combined with a
> > dozen threads doing metadata-heavy workloads and it'll take a long time
> > for a sync() to finish. Work-around is the same as well: Throttle the
> > application if the sync doesn't finish after a (configurable) amount of
> > time.
> > I have seen the same problem with ZFS on FreeBSD, though, so it is by no
> > means btrfs or even Linux specific. My guess is that since most file
> > systems are constrained/throttled by journal (size), there is no
> > mechanism for metadata vs. data fairness.
> > As for most congestion problems the best solution is to increase
> > capacity (storage IOPS) so everything runs below max capacity most of
> > the time or to throttle at the source to reach the same goal.
> > 
> > https://www.spinics.net/lists/linux-btrfs/msg72909.html was my report
> > back then, maybe take a look at the number of btrfs_delayed_ref_head in
> > slabtop as well?
> 
> If I might suggest a stopgap measure, simply firing a warning in the
> kernel log to the effect of "metadata write bandwidth insufficient,
> DATA LOSS MAY OCCUR" in this situation would be most helpful.  As it

This is fairly easy in userspace:

        #!/bin/sh
        fsDev="$1"
        lastGen="$(btrfs ins dump-super "$fsDev" | grep ^generation)"
        while true; do
                sleep 15m
                thisGen="$(btrfs ins dump-super "$fsDev" | grep ^generation)"
                if [ "$thisGen" = "$lastGen" ]; then
                        echo "BTRFS: DATA LOSS MAY OCCUR on $fsDev: $thisGen" > /dev/kmsg
                fi
                lastGen="$thisGen"
        done

Adjust the "15m" to whatever interval of data loss you can tolerate,
but be warned that setting it to less than 5 minutes will generate a
crapton of noise warnings on a busy server whenever you create a snapshot.

> stands right now largest problem isn't so much the reversion effect
> itself as the silence until the data loss occurs -- the system isn't
> sending any messages that anything is potentially wrong, so using
> BTRFS feels like playing Russian Roulette at the moment  Downstream
> apps don't exactly tend to list their maximum sustained IOPS, so
> trying to balance array IOPS with application IOPS across a large
> server cluster is not feasible from a pure calculation standpoint,
> and is normally done semi-empirically knowing the worst case effect
> is app stall, not data loss for an indeterminate, relatively large
> period along with potential data corruption.
> 
> This same thought applies to ZFS, assuming it is also silently eating
> data in the situation mentioned.  Very interesting to know it's a
> general problem with this class of storage.

I'm pretty sure it's a new bug in btrfs.  I've been monitoring
this server running 5.0 since Friday, and the superblock generation
never gets _too_ out of date, though it does fall some minutes behind.
There are occasional hour-long transactions, but we don't get data loss
with those because we can't write any data during that hour.  Also this
hasn't happened in kernel versions 3.18..4.20 that I've ever seen.

I think btrfs got too successful at unblocking writes during commits,
and we need to start blocking them again (ideally by blocking the writes
that make commits run longer, and maybe some of the reads too).


> Thanks!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-10  1:49 ` Chris Murphy
@ 2020-02-10  5:18   ` Zygo Blaxell
  2020-02-10  7:52     ` Chris Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2020-02-10  5:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Timothy Pearson

[-- Attachment #1: Type: text/plain, Size: 3917 bytes --]

On Sun, Feb 09, 2020 at 06:49:11PM -0700, Chris Murphy wrote:
> On Sat, Feb 8, 2020 at 5:43 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > Upon reboot, the filesystem reverts to its state at the last completed
> > transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> > be intact, but there is no trace of any update to the filesystem after
> > the transaction 4441796.  The last 'fi usage' logged before the crash
> > and the first 'fi usage' after show 40GB of data and 25GB of metadata
> > block groups freed in between.
> 
> Is this behavior affected by flushoncommit mount option? i.e. do you
> see a difference using flushoncommit vs noflushoncommit? My suspicion
> is the problem doesn't happen with noflushoncommit, but then you get
> another consequence that maybe your use case can't tolerate?

Sigh...the first three things anyone suggests when I talk about btrfs's
ridiculous commit latency are:

	1.  Have you tried sysctl vm.dirty_background_bytes?

	2.  Have you tried turning off flushoncommit?

	3.  Have you tried cgroupsv2 parameter xyz?

as if those were not the first things I'd try, or as if I hadn't set up
a test farm to run random sets of parameter combinations (including
discard, ssd cache modes, etc.) over the last five years to see if any
combination of these parameters made btrfs go faster.

I know what doesn't work:  Very low values of vm.dirty_background_bytes
can certainly harm throughput, but once it's above 100M or so it makes
no difference.  Some SSDs are terrible with discard, others need it
to avoid crippling performance losses every few months.  Writeback SSD
caches get flooded with data thanks to btrfs's already scary-fast write
path, and end up adding additional latency.  cgroupsv2 measures the wrong
things, so it reports IO stall pressure of zero in high-priority cgroups
while all writes are blocked and some low-priority cgroup desperately
needs to be throttled.  If you throttle anything on btrfs at the block
level, you get priority inversion, because it's impossible to predict
which thread will end up hosting its very own btrfs transaction commit,
and nobody gets to write anything while one of those is running (well,
on 5.3+, apparently lots of processes can continue to write, but nothing
they write will be persisted after a crash).

When the kernel hits vm.dirty_bytes, none of the other settings matter:
the performance difference between flushoncommit and noflushoncommit
is the order of the writes during a commit, but the commit is always
dumping all the dirty pages that the kernel can store in RAM on disk.
noflushoncommit allows the kernel to dump the pages in the wrong order,
but has no performance advantages.  noflushoncommit might even make the
latency a little _worse_.
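
For anyone following along at home, these are the knobs in question;
the values below are arbitrary examples, not recommendations:

	#!/bin/sh
	# The tunables and mount option discussed above.
	sysctl -w vm.dirty_background_bytes=268435456	# 256M
	sysctl -w vm.dirty_bytes=1073741824		# 1G
	mount -o remount,flushoncommit /mnt/fs		# or noflushoncommit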

Profiling indicates that btrfs spends most of its time _reading_ the
filesystem during commits.  Roughly half the IO is metadata reads for
extent and csum trees, the other half is writing updated versions of
these, and maybe 1% is writing the data blocks.  While all that's going
on, more and more stuff gets locked, until eventually transactions stop
dead on kernels up to 5.0, or keep going forever on kernel 5.3 and later.
Freezing all reading processes helps the commit finish faster, but it
needs crippling levels of throttling (like 0.1% of raw write bandwidth
of the slowest disk in the array, or even less) before making a dent in
the big latency spikes.
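
That read/write split is easy to eyeball with plain iostat while a
commit grinds along; a coarse check, not the profiling itself:

	#!/bin/sh
	# Watch read vs. write throughput on the btrfs devices while a
	# commit is running (sysstat's iostat, extended stats, 5s interval).
	iostat -dxm 5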

I'm not quite sure what's different on 5.4--there were a lot of changes
and I haven't been doing profiling because I've been focused on fixing
the crashing bugs until recently.  5.4 has apparently moved the latency
to different places--ordinary writing processes no longer block at all,
while processes calling rename or fsync can be blocked for entire days.

> 
> -- 
> Chris Murphy

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-10  4:27     ` Zygo Blaxell
@ 2020-02-10  5:18       ` Timothy Pearson
  0 siblings, 0 replies; 11+ messages in thread
From: Timothy Pearson @ 2020-02-10  5:18 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Martin Raiber, linux-btrfs



----- Original Message -----
> From: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>
> To: "Timothy Pearson" <tpearson@raptorengineering.com>
> Cc: "Martin Raiber" <martin@urbackup.org>, "linux-btrfs" <linux-btrfs@vger.kernel.org>
> Sent: Sunday, February 9, 2020 10:27:34 PM
> Subject: Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11

> On Sun, Feb 09, 2020 at 05:11:00PM -0600, Timothy Pearson wrote:
>> 
>> 
>> ----- Original Message -----
>> > From: "Martin Raiber" <martin@urbackup.org>
>> > To: "Zygo Blaxell" <ce3g8jdj@umail.furryterror.org>, "linux-btrfs"
>> > <linux-btrfs@vger.kernel.org>
>> > Cc: "Timothy Pearson" <tpearson@raptorengineering.com>
>> > Sent: Sunday, February 9, 2020 11:08:58 AM
>> > Subject: Re: data rolled back 5 hours after crash, long fsync running times,
>> > watchdog evasion on 5.4.11
>> 
>> > On 09.02.2020 01:43 Zygo Blaxell wrote:
>> >> I reproduced a filesystem rollback similar to one reported back in
>> >> November by Timothy Pearson:
>> >>
>> >> 	https://www.spinics.net/lists/linux-btrfs/msg94318.html
>> >>
>> >> The timeline is something like this:
>> >>
>> >> 1.  System gets loaded, lots of writes, a few processes calling fsync().
>> >> Fairly normal stuff.
>> >>
>> >> 2.  Just enough new data is written continuously to keep a transaction
>> >> from finishing, but not enough to block new writes (i.e. the filesystem
>> >> is kept in equilibrium between dirty_background_bytes and dirty_bytes).
>> >> Typically an application gets stuck in fsync() here, but the rest of
>> >> the system is unaffected.  Here is one:
>> >>
>> >> 	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
>> >> 	==> /proc/31872/task/31872/stack <==
>> >> 	[<0>] wait_on_page_bit+0x172/0x250
>> >> 	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
>> >> 	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
>> >> 	[<0>] read_tree_block+0x4e/0x70
>> >> 	[<0>] read_block_for_search.isra.34+0x1aa/0x350
>> >> 	[<0>] btrfs_search_slot+0x20a/0x940
>> >> 	[<0>] lookup_extent_data_ref+0x7e/0x210
>> >> 	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
>> >> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
>> >> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
>> >> 	[<0>] btrfs_commit_transaction+0x52/0xa70
>> >> 	[<0>] btrfs_sync_file+0x248/0x490
>> >> 	[<0>] vfs_fsync_range+0x48/0x80
>> >> 	[<0>] do_fsync+0x3d/0x70
>> >> 	[<0>] __x64_sys_fdatasync+0x17/0x20
>> >> 	[<0>] do_syscall_64+0x5f/0x1f0
>> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> >>
>> >> 3.  The transid of the subvol roots never increases as long as the balance
>> >> of incoming and outgoing data flows in #2 is maintained.  This can go on
>> >> for hours or even days on recent kernels (much longer than was possible on
>> >> 5.0). In this particular case it was 3 hours, in earlier cases I've caught
>> >> it delayed by 14 hours or more, and been able to recover by pausing write
>> >> workloads long enough for btrfs to keep up.  Here is an excerpt from bees
>> >> logs showing this (the filesystem's current transid is the "count" field):
>> >>
>> >> 	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100
>> >> 	transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio =
>> >> 	7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }
>> >>
>> >> and 3 hours later the filesystem transid hasn't moved:
>> >>
>> >> 	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100
>> >> 	transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio =
>> >> 	6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
>> >>
>> >> 4.  Most writes continue without incident:  git commits, log files,
>> >> bees dedupe, kernel builds, rsync all run normally.  Some things block:
>> >> applications that call fsync() or sync() get stuck.  Snapshot creation
>> >> blocks.  Maintenance balances, when the maintenance window opens, also
>> >> block, and snapshot deletes are then blocked waiting for balance.
>> >>
>> >> In most cases this is as far as we get:  eventually something breaks
>> >> the equilibrium from #2, the stalled transaction commit completes,
>> >> and everything is normal; however, in this case, we never finish the
>> >> transaction.  fsync (the same one!) is still running some hours later:
>> >>
>> >> 	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
>> >> 	==> /proc/31872/task/31872/stack <==
>> >> 	[<0>] btrfs_tree_lock+0x120/0x260
>> >> 	[<0>] btrfs_lock_root_node+0x34/0x50
>> >> 	[<0>] btrfs_search_slot+0x4d5/0x940
>> >> 	[<0>] lookup_inline_extent_backref+0x164/0x5a0
>> >> 	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
>> >> 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
>> >> 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
>> >> 	[<0>] btrfs_commit_transaction+0x52/0xa70
>> >> 	[<0>] btrfs_sync_file+0x248/0x490
>> >> 	[<0>] vfs_fsync_range+0x48/0x80
>> >> 	[<0>] do_fsync+0x3d/0x70
>> >> 	[<0>] __x64_sys_fdatasync+0x17/0x20
>> >> 	[<0>] do_syscall_64+0x5f/0x1f0
>> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> >>
>> >> We know it's the same fsync call because a supervisor process killed
>> >> pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
>> >> stack changes continuously the whole time--but it doesn't finish running.
>> >>
>> >> 5.  2 hours later, lstat() in the watchdog daemon blocks:
>> >>
>> >> 	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
>> >> 	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
>> >> 	==> /proc/4506/task/4506/stack <==
>> >> 	[<0>] lookup_slow+0x2c/0x60
>> >> 	[<0>] walk_component+0x1bf/0x330
>> >> 	[<0>] path_lookupat.isra.44+0x6d/0x220
>> >> 	[<0>] filename_lookup.part.59+0xa0/0x170
>> >> 	[<0>] user_path_at_empty+0x3e/0x50
>> >> 	[<0>] vfs_statx+0x76/0xe0
>> >> 	[<0>] __do_sys_newlstat+0x3d/0x70
>> >> 	[<0>] __x64_sys_newlstat+0x16/0x20
>> >> 	[<0>] do_syscall_64+0x5f/0x1f0
>> >> 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> >>
>> >> Up to that point, a few processes have been blocked for up to 5 hours,
>> >> but this is not unusual on a big filesystem given #1.  Usually processes
>> >> that read the filesystem (e.g. calling lstat) are not blocked, unless they
>> >> try to access a directory being modified by a process that is blocked.
>> >> lstat() being blocked is unusual.
>> >>
>> >> 6.  60 seconds later, the system reboots due to watchdog timeout.
>> >>
>> >> Upon reboot, the filesystem reverts to its state at the last completed
>> >> transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
>> >> be intact, but there is no trace of any update to the filesystem after
>> >> the transaction 4441796.  The last 'fi usage' logged before the crash
>> >> and the first 'fi usage' after show 40GB of data and 25GB of metadata
>> >> block groups freed in between.
>> >>
>> >> I have only observed this on kernel 5.4, but I've seen the commit blocking
>> >> behavior twice in two days.  I have not observed it so far on 5.0 and
>> >> earlier (since it was released in March 2019).  I don't have data from
>> >> kernels in between due to other test-blocking bugs.
>> >>
>> >> TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
>> >> I've seen some transactions on 5.0 running for 5-10 minutes, and I
>> >> haven't been checking for this specific behavior in earlier testing;
>> >> however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
>> >> I'd expect somebody would have noticed.
>> >>
>> >> On kernel 5.4 we see fewer processes blocked under heavy write loads,
>> >> but the processes that do block are blocked longer.  In particular, our
>> >> watchdog daemon, which creates and removes a directory every 30 seconds
>> >> to detect lockups, didn't fire.  The system ran for 3 hours before the
>> >> watchdog detected a problem in this case, and the previous day the system
>> >> ran 14 hours without completing a transaction and the watchdog didn't
>> >> fire at all.  We'll add an fsync to the watchdog too, as the logs for
>> >> systems running 5.4 are full of processes stuck many hours in fsync.
>> >>
>> >> Going forward we will add an alert to our server monitoring to verify that
>> >> the superblock's generation number field updates at regular intervals
>> >> (at least once an hour) and apply a workaround if not.  I'll also add
>> >> to my test workload and start a bisect to see if this is a regression in
>> >> recent kernels.
>> >>
>> >> There is a workaround:  detect when the superblock generation field stops
>> >> updating, and pause (freeze or SIGSTOP) all writing processes long enough
>> >> for btrfs to catch up and complete the current transaction.
>> > 
>> > I have the same problem. Have a dozen threads with high throughput
>> > (simply writing data to a single file sequentially) combined with a
>> > dozen threads doing metadata-heavy workloads and it'll take a long time
>> > for a sync() to finish. Work-around is the same as well: Throttle the
>> > application if the sync doesn't finish after a (configurable) amount of
>> > time.
>> > I have seen the same problem with ZFS on FreeBSD, though, so it is by no
>> > means btrfs or even Linux specific. My guess is that since most file
>> > systems are constrained/throttled by journal (size), there is no
>> > mechanism for metadata vs. data fairness.
>> > As for most congestion problems the best solution is to increase
>> > capacity (storage IOPS) so everything runs below max capacity most of
>> > the time or to throttle at the source to reach the same goal.
>> > 
>> > https://www.spinics.net/lists/linux-btrfs/msg72909.html was my report
>> > back then, maybe take a look at the number of btrfs_delayed_ref_head in
>> > slabtop as well?
>> 
>> If I might suggest a stopgap measure, simply firing a warning in the
>> kernel log to the effect of "metadata write bandwidth insufficient,
>> DATA LOSS MAY OCCUR" in this situation would be most helpful.  As it
> 
> This is fairly easy in userspace:
> 
>        #!/bin/sh
>        # warn if the superblock generation has not advanced since the
>        # last check, i.e. no transaction has committed in the interval
>        fsDev="$1"
>        lastGen="$(btrfs ins dump-super "$fsDev" | grep ^generation)"
>        while true; do
>                sleep 15m
>                thisGen="$(btrfs ins dump-super "$fsDev" | grep ^generation)"
>                if [ "$thisGen" = "$lastGen" ]; then
>                        echo "BTRFS: DATA LOSS MAY OCCUR on $fsDev: $thisGen" > /dev/kmsg
>                fi
>                lastGen="$thisGen"
>        done
> 
> Adjust the "15m" to whatever interval of data loss you can tolerate,
> but be warned that setting it to less than 5 minutes will generate a
> crapton of noise warnings on a busy server whenever you create a snapshot.
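> 
> The pause half of the workaround described above can be just as crude.
> Rough sketch (untested; assumes fuser from psmisc, and that it's
> acceptable to SIGSTOP every process touching the mount, readers
> included -- don't run it from a shell living on that filesystem or it
> will stop itself):
> 
>        #!/bin/sh
>        # stop everything using the filesystem until the superblock
>        # generation advances, i.e. until a transaction finally commits
>        fsDev="$1"; mnt="$2"
>        gen() { btrfs ins dump-super "$fsDev" | grep ^generation; }
>        before="$(gen)"
>        pids="$(fuser -m "$mnt" 2>/dev/null)"
>        [ -n "$pids" ] || exit 0
>        kill -STOP $pids
>        while [ "$(gen)" = "$before" ]; do sleep 30; done
>        kill -CONT $pids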
> 
>> stands right now, the largest problem isn't so much the reversion effect
>> itself as the silence until the data loss occurs -- the system isn't
>> sending any messages that anything is potentially wrong, so using
>> BTRFS feels like playing Russian Roulette at the moment.  Downstream
>> apps don't exactly tend to list their maximum sustained IOPS, so
>> trying to balance array IOPS with application IOPS across a large
>> server cluster is not feasible from a pure calculation standpoint;
>> it is normally done semi-empirically, on the assumption that the
>> worst-case effect is an app stall, not data loss over an indeterminate,
>> relatively large period along with potential data corruption.
>> 
>> This same thought applies to ZFS, assuming it is also silently eating
>> data in the situation mentioned.  Very interesting to know it's a
>> general problem with this class of storage.
> 
> I'm pretty sure it's a new bug in btrfs.  I've been monitoring
> this server running 5.0 since Friday, and the superblock generation
> never gets _too_ out of date, though it does fall some minutes behind.
> There are occasional hour-long transactions, but we don't get data loss
> with those because we can't write any data during that hour.  Also, I've
> never seen this happen on any kernel in the 3.18..4.20 range.
> 
> I think btrfs got too successful at unblocking writes during commits,
> and we need to start blocking them again (ideally by blocking the writes
> that make commits run longer, and maybe some of the reads too).
> 
> 

Good to know.  I'd be in favor of trying not to block the reads where
possible; at least on our end, we have workloads where blocking reads to
the remainder of a busy volume would be fairly catastrophic, whereas a
write stall isn't a significant problem.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-09 17:08 ` Martin Raiber
  2020-02-09 23:11   ` Timothy Pearson
@ 2020-02-10  5:20   ` Zygo Blaxell
  1 sibling, 0 replies; 11+ messages in thread
From: Zygo Blaxell @ 2020-02-10  5:20 UTC (permalink / raw)
  To: Martin Raiber; +Cc: linux-btrfs, Timothy Pearson

[-- Attachment #1: Type: text/plain, Size: 8900 bytes --]

On Sun, Feb 09, 2020 at 05:08:58PM +0000, Martin Raiber wrote:
> On 09.02.2020 01:43 Zygo Blaxell wrote:
> > I reproduced a filesystem rollback similar to one reported back in
> > November by Timothy Pearson:
> >
> > 	https://www.spinics.net/lists/linux-btrfs/msg94318.html
> >
> > The timeline is something like this:
> >
> > 1.  System gets loaded, lots of writes, a few processes calling fsync().
> > Fairly normal stuff.
> >
> > 2.  Just enough new data is written continuously to keep a transaction
> > from finishing, but not enough to block new writes (i.e. the filesystem
> > is kept in equilibrium between dirty_background_bytes and dirty_bytes).
> > Typically an application gets stuck in fsync() here, but the rest of
> > the system is unaffected.  Here is one:
> >
> > 	TIMESTAMP: Fri Feb  7 01:03:21 EST 2020
> > 	==> /proc/31872/task/31872/stack <==
> > 	[<0>] wait_on_page_bit+0x172/0x250
> > 	[<0>] read_extent_buffer_pages+0x2e5/0x3a0
> > 	[<0>] btree_read_extent_buffer_pages+0xa3/0x120
> > 	[<0>] read_tree_block+0x4e/0x70
> > 	[<0>] read_block_for_search.isra.34+0x1aa/0x350
> > 	[<0>] btrfs_search_slot+0x20a/0x940
> > 	[<0>] lookup_extent_data_ref+0x7e/0x210
> > 	[<0>] __btrfs_free_extent.isra.45+0x22f/0xa10
> > 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> > 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> > 	[<0>] btrfs_commit_transaction+0x52/0xa70
> > 	[<0>] btrfs_sync_file+0x248/0x490
> > 	[<0>] vfs_fsync_range+0x48/0x80
> > 	[<0>] do_fsync+0x3d/0x70
> > 	[<0>] __x64_sys_fdatasync+0x17/0x20
> > 	[<0>] do_syscall_64+0x5f/0x1f0
> > 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > 3.  The transid of the subvol roots never increases as long as the balance
> > of incoming and outgoing data flows in #2 is maintained.  This can go on
> > for hours or even days on recent kernels (much longer than was possible on
> > 5.0). In this particular case it was 3 hours, in earlier cases I've caught
> > it delayed by 14 hours or more, and been able to recover by pausing write
> > workloads long enough for btrfs to keep up.  Here is an excerpt from bees
> > logs showing this (the filesystem's current transid is the "count" field):
> >
> > 	2020-02-07 00:04:24 10386.10408<7> crawl_transid: Polling 561.882s for next 100 transid RateEstimator { count = 4441796, raw = 7412.98 / 42429.6, ratio = 7412.98 / 42440.2, rate = 0.174669, duration(1) = 5.72512, seconds_for(1) = 1 }
> >
> > and 3 hours later the filesystem transid hasn't moved:
> >
> > 	2020-02-07 03:06:53 10386.10408<7> crawl_transid: Polling 719.095s for next 100 transid RateEstimator { count = 4441796, raw = 6248.72 / 45912.8, ratio = 6248.72 / 45928.7, rate = 0.136053, duration(1) = 7.35009, seconds_for(1) = 1 }
> >
> > 4.  Most writes continue without incident:  git commits, log files,
> > bees dedupe, kernel builds, rsync all run normally.  Some things block:
> > applications that call fsync() or sync() get stuck.  Snapshot creation
> > blocks.  Maintenance balances, when the maintenance window opens, also
> > block, and snapshot deletes are then blocked waiting for balance.
> >
> > In most cases this is as far as we get:  eventually something breaks
> > the equilibrium from #2, the stalled transaction commit completes,
> > and everything is normal; however, in this case, we never finish the
> > transaction.  fsync (the same one!) is still running some hours later:
> >
> > 	TIMESTAMP: Fri Feb  7 03:47:40 EST 2020
> > 	==> /proc/31872/task/31872/stack <==
> > 	[<0>] btrfs_tree_lock+0x120/0x260
> > 	[<0>] btrfs_lock_root_node+0x34/0x50
> > 	[<0>] btrfs_search_slot+0x4d5/0x940
> > 	[<0>] lookup_inline_extent_backref+0x164/0x5a0
> > 	[<0>] __btrfs_free_extent.isra.45+0x1f0/0xa10
> > 	[<0>] __btrfs_run_delayed_refs+0x2d5/0x12d0
> > 	[<0>] btrfs_run_delayed_refs+0x105/0x1b0
> > 	[<0>] btrfs_commit_transaction+0x52/0xa70
> > 	[<0>] btrfs_sync_file+0x248/0x490
> > 	[<0>] vfs_fsync_range+0x48/0x80
> > 	[<0>] do_fsync+0x3d/0x70
> > 	[<0>] __x64_sys_fdatasync+0x17/0x20
> > 	[<0>] do_syscall_64+0x5f/0x1f0
> > 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > We know it's the same fsync call because a supervisor process killed
> > pid 31872 with SIGKILL at around 01:10.  It's not deadlocked here--the
> > stack changes continuously the whole time--but it doesn't finish running.
> >
> > 5.  2 hours later, lstat() in the watchdog daemon blocks:  
> >
> > 	TIMESTAMP: Fri Feb  7 05:15:07 EST 2020
> > 	4506 pts/10   DN+    0:00 /bin/bash /root/bin/watchdog-mdtest
> > 	==> /proc/4506/task/4506/stack <==
> > 	[<0>] lookup_slow+0x2c/0x60
> > 	[<0>] walk_component+0x1bf/0x330
> > 	[<0>] path_lookupat.isra.44+0x6d/0x220
> > 	[<0>] filename_lookup.part.59+0xa0/0x170
> > 	[<0>] user_path_at_empty+0x3e/0x50
> > 	[<0>] vfs_statx+0x76/0xe0
> > 	[<0>] __do_sys_newlstat+0x3d/0x70
> > 	[<0>] __x64_sys_newlstat+0x16/0x20
> > 	[<0>] do_syscall_64+0x5f/0x1f0
> > 	[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > Up to that point, a few processes have been blocked for up to 5 hours,
> > but this is not unusual on a big filesystem given #1.  Usually processes
> > that read the filesystem (e.g. calling lstat) are not blocked, unless they
> > try to access a directory being modified by a process that is blocked.
> > lstat() being blocked is unusual.
> >
> > 6.  60 seconds later, the system reboots due to watchdog timeout.
> >
> > Upon reboot, the filesystem reverts to its state at the last completed
> > transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> > be intact, but there is no trace of any update to the filesystem after
> > the transaction 4441796.  The last 'fi usage' logged before the crash
> > and the first 'fi usage' after show 40GB of data and 25GB of metadata
> > block groups freed in between.
> >
> > I have only observed this on kernel 5.4, but I've seen the commit blocking
> > behavior twice in two days.  I have not observed it so far on 5.0 and
> > earlier (since it was released in March 2019).  I don't have data from
> > kernels in between due to other test-blocking bugs.
> >
> > TBH I can't really say 5.0 _doesn't_ do this--while writing this post,
> > I've seen some transactions on 5.0 running for 5-10 minutes, and I
> > haven't been checking for this specific behavior in earlier testing;
> > however, 5.0 crashes a lot, so if there was a behavior like this in 5.0,
> > I'd expect somebody would have noticed.
> >
> > On kernel 5.4 we see fewer processes blocked under heavy write loads,
> > but the processes that do block are blocked longer.  In particular, our
> > watchdog daemon, which creates and removes a directory every 30 seconds
> > to detect lockups, didn't fire.  The system ran for 3 hours before the
> > watchdog detected a problem in this case, and the previous day the system
> > ran 14 hours without completing a transaction and the watchdog didn't
> > fire at all.  We'll add an fsync to the watchdog too, as the logs for
> > systems running 5.4 are full of processes stuck many hours in fsync.
> >
> > Going forward we will add an alert to our server monitoring to verify that
> > the superblock's generation number field updates at regular intervals
> > (at least once an hour) and apply a workaround if not.  I'll also add
> > to my test workload and start a bisect to see if this is a regression in
> > recent kernels.
> >
> > There is a workaround:  detect when the superblock generation field stops
> > updating, and pause (freeze or SIGSTOP) all writing processes long enough
> > for btrfs to catch up and complete the current transaction.
> 
> I have the same problem. Have a dozen threads with high throughput
> (simply writing data to a single file sequentially) combined with a
> dozen threads doing metadata-heavy workloads and it'll take a long time
> for a sync() to finish. Work-around is the same as well: Throttle the
> application if the sync doesn't finish after a (configurable) amount of
> time.
> I have seen the same problem with ZFS on FreeBSD, though, so it is by no
> means btrfs or even Linux specific. My guess is that since most file
> systems are constrained/throttled by journal (size), there is no
> mechanism for metadata vs. data fairness.
> As for most congestion problems the best solution is to increase
> capacity (storage IOPS) so everything runs below max capacity most of
> the time or to throttle at the source to reach the same goal.
> 
> https://www.spinics.net/lists/linux-btrfs/msg72909.html was my report
> back then, 

...in 2017!  A lot of stuff has changed since then.  Heck, a lot of stuff
has changed in the last six months...

> maybe take a look at the number of btrfs_delayed_ref_head in
> slabtop as well?

Interesting.  I'll add that to observations on future test runs.
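
Something like this is probably enough to sample it (reading
/proc/slabinfo needs root, and this assumes SLUB hasn't merged the cache
into a generic one -- booting with slub_nomerge avoids that):

	# log active/total btrfs_delayed_ref_head objects once a minute
	while sleep 60; do
		printf '%s ' "$(date '+%F %T')"
		awk '$1 == "btrfs_delayed_ref_head" { print "active=" $2, "total=" $3 }' \
			/proc/slabinfo
	done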

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11
  2020-02-10  5:18   ` Zygo Blaxell
@ 2020-02-10  7:52     ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2020-02-10  7:52 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Btrfs BTRFS, Timothy Pearson

On Sun, Feb 9, 2020 at 10:18 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Sun, Feb 09, 2020 at 06:49:11PM -0700, Chris Murphy wrote:
> > On Sat, Feb 8, 2020 at 5:43 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > Upon reboot, the filesystem reverts to its state at the last completed
> > > transaction 4441796 at #2, which is 5 hours earlier.  Everything seems to
> > > be intact, but there is no trace of any update to the filesystem after
> > > the transaction 4441796.  The last 'fi usage' logged before the crash
> > > and the first 'fi usage' after show 40GB of data and 25GB of metadata
> > > block groups freed in between.
> >
> > Is this behavior affected by flushoncommit mount option? i.e. do you
> > see a difference using flushoncommit vs noflushoncommit? My suspicion
> > is the problem doesn't happen with noflushoncommit, but then you get
> > another consequence that maybe your use case can't tolerate?
>
> Sigh...the first three things anyone suggests when I talk about btrfs's
> ridiculous commit latency are:
>
>         1.  Have you tried sysctl vm.dirty_background_bytes?
>
>         2.  Have you tried turning off flushoncommit?
>
>         3.  Have you tried cgroupsv2 parameter xyz?
>
> as if those were not the first things I'd try, or as if I hadn't set up
> a test farm to run random sets of parameter combinations (including
> discard, ssd cache modes, etc.) over the last five years to see whether
> any combination of these parameters made btrfs go faster.
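> 
> For the record, the knobs in question look like this (the path and the
> byte values are placeholders, just to show the shape of the tuning):
> 
>         # shrink the dirty page pool so a commit has less to chew on
>         sysctl -w vm.dirty_background_bytes=$((128*1024*1024))
>         sysctl -w vm.dirty_bytes=$((512*1024*1024))
>         # flushoncommit off (noflushoncommit is the default anyway)
>         mount -o remount,noflushoncommit /mnt/data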

Nope. It was a yes or no question, not a suggestion. I don't
understand the problem enough to make a suggestion.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-02-10  7:53 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-09  0:43 data rolled back 5 hours after crash, long fsync running times, watchdog evasion on 5.4.11 Zygo Blaxell
2020-02-09  9:00 ` Martin Steigerwald
2020-02-10  4:10   ` Zygo Blaxell
2020-02-09 17:08 ` Martin Raiber
2020-02-09 23:11   ` Timothy Pearson
2020-02-10  4:27     ` Zygo Blaxell
2020-02-10  5:18       ` Timothy Pearson
2020-02-10  5:20   ` Zygo Blaxell
2020-02-10  1:49 ` Chris Murphy
2020-02-10  5:18   ` Zygo Blaxell
2020-02-10  7:52     ` Chris Murphy
