* XFS filesystem hang
From: Daniel Aberger - Profihost AG @ 2019-01-17 11:14 UTC
  To: linux-xfs

Hi,

one of our servers has crashed / hung several times over the past few days
with similar XFS-related errors. Unfortunately we are unable to interpret
the call trace in dmesg to pinpoint the exact problem, and I hope you can
help me do so.

We already ran xfs_repair to ensure that there are no filesystem errors.

Here is an example dmesg output from a recent crash of the mentioned server:

2019-01-12 06:06:35     INFO: task mysqld:1171 blocked for more than 120 seconds.
2019-01-12 06:06:35     Tainted: G 4.12.0+139-ph #1
2019-01-12 06:06:35     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-01-12 06:06:35     mysqld D 0 1171 1 0x00080000
2019-01-12 06:06:35     Call Trace:
2019-01-12 06:06:35     ? __schedule+0x3bc/0x820
2019-01-12 06:06:35     ? strlcpy+0x31/0x40
2019-01-12 06:06:35     ? kernfs_path_from_node_locked+0x238/0x320
2019-01-12 06:06:35     schedule+0x32/0x80
2019-01-12 06:06:35     xlog_grant_head_wait+0xca/0x200
2019-01-12 06:06:35     xlog_grant_head_check+0x86/0xe0
2019-01-12 06:06:35     xfs_log_reserve+0xc7/0x1c0
2019-01-12 06:06:35     xfs_trans_reserve+0x169/0x1c0
2019-01-12 06:06:35     xfs_trans_alloc+0xb9/0x130
2019-01-12 06:06:35     xfs_vn_update_time+0x4e/0x130
2019-01-12 06:06:35     file_update_time+0xa7/0x100
2019-01-12 06:06:35     xfs_file_aio_write_checks+0x178/0x1a0
2019-01-12 06:06:35     ? ktime_get+0x3e/0xa0
2019-01-12 06:06:35     xfs_file_dio_aio_write+0xb1/0x230
2019-01-12 06:06:35     xfs_file_write_iter+0xff/0x150
2019-01-12 06:06:35     aio_write+0xf6/0x150
2019-01-12 06:06:35     ? queue_unplugged+0x25/0xa0
2019-01-12 06:06:35     ? kmem_cache_alloc+0xf7/0x570
2019-01-12 06:06:35     ? do_io_submit+0x35c/0x690
2019-01-12 06:06:35     do_io_submit+0x35c/0x690
2019-01-12 06:06:35     ? do_syscall_64+0x74/0x150
2019-01-12 06:06:35     do_syscall_64+0x74/0x150
2019-01-12 06:06:35     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
2019-01-12 06:06:35     RIP: 0033:0x7f37aa6b6717
2019-01-12 06:06:35     RSP: 002b:00007f32953f4228 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
2019-01-12 06:06:35     RAX: ffffffffffffffda RBX: 00000000000040f6 RCX: 00007f37aa6b6717
2019-01-12 06:06:35     RDX: 00007f32953f4230 RSI: 0000000000000001 RDI: 00007f37aaef9000
2019-01-12 06:06:35     RBP: 00007f32953f4250 R08: 00007f32953f4230 R09: 0000000000000000
2019-01-12 06:06:35     R10: 0000000000010000 R11: 0000000000000206 R12: 00007f32953f4370
2019-01-12 06:06:35     R13: 00007f37a806d158 R14: 00007f32a5829b98 R15: 00007f379df26100

Thanks,

Daniel


* Re: XFS filesystem hang
From: Brian Foster @ 2019-01-17 12:34 UTC
  To: Daniel Aberger - Profihost AG; +Cc: linux-xfs

On Thu, Jan 17, 2019 at 12:14:11PM +0100, Daniel Aberger - Profihost AG wrote:
> Hi,
> 
> one of our servers crashed / hung several times in the past days with
> similar errors regarding XFS. Unfortunately we are unable to interprete
> the call trace of dmesg to pinpoint the exact problem and I hope you
> could help me to do so.
> 
> We already ran xfs_repair to ensure that there are no filesystem errors.
> 
> Here is an example dmesg output of recent crash of the mentioned server:
> 
> 2019-01-12 06:06:35     INFO: task mysqld:1171 blocked for more than 120
> seconds.

This is not a crash but rather an indication from the kernel that this
task has been sitting here for at least a couple minutes. This doesn't
tell us whether the task will eventually recover or remain blocked
indefinitely.
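
(In case it's useful: the hung-task detector and the stacks of blocked
tasks can be inspected with the standard kernel knobs below - just a
rough sketch, run as root and adjust to taste:)

    # current warning threshold in seconds; 0 disables the warning entirely
    cat /proc/sys/kernel/hung_task_timeout_secs

    # dump the stacks of all blocked (D state) tasks into dmesg
    # (requires sysrq to be enabled, e.g. kernel.sysrq=1)
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200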

> 2019-01-12 06:06:35     Tainted: G 4.12.0+139-ph #1
> 2019-01-12 06:06:35     "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 2019-01-12 06:06:35     mysqld D 0 1171 1 0x00080000
> 2019-01-12 06:06:35     Call Trace:
> 2019-01-12 06:06:35     ? __schedule+0x3bc/0x820
> 2019-01-12 06:06:35     ? strlcpy+0x31/0x40
> 2019-01-12 06:06:35     ?
> kernfs_path_from_node_locked+0x238/0x320schedule+0x32/0x80xlog_grant_head_wait+0xca/0x200xlog_grant_head_check+0x86/0xe0xfs_log_reserve+0xc7/0x1c0xfs_trans_reserve+0x169/0x1c0xfs_trans_alloc+0xb9/0x130xfs_vn_update_time+0x4e/0x130file_update_time+0xa7/0x100xfs_file_aio_write_checks+0x178/0x1a0
> 2019-01-12 06:06:35     ?

I'm not sure how reliable the stack is, but fwiw it looks like this is
an aio that is waiting on transaction reservation to update a file time.
Transaction reservation is a normal blocking path because log space is a
finite resource. We certainly don't expect to block here for minutes,
but the fact that we don't know anything about your filesystem geometry,
mount options, underlying storage, workload, etc. will make it difficult
for anybody to surmise what the problem could be.
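
As a rough pointer, the journal size being contended for here shows up
in the xfs_info output:

    # the "log =internal log ... blocks=NNN" line is the journal size in
    # filesystem blocks; "/" here is just an example mount point
    xfs_info /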

Please see the following url for what information to include when
reporting a problem:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

I'd also suggest including broader dmesg output and/or xfs_repair output
if there are multiple error reports that aren't exactly the same.
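
A minimal sketch of capturing both (the device name is a placeholder):

    # full kernel log with human-readable timestamps
    dmesg -T > dmesg-$(hostname)-$(date +%F).log

    # read-only consistency check; run with the filesystem unmounted,
    # and replace /dev/sdXN with the real device
    xfs_repair -n /dev/sdXN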

Brian

> ktime_get+0x3e/0xa0xfs_file_dio_aio_write+0xb1/0x230xfs_file_write_iter+0xff/0x150aio_write+0xf6/0x150
> 2019-01-12 06:06:35     ? queue_unplugged+0x25/0xa0
> 2019-01-12 06:06:35     ? kmem_cache_alloc+0xf7/0x570
> 2019-01-12 06:06:35     ? do_io_submit+0x35c/0x690do_io_submit+0x35c/0x690
> 2019-01-12 06:06:35     ?
> do_syscall_64+0x74/0x150do_syscall_64+0x74/0x150entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 2019-01-12 06:06:35     RIP: 0033:0x7f37aa6b6717
> 2019-01-12 06:06:35     RSP: 002b:00007f32953f4228 EFLAGS: 00000206
> ORIG_RAX: 00000000000000d1
> 2019-01-12 06:06:35     RAX: ffffffffffffffda RBX: 00000000000040f6 RCX:
> 00007f37aa6b6717
> 2019-01-12 06:06:35     RDX: 00007f32953f4230 RSI: 0000000000000001 RDI:
> 00007f37aaef9000
> 2019-01-12 06:06:35     RBP: 00007f32953f4250 R08: 00007f32953f4230 R09:
> 0000000000000000
> 2019-01-12 06:06:35     R10: 0000000000010000 R11: 0000000000000206 R12:
> 00007f32953f4370
> 2019-01-12 06:06:35     R13: 00007f37a806d158 R14: 00007f32a5829b98 R15:
> 00007f379df26100
> 
> Thanks,
> 
> Daniel


* Re: XFS filesystem hang
From: Daniel Aberger - Profihost AG @ 2019-01-17 13:50 UTC
  To: Brian Foster; +Cc: linux-xfs, s.priebe, n.fahldieck

Am 17.01.19 um 13:34 schrieb Brian Foster:
> On Thu, Jan 17, 2019 at 12:14:11PM +0100, Daniel Aberger - Profihost AG wrote:
>> Hi,
>>
>> one of our servers crashed / hung several times in the past days with
>> similar errors regarding XFS. Unfortunately we are unable to interprete
>> the call trace of dmesg to pinpoint the exact problem and I hope you
>> could help me to do so.
>>
>> We already ran xfs_repair to ensure that there are no filesystem errors.
>>
>> Here is an example dmesg output of recent crash of the mentioned server:
>>
>> 2019-01-12 06:06:35     INFO: task mysqld:1171 blocked for more than 120
>> seconds.
> 
> This is not a crash but rather an indication from the kernel that this
> task has been sitting here for at least a couple minutes. This doesn't
> tell us whether the task will eventually recover or remain blocked
> indefinitely.
> 

Yes, you're right. Well, I was referring to it as a crash since the
system no longer responds to any input and we have to perform a REISUB
to reboot it.

>> 2019-01-12 06:06:35     Tainted: G 4.12.0+139-ph #1
>> 2019-01-12 06:06:35     "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> 2019-01-12 06:06:35     mysqld D 0 1171 1 0x00080000
>> 2019-01-12 06:06:35     Call Trace:
>> 2019-01-12 06:06:35     ? __schedule+0x3bc/0x820
>> 2019-01-12 06:06:35     ? strlcpy+0x31/0x40
>> 2019-01-12 06:06:35     ?
>> kernfs_path_from_node_locked+0x238/0x320schedule+0x32/0x80xlog_grant_head_wait+0xca/0x200xlog_grant_head_check+0x86/0xe0xfs_log_reserve+0xc7/0x1c0xfs_trans_reserve+0x169/0x1c0xfs_trans_alloc+0xb9/0x130xfs_vn_update_time+0x4e/0x130file_update_time+0xa7/0x100xfs_file_aio_write_checks+0x178/0x1a0
>> 2019-01-12 06:06:35     ?
> 
> I'm not sure how reliable the stack is, but fwiw it looks like this is
> an aio that is waiting on transaction reservation to update a file time.
> Transaction reservation is a normal blocking path because log space is a
> finite resource. We certainly don't expect to block here for minutes,
> but the fact that we don't know anything about your filesystem geometry,
> mount options, underlying storage, workload, etc. will make it difficult
> for anybody to surmise what the problem could be.
> 
> Please see the following url for what information to include when
> reporting a problem:
> 
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> 
> I'd also suggest to include broader dmesg output and/or xfs_repair
> output if there are multiple error reports that aren't exactly the same.
> 

I'm sorry, I somehow missed this page when searching for an FAQ.


* Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
x86_64 GNU/Linux

* xfsprogs version: 4.9.0

* Single CPU, Xeon E5-1650 v3, 6 cores, 12 threads

* /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
found here: https://pastebin.com/cZiTrUDL

* full dmesg output of problem mentioned in the first mail:
https://pastebin.com/pLaz18L1

* a couple more dmesg outputs from the same system with similar behaviour:
 * https://pastebin.com/hWDbwcCr
 * https://pastebin.com/HAqs4yQc

* Raid Controller: Adaptec ASR8805
 * Logical Device 0 is a RAID6 with 5 SSDs (Samsung MZ7KM480), /dev/sda
 * Logical Device 1 is a single backup HDD (Seagate ST2000NX0253) for
internal backups, /dev/sdb

* Output of xfs_repair: https://pastebin.com/JzVMqrLh

The problem seems to happen randomly. There does not have to be high
load on our storage devices. This makes it pretty hard to find an exact
cause.


- Daniel

> Brian
> 
>> ktime_get+0x3e/0xa0xfs_file_dio_aio_write+0xb1/0x230xfs_file_write_iter+0xff/0x150aio_write+0xf6/0x150
>> 2019-01-12 06:06:35     ? queue_unplugged+0x25/0xa0
>> 2019-01-12 06:06:35     ? kmem_cache_alloc+0xf7/0x570
>> 2019-01-12 06:06:35     ? do_io_submit+0x35c/0x690do_io_submit+0x35c/0x690
>> 2019-01-12 06:06:35     ?
>> do_syscall_64+0x74/0x150do_syscall_64+0x74/0x150entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> 2019-01-12 06:06:35     RIP: 0033:0x7f37aa6b6717
>> 2019-01-12 06:06:35     RSP: 002b:00007f32953f4228 EFLAGS: 00000206
>> ORIG_RAX: 00000000000000d1
>> 2019-01-12 06:06:35     RAX: ffffffffffffffda RBX: 00000000000040f6 RCX:
>> 00007f37aa6b6717
>> 2019-01-12 06:06:35     RDX: 00007f32953f4230 RSI: 0000000000000001 RDI:
>> 00007f37aaef9000
>> 2019-01-12 06:06:35     RBP: 00007f32953f4250 R08: 00007f32953f4230 R09:
>> 0000000000000000
>> 2019-01-12 06:06:35     R10: 0000000000010000 R11: 0000000000000206 R12:
>> 00007f32953f4370
>> 2019-01-12 06:06:35     R13: 00007f37a806d158 R14: 00007f32a5829b98 R15:
>> 00007f379df26100
>>
>> Thanks,
>>
>> Daniel


* Re: XFS filesystem hang
From: Dave Chinner @ 2019-01-17 22:05 UTC
  To: Daniel Aberger - Profihost AG
  Cc: Brian Foster, linux-xfs, s.priebe, n.fahldieck

On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
> Am 17.01.19 um 13:34 schrieb Brian Foster:
> > On Thu, Jan 17, 2019 at 12:14:11PM +0100, Daniel Aberger - Profihost AG wrote:
> >> Hi,
> >>
> >> one of our servers crashed / hung several times in the past days with
> >> similar errors regarding XFS. Unfortunately we are unable to interprete
> >> the call trace of dmesg to pinpoint the exact problem and I hope you
> >> could help me to do so.
> >>
> >> We already ran xfs_repair to ensure that there are no filesystem errors.
> >>
> >> Here is an example dmesg output of recent crash of the mentioned server:
> >>
> >> 2019-01-12 06:06:35     INFO: task mysqld:1171 blocked for more than 120
> >> seconds.
> > 
> > This is not a crash but rather an indication from the kernel that this
> > task has been sitting here for at least a couple minutes. This doesn't
> > tell us whether the task will eventually recover or remain blocked
> > indefinitely.
> > 
> 
> Yes, you're right. Well I was referring to it as a crash since the
> system does not respond to any inputs anymore and we have to perform a
> REISUB to reboot it.
> 
> >> 2019-01-12 06:06:35     Tainted: G 4.12.0+139-ph #1
> >> 2019-01-12 06:06:35     "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> 2019-01-12 06:06:35     mysqld D 0 1171 1 0x00080000
> >> 2019-01-12 06:06:35     Call Trace:
> >> 2019-01-12 06:06:35     ? __schedule+0x3bc/0x820
> >> 2019-01-12 06:06:35     ? strlcpy+0x31/0x40
> >> 2019-01-12 06:06:35     ?
> >> kernfs_path_from_node_locked+0x238/0x320schedule+0x32/0x80xlog_grant_head_wait+0xca/0x200xlog_grant_head_check+0x86/0xe0xfs_log_reserve+0xc7/0x1c0xfs_trans_reserve+0x169/0x1c0xfs_trans_alloc+0xb9/0x130xfs_vn_update_time+0x4e/0x130file_update_time+0xa7/0x100xfs_file_aio_write_checks+0x178/0x1a0
> >> 2019-01-12 06:06:35     ?
> > 
> > I'm not sure how reliable the stack is, but fwiw it looks like this is
> > an aio that is waiting on transaction reservation to update a file time.
> > Transaction reservation is a normal blocking path because log space is a
> > finite resource. We certainly don't expect to block here for minutes,
> > but the fact that we don't know anything about your filesystem geometry,
> > mount options, underlying storage, workload, etc. will make it difficult
> > for anybody to surmise what the problem could be.
> > 
> > Please see the following url for what information to include when
> > reporting a problem:
> > 
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> > 
> > I'd also suggest to include broader dmesg output and/or xfs_repair
> > output if there are multiple error reports that aren't exactly the same.
> > 
> 
> I'm sorry, I missed this page on my search for an FAQ somehow.
> 
> 
> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
> x86_64 GNU/Linux

Is that an unmodified distro kernel or one you've patched and built
yourself?

> * xfsprogs version: 4.9.0
> 
> * Single CPU, Xeon E5-1650 v3, 6 cores, 12 threads
> 
> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
> found here: https://pastebin.com/cZiTrUDL

Just some notes as I browse it:
- lots of free memory.
- xfs_info: 1.3TB, 32 AGs, ~700MB log w/ sunit=64 fsbs
  sunit=64 fsbs, swidth=192 fsbs (RAID?)
- mount options: noatime, sunit=512, swidth=1536, usrquota (see the unit
  arithmetic below)
- /dev/sda3 mounted on /
- /dev/sda3 also mounted on /home/tmp (bind mount of something?)
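
A quick sanity check on those units, assuming the usual bsize=4096: the
mount options are in 512-byte sectors, xfs_info reports filesystem
blocks, and the two line up with a 5-disk RAID6 (3 data + 2 parity):

    echo $(( 64  * 4096 / 512 ))   # sunit:  64 fsb  = 512 sectors  (256 KiB)
    echo $(( 192 * 4096 / 512 ))   # swidth: 192 fsb = 1536 sectors (768 KiB = 3 x 256 KiB)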

> * full dmesg output of problem mentioned in the first mail:
> https://pastebin.com/pLaz18L1

No smoking gun.

> * a couple of more dmesg outputs from the same system with similar
> behaviour:
>  * https://pastebin.com/hWDbwcCr
>  * https://pastebin.com/HAqs4yQc

Ok, so mysqld seems to be the problem child here.

> * Raid Controller: Adaptec ASR8805
>  * Logical Device 0 is a RAID6 with 5 SSDs (Samsung MZ7KM480), /dev/sda

Ok, so RAID6 on a RAID controller with a 1GB cache and a small random
write workload?

>  * Logical Device 1 is a single backup HDD (Seagate ST2000NX0253) for
> internal backups, /dev/sdb
> 
> * Output of xfs_repair: https://pastebin.com/JzVMqrLh

Nothing wrong at all. So this looks like it's either a log space
accounting leak (i.e. XFS bug) or something else....

> The problem seems to happen randomly. There does not have to be high
> load on our storage devices. This makes it pretty hard to find an exact
> cause.

... which makes me suspect "something else".

I note that nothing is stalled on reads from the device - it's all
stalled on journal space (i.e. write operations). I've seen similar
things before where the RAID controller is configured as a writeback
cache and the workload just keeps dumping lots of small random writes
into the RAID cache until it fills. Databases do lots of small
writes.

This looks like the device is lightly loaded because writing into
the cache is really fast - right up until it fills and then has to
dump hundreds of thousands of small writes to disk, which on RAID6
each require read-modify-parity-recalc-write cycles to run. This is
really slow, and is why RAID6 is generally a poor choice for database
workloads.

i.e. when the RAID cache fills, new writes stall waiting for a RAID
cache flush - they run at slower than disk speed because they first
have to wait for the RAID controller to complete the flush.

It is only noticed when the filesystem journal stalls because it has
run out of space and is trying to dump a large amount of small
metadata writes to the device to free up journal space. The journal
space won't be freed up until the metadata that pins the tail of the
log has been successfully written. When there are thousands of
metadata IOs needed (a 700MB log can hold several hundred thousand
dirty metadata objects in memory!) to move the log forward, and the
RAID cache is full, it will take a /long time/ and the filesystem
will be stalled until that completes.

I've seen spinning disk RAID setups like this stall for hours while
large RAID caches flush. SSDs should be a lot faster, but we're
still potentially talking about having to do millions of IOs, and the
RAID controller will be CPU bound on parity recalculations. So this
/may/ just be a result of the storage configuration and have nothing
to do with XFS at all.

Which leads me to ask: what is your RAID cache setup - write-thru,
write-back, etc?
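
If the array is managed with Adaptec's arcconf tool (an assumption
based on the ASR8805 model), something along these lines should dump
the per-logical-device cache settings:

    # logical-device configuration for controller 1, including read/write
    # cache state; the controller number is a guess - check "arcconf LIST"
    arcconf GETCONFIG 1 LD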

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS filesystem hang
From: Daniel Aberger - Profihost AG @ 2019-01-18 14:48 UTC
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, s.priebe, n.fahldieck

Am 17.01.19 um 23:05 schrieb Dave Chinner:
> On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
>> Am 17.01.19 um 13:34 schrieb Brian Foster:
>>> On Thu, Jan 17, 2019 at 12:14:11PM +0100, Daniel Aberger - Profihost AG wrote:
>>>> Hi,
>>>>
>>>> one of our servers crashed / hung several times in the past days with
>>>> similar errors regarding XFS. Unfortunately we are unable to interprete
>>>> the call trace of dmesg to pinpoint the exact problem and I hope you
>>>> could help me to do so.
>>>>
>>>> We already ran xfs_repair to ensure that there are no filesystem errors.
>>>>
>>>> Here is an example dmesg output of recent crash of the mentioned server:
>>>>
>>>> 2019-01-12 06:06:35     INFO: task mysqld:1171 blocked for more than 120
>>>> seconds.
>>>
>>> This is not a crash but rather an indication from the kernel that this
>>> task has been sitting here for at least a couple minutes. This doesn't
>>> tell us whether the task will eventually recover or remain blocked
>>> indefinitely.
>>>
>>
>> Yes, you're right. Well I was referring to it as a crash since the
>> system does not respond to any inputs anymore and we have to perform a
>> REISUB to reboot it.
>>
>>>> 2019-01-12 06:06:35     Tainted: G 4.12.0+139-ph #1
>>>> 2019-01-12 06:06:35     "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> 2019-01-12 06:06:35     mysqld D 0 1171 1 0x00080000
>>>> 2019-01-12 06:06:35     Call Trace:
>>>> 2019-01-12 06:06:35     ? __schedule+0x3bc/0x820
>>>> 2019-01-12 06:06:35     ? strlcpy+0x31/0x40
>>>> 2019-01-12 06:06:35     ?
>>>> kernfs_path_from_node_locked+0x238/0x320schedule+0x32/0x80xlog_grant_head_wait+0xca/0x200xlog_grant_head_check+0x86/0xe0xfs_log_reserve+0xc7/0x1c0xfs_trans_reserve+0x169/0x1c0xfs_trans_alloc+0xb9/0x130xfs_vn_update_time+0x4e/0x130file_update_time+0xa7/0x100xfs_file_aio_write_checks+0x178/0x1a0
>>>> 2019-01-12 06:06:35     ?
>>>
>>> I'm not sure how reliable the stack is, but fwiw it looks like this is
>>> an aio that is waiting on transaction reservation to update a file time.
>>> Transaction reservation is a normal blocking path because log space is a
>>> finite resource. We certainly don't expect to block here for minutes,
>>> but the fact that we don't know anything about your filesystem geometry,
>>> mount options, underlying storage, workload, etc. will make it difficult
>>> for anybody to surmise what the problem could be.
>>>
>>> Please see the following url for what information to include when
>>> reporting a problem:
>>>
>>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>>
>>> I'd also suggest to include broader dmesg output and/or xfs_repair
>>> output if there are multiple error reports that aren't exactly the same.
>>>
>>
>> I'm sorry, I missed this page on my search for an FAQ somehow.
>>
>>
>> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
>> x86_64 GNU/Linux
> 
> Is that an unmodified distro kernel or one you've patched and built
> yourself?

Unmodified regarding XFS and any subsystems related to XFS, as I was told.

> 
>> * xfsprogs version: 4.9.0
>>
>> * Single CPU, Xeon E5-1650 v3, 6 cores, 12 threads
>>
>> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
>> found here: https://pastebin.com/cZiTrUDL
> 
> Just  notes as I browse it.
> - lots of free memory.
> - xfs-info: 1.3TB, 32 ags, ~700MB log w/sunit =64fsbs
>   sunit=64 fsbs, swidth=192fsbs (RAID?)
> - mount options: noatime, sunit=512,sunit=1536, usrquota
> - /dev/sda3 mounted on /
> - /dev/sda3 also mounted on /home/tmp (bind mount of something?)
> 
>> * full dmesg output of problem mentioned in the first mail:
>> https://pastebin.com/pLaz18L1
> 
> No smoking gun.
> 
>> * a couple of more dmesg outputs from the same system with similar
>> behaviour:
>>  * https://pastebin.com/hWDbwcCr
>>  * https://pastebin.com/HAqs4yQc
> 
> Ok, so mysqld seems to be the problem child here.
> 

Our MySQL workload on this server is very small except for this time of
day, when our local backup to /backup runs. The highest IO happens
during the night while that backup is being written. The timestamps of
these two outputs suggest that the "mysql dump" phase might have just
started. Unfortunately we only keep the log of the last job, so I can't
confirm that.

I might be able to dig up one or two more dmesg outputs for when these
hangs happen and no backup job is running.

>> * Raid Controller: Adaptec ASR8805
>>  * Logical Device 0 is a RAID6 with 5 SSDs (Samsung MZ7KM480), /dev/sda
> 
> Ok, so RAID6 on a RADI controller with 1GB cache and a small random
> write workload?
> 
>>  * Logical Device 1 is a single backup HDD (Seagate ST2000NX0253) for
>> internal backups, /dev/sdb
>>
>> * Output of xfs_repair: https://pastebin.com/JzVMqrLh
> 
> Nothing wrong at all. So this looks like it's either a log space
> accounting leak (i.e. XFS bug) or something else....
> 
>> The problem seems to happen randomly. There does not have to be high
>> load on our storage devices. This makes it pretty hard to find an exact
>> cause.
> 
> ... which makes me suspect "something else".
> 
> I note that nothing is stalled on reads from the device - it's all
> stalled on journal space (i.e. write operations). I've seen similar
> things before where the RAID controller is configured as a writeback
> cache and the workload just keeps dumping lots of small random writes
> into the RAID cache until it fills. Databases do lots of small
> writes.
> 
> This looks like the device is lightly loaded because writing into
> the cache is really fast. Right up until it fills and then has to
> dump hundreds of thousands of small writes to disk, which on RAID 6
> each require read-modify-raid-recalc-write cycles to run. This is
> really slow, and why RAID6 is generally a poor choice for database
> workloads.
> 
> i.e. when the RAID cache fills, new writes stalls waiting for a RAID
> cache flush - they run at slower than disk speed because they have
> to first wait for the RAID controller to comple.
> 
> It is is only noticed when the filesystem journal stalls
> because it's run out of space and is trying to dump a large amount
> of small metadata writes to the device to free up journal space. The
> journal space won't be freed up until the metadata that pins the
> tail of the log has been successfully written. When there's
> thousands of metadata IOs needed (a 700MB log can hold several
> hundred thousand dirty metadata objects in memory!) to move th elog
> forward, and the RAID cache is full, it will take a /long time/ and
> the filesystem will be stalled until that completes.
> 
> I've seen spinning disk RAID setups like this stall for hours while
> large RAID caches flush. SSDs should be a lot faster, but we're
> still potentially talking about having to do millions of IOs and the
> RAID controller will be CPU bound on parity recalculatoins.  So this
> /may/ just be a result of the storage configuration and have nothing
> to do with XFS at all.
> 
> Which leads me to ask: what is your RAID cache setup - write-thru,
> write-back, etc?
> 

Our RAID6 cache configuration:

   Read-cache setting                       : Disabled
   Read-cache status                        : Off
   Write-cache setting                      : Disabled
   Write-cache status                       : Off

Full Configuration: https://pastebin.com/PdGatDY4


- Daniel


* Re: XFS filesystem hang
From: Dave Chinner @ 2019-01-19  0:19 UTC
  To: Daniel Aberger - Profihost AG
  Cc: Brian Foster, linux-xfs, s.priebe, n.fahldieck

On Fri, Jan 18, 2019 at 03:48:46PM +0100, Daniel Aberger - Profihost AG wrote:
> Am 17.01.19 um 23:05 schrieb Dave Chinner:
> > On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
> >> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
> >> x86_64 GNU/Linux
> > 
> > Is that an unmodified distro kernel or one you've patched and built
> > yourself?
> 
> Unmodified regarding XFS and any subsystems related to XFS, as I was
> being told.

That doesn't answer my question - has the kernel been patched (and
what with) or is it a completely unmodified upstream kernel?

> >> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
> >> found here: https://pastebin.com/cZiTrUDL
> > 
> > Just  notes as I browse it.
> > - lots of free memory.
> > - xfs-info: 1.3TB, 32 ags, ~700MB log w/sunit =64fsbs
> >   sunit=64 fsbs, swidth=192fsbs (RAID?)
> > - mount options: noatime, sunit=512,sunit=1536, usrquota
> > - /dev/sda3 mounted on /
> > - /dev/sda3 also mounted on /home/tmp (bind mount of something?)
> > 
> >> * full dmesg output of problem mentioned in the first mail:
> >> https://pastebin.com/pLaz18L1
> > 
> > No smoking gun.
> > 
> >> * a couple of more dmesg outputs from the same system with similar
> >> behaviour:
> >>  * https://pastebin.com/hWDbwcCr
> >>  * https://pastebin.com/HAqs4yQc
> > 
> > Ok, so mysqld seems to be the problem child here.
> > 
> 
> Our MySQL workload on this server is very small except for this time of
> the day because our local backup to /backup happens during this time.
> The highest IO happens during the night when our local backup is being
> written. The timestamps of these two outputs suggest that the "mysql
> dump" phase might just have been started. Unfortunately we only keep the
> log of the last job, so I can't confirm that.

Ok, so you've just started loading up the btrfs volume that is also
attached to the same raid controller, which does have raid caches
enabled....

I wonder if that has anything to do with it?

Best would be to capture iostat output for both luns (as per the
FAQ) when the problem workload starts.
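
Something along these lines would do (a sketch - adjust the interval
and device names as needed):

    # extended per-device stats in MB/s, sampled every 5 seconds,
    # for both the XFS lun and the backup lun
    iostat -d -m -x 5 sda sdb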

> > Which leads me to ask: what is your RAID cache setup - write-thru,
> > write-back, etc?
> > 
> 
> Our RAID6 cache configuration:
> 
>    Read-cache setting                       : Disabled
>    Read-cache status                        : Off
>    Write-cache setting                      : Disabled
>    Write-cache status                       : Off

Ok, so read caching is turned off, which means it likely won't even
be caching stripes between modifications. May not be very efficient,
but hard to say if it's the problem or not.

> Full Configuration: https://pastebin.com/PdGatDY4

Yeah, caching is enabled on the backup btrfs lun, so there may be
interaction issues. Is the backup device idle (or stalling) at the
same time that the XFS messages are being issued?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: XFS filesystem hang
From: Daniel Aberger - Profihost AG @ 2019-01-21 14:59 UTC
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, s.priebe, n.fahldieck

Am 19.01.19 um 01:19 schrieb Dave Chinner:
> On Fri, Jan 18, 2019 at 03:48:46PM +0100, Daniel Aberger - Profihost AG wrote:
>> Am 17.01.19 um 23:05 schrieb Dave Chinner:
>>> On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
>>>> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
>>>> x86_64 GNU/Linux
>>>
>>> Is that an unmodified distro kernel or one you've patched and built
>>> yourself?
>>
>> Unmodified regarding XFS and any subsystems related to XFS, as I was
>> being told.
> 
> That doesn't answer my question - has the kernel been patched (and
> what with) or is it a completely unmodified upstream kernel?
> 

The kernel we were running was OpenSUSE SLE15 based on commit
6c5c7489089608d89b7ce310bca44812e2b0a4a5.

https://github.com/openSUSE/kernel


>>>> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
>>>> found here: https://pastebin.com/cZiTrUDL
>>>
>>> Just  notes as I browse it.
>>> - lots of free memory.
>>> - xfs-info: 1.3TB, 32 ags, ~700MB log w/sunit =64fsbs
>>>   sunit=64 fsbs, swidth=192fsbs (RAID?)
>>> - mount options: noatime, sunit=512,sunit=1536, usrquota
>>> - /dev/sda3 mounted on /
>>> - /dev/sda3 also mounted on /home/tmp (bind mount of something?)
>>>
>>>> * full dmesg output of problem mentioned in the first mail:
>>>> https://pastebin.com/pLaz18L1
>>>
>>> No smoking gun.
>>>
>>>> * a couple of more dmesg outputs from the same system with similar
>>>> behaviour:
>>>>  * https://pastebin.com/hWDbwcCr
>>>>  * https://pastebin.com/HAqs4yQc
>>>
>>> Ok, so mysqld seems to be the problem child here.
>>>
>>
>> Our MySQL workload on this server is very small except for this time of
>> the day because our local backup to /backup happens during this time.
>> The highest IO happens during the night when our local backup is being
>> written. The timestamps of these two outputs suggest that the "mysql
>> dump" phase might just have been started. Unfortunately we only keep the
>> log of the last job, so I can't confirm that.
> 
> Ok, so you've just started loading up the btrfs volume that is also
> attached to the same raid controller, which does have raid caches
> enabled....
> 
> I wonder if that has anything to do with it?
> 

Do you suggest changing any caching options?

> Best would be to capture iostat output for both luns (as per the
> FAQ) when the problem workload starts.
> 

What I can give you so far is two Grafana I/O activity screenshots
corresponding to two of the dmesg outputs above.

https://imgur.com/a/3lL776U


>>> Which leads me to ask: what is your RAID cache setup - write-thru,
>>> write-back, etc?
>>>
>>
>> Our RAID6 cache configuration:
>>
>>    Read-cache setting                       : Disabled
>>    Read-cache status                        : Off
>>    Write-cache setting                      : Disabled
>>    Write-cache status                       : Off
> 
> Ok, so read caching is turned off, which means it likely won't even
> be caching stripes between modifications. May not be very efficient,
> but hard to say if it's the problem or not.
> 
>> Full Configuration: https://pastebin.com/PdGatDY4
> 
> Yeah, caching is enabled on the backup btrfs lun, so there may be
> interaction issues. Is the backup device idle (or stalling) at the
> same time that the XFS messages are being issued?

In 2 out of 3 cases it happened while the backup job was running, which
starts at 0:10 am and finishes roughly between 2:30 and 3:30 am on this
particular machine. So it wasn't idle.

The MySQL dumping phase takes about 20 to 25 minutes and happens at the
end of the backup job.


* Re: XFS filesystem hang
From: Stefan Priebe - Profihost AG @ 2019-02-10 18:52 UTC
  To: d.aberger, Dave Chinner; +Cc: Brian Foster, linux-xfs, n.fahldieck

Dear Dave,

we're still seeing the issue below. I have verified that we also see it on:
1.) systems without any RAID or controller (pure qemu/kvm with a ceph rbd
backend)

2.) systems where no btrfs is running and no backup jobs are involved

To me this looks like an XFS lockup. I found the same trace here:
"https://access.redhat.com/solutions/2964341"

but I do not have access to this subscriber content.

Greets,
Stefan

Am 21.01.19 um 15:59 schrieb Daniel Aberger - Profihost AG:
> Am 19.01.19 um 01:19 schrieb Dave Chinner:
>> On Fri, Jan 18, 2019 at 03:48:46PM +0100, Daniel Aberger - Profihost AG wrote:
>>> Am 17.01.19 um 23:05 schrieb Dave Chinner:
>>>> On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
>>>>> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
>>>>> x86_64 GNU/Linux
>>>>
>>>> Is that an unmodified distro kernel or one you've patched and built
>>>> yourself?
>>>
>>> Unmodified regarding XFS and any subsystems related to XFS, as I was
>>> being told.
>>
>> That doesn't answer my question - has the kernel been patched (and
>> what with) or is it a completely unmodified upstream kernel?
>>
> 
> The kernel we were running was OpenSUSE SLE15 based on commit
> 6c5c7489089608d89b7ce310bca44812e2b0a4a5.
> 
> https://github.com/openSUSE/kernel
> 
> 
>>>>> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
>>>>> found here: https://pastebin.com/cZiTrUDL
>>>>
>>>> Just  notes as I browse it.
>>>> - lots of free memory.
>>>> - xfs-info: 1.3TB, 32 ags, ~700MB log w/sunit =64fsbs
>>>>   sunit=64 fsbs, swidth=192fsbs (RAID?)
>>>> - mount options: noatime, sunit=512,sunit=1536, usrquota
>>>> - /dev/sda3 mounted on /
>>>> - /dev/sda3 also mounted on /home/tmp (bind mount of something?)
>>>>
>>>>> * full dmesg output of problem mentioned in the first mail:
>>>>> https://pastebin.com/pLaz18L1
>>>>
>>>> No smoking gun.
>>>>
>>>>> * a couple of more dmesg outputs from the same system with similar
>>>>> behaviour:
>>>>>  * https://pastebin.com/hWDbwcCr
>>>>>  * https://pastebin.com/HAqs4yQc
>>>>
>>>> Ok, so mysqld seems to be the problem child here.
>>>>
>>>
>>> Our MySQL workload on this server is very small except for this time of
>>> the day because our local backup to /backup happens during this time.
>>> The highest IO happens during the night when our local backup is being
>>> written. The timestamps of these two outputs suggest that the "mysql
>>> dump" phase might just have been started. Unfortunately we only keep the
>>> log of the last job, so I can't confirm that.
>>
>> Ok, so you've just started loading up the btrfs volume that is also
>> attached to the same raid controller, which does have raid caches
>> enabled....
>>
>> I wonder if that has anything to do with it?
>>
> 
> Do you suggest to change any caching options?
> 
>> Best would be to capture iostat output for both luns (as per the
>> FAQ) when the problem workload starts.
>>
> 
> What I can give you so far is two I/O activity screenshots of Grafana of
> two of the dmesg outputs above.
> 
> https://imgur.com/a/3lL776U
> 
> 
>>>> Which leads me to ask: what is your RAID cache setup - write-thru,
>>>> write-back, etc?
>>>>
>>>
>>> Our RAID6 cache configuration:
>>>
>>>    Read-cache setting                       : Disabled
>>>    Read-cache status                        : Off
>>>    Write-cache setting                      : Disabled
>>>    Write-cache status                       : Off
>>
>> Ok, so read caching is turned off, which means it likely won't even
>> be caching stripes between modifications. May not be very efficient,
>> but hard to say if it's the problem or not.
>>
>>> Full Configuration: https://pastebin.com/PdGatDY4
>>
>> Yeah, caching is enabled on the backup btrfs lun, so there may be
>> interaction issues. Is the backup device idle (or stalling) at the
>> same time that the XFS messages are being issued?
> 
> In 2 out of 3 cases it happened while the backup job was running, which
> starts at 0:10 am and finishes roughly between 2:30 and 3:30 am on this
> particular machine. So it wasn't idle.
> 
> The MySQL dumping phase takes about 20 to 25 minutes and happens at the
> end of the backup job.
> 

