Date: Fri, 18 Jan 2019 09:05:12 +1100
From: Dave Chinner
Subject: Re: XFS filesystem hang
Message-ID: <20190117220512.GE6173@dastard>
References: <645a3d40-47a5-0f7a-7565-821bbce103f2@profihost.ag>
	<20190117123427.GA37591@bfoster>
	<8cf51b2c-6e9b-317a-16fc-9e47f916198e@profihost.ag>
In-Reply-To: <8cf51b2c-6e9b-317a-16fc-9e47f916198e@profihost.ag>
List-Id: xfs
To: Daniel Aberger - Profihost AG
Cc: Brian Foster, linux-xfs@vger.kernel.org, s.priebe@profihost.ag,
	n.fahldieck@profihost.ag

On Thu, Jan 17, 2019 at 02:50:23PM +0100, Daniel Aberger - Profihost AG wrote:
> On 17.01.19 at 13:34, Brian Foster wrote:
> > On Thu, Jan 17, 2019 at 12:14:11PM +0100, Daniel Aberger - Profihost AG wrote:
> >> Hi,
> >>
> >> one of our servers crashed / hung several times in the past days with
> >> similar errors regarding XFS. Unfortunately we are unable to interpret
> >> the call trace of dmesg to pinpoint the exact problem and I hope you
> >> could help me to do so.
> >>
> >> We already ran xfs_repair to ensure that there are no filesystem errors.
> >>
> >> Here is an example dmesg output of a recent crash of the mentioned server:
> >>
> >> 2019-01-12 06:06:35 INFO: task mysqld:1171 blocked for more than 120
> >> seconds.
> >
> > This is not a crash but rather an indication from the kernel that this
> > task has been sitting here for at least a couple minutes. This doesn't
> > tell us whether the task will eventually recover or remain blocked
> > indefinitely.
> >
>
> Yes, you're right. Well, I was referring to it as a crash since the
> system does not respond to any inputs anymore and we have to perform a
> REISUB to reboot it.
>
> >> 2019-01-12 06:06:35 Tainted: G 4.12.0+139-ph #1
> >> 2019-01-12 06:06:35 "echo 0 >
> >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> 2019-01-12 06:06:35 mysqld D 0 1171 1 0x00080000
> >> 2019-01-12 06:06:35 Call Trace:
> >> 2019-01-12 06:06:35 ? __schedule+0x3bc/0x820
> >> 2019-01-12 06:06:35 ? strlcpy+0x31/0x40
> >> 2019-01-12 06:06:35 ?
> >> kernfs_path_from_node_locked+0x238/0x320
> >> schedule+0x32/0x80
> >> xlog_grant_head_wait+0xca/0x200
> >> xlog_grant_head_check+0x86/0xe0
> >> xfs_log_reserve+0xc7/0x1c0
> >> xfs_trans_reserve+0x169/0x1c0
> >> xfs_trans_alloc+0xb9/0x130
> >> xfs_vn_update_time+0x4e/0x130
> >> file_update_time+0xa7/0x100
> >> xfs_file_aio_write_checks+0x178/0x1a0
> >> 2019-01-12 06:06:35 ?
> >
> > I'm not sure how reliable the stack is, but fwiw it looks like this is
> > an aio that is waiting on transaction reservation to update a file time.
> > Transaction reservation is a normal blocking path because log space is a
> > finite resource. We certainly don't expect to block here for minutes,
> > but the fact that we don't know anything about your filesystem geometry,
> > mount options, underlying storage, workload, etc. will make it difficult
> > for anybody to surmise what the problem could be.
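
(As an aside: when the machine wedges like this again, a dump of
everything stuck in D state is far more useful than a single hung-task
warning. A minimal sketch, assuming sysrq is enabled and you have a
serial console, netconsole or persistent logging to catch the output:)

  # how long a task must be blocked before the kernel warns about it
  cat /proc/sys/kernel/hung_task_timeout_secs

  # dump all tasks in uninterruptible (D) sleep, with stack traces
  echo w > /proc/sysrq-trigger

  # dump the state of every task - much noisier, but more complete
  echo t > /proc/sysrq-trigger
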
> >
> > Please see the following url for what information to include when
> > reporting a problem:
> >
> > http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >
> > I'd also suggest to include broader dmesg output and/or xfs_repair
> > output if there are multiple error reports that aren't exactly the same.
> >
>
> I'm sorry, I missed this page on my search for an FAQ somehow.
>
> * Kernel Version: Linux 4.12.0+139-ph #1 SMP Tue Jan 1 21:46:16 UTC 2019
> x86_64 GNU/Linux

Is that an unmodified distro kernel or one you've patched and built
yourself?

> * xfsprogs version: 4.9.0
>
> * Single CPU, Xeon E5-1650 v3, 6 cores, 12 threads
>
> * /proc/meminfo, /proc/mounts, /proc/partitions and xfs_info can be
> found here: https://pastebin.com/cZiTrUDL

Just notes as I browse it:

- lots of free memory
- xfs_info: 1.3TB, 32 AGs, ~700MB log w/ log sunit=64 fsbs;
  data sunit=64 fsbs, swidth=192 fsbs (RAID?)
- mount options: noatime, sunit=512, swidth=1536, usrquota
- /dev/sda3 mounted on /
- /dev/sda3 also mounted on /home/tmp (bind mount of something?)

> * full dmesg output of the problem mentioned in the first mail:
> https://pastebin.com/pLaz18L1

No smoking gun.

> * a couple more dmesg outputs from the same system with similar
> behaviour:
> * https://pastebin.com/hWDbwcCr
> * https://pastebin.com/HAqs4yQc

Ok, so mysqld seems to be the problem child here.

> * Raid Controller: Adaptec ASR8805
> * Logical Device 0 is a RAID6 with 5 SSDs (Samsung MZ7KM480), /dev/sda

Ok, so RAID6 on a RAID controller with 1GB cache and a small random
write workload?

> * Logical Device 1 is a single backup HDD (Seagate ST2000NX0253) for
> internal backups, /dev/sdb
>
> * Output of xfs_repair: https://pastebin.com/JzVMqrLh

Nothing wrong at all. So this looks like it's either a log space
accounting leak (i.e. XFS bug) or something else....

> The problem seems to happen randomly. There does not have to be high
> load on our storage devices. This makes it pretty hard to find an exact
> cause.

... which makes me suspect "something else". I note that nothing is
stalled on reads from the device - it's all stalled on journal space
(i.e. write operations).

I've seen similar things before where the RAID controller is configured
as a writeback cache and the workload just keeps dumping lots of small
random writes into the RAID cache until it fills. Databases do lots of
small writes.

This looks like the device is lightly loaded because writing into the
cache is really fast. Right up until it fills, when it has to dump
hundreds of thousands of small writes to disk, which on RAID6 each
require read-modify-raid-recalc-write cycles to run. This is really
slow, and it is why RAID6 is generally a poor choice for database
workloads.

i.e. when the RAID cache fills, new writes stall waiting for a RAID
cache flush - they run at slower than disk speed because they have to
first wait for the RAID controller to complete enough of the cache
flush to make room for them.

It is only noticed when the filesystem journal stalls because it has
run out of space and is trying to dump a large number of small metadata
writes to the device to free up journal space. The journal space won't
be freed up until the metadata that pins the tail of the log has been
successfully written. When there are thousands of metadata IOs needed
to move the log forward (a 700MB log can hold several hundred thousand
dirty metadata objects in memory!), and the RAID cache is full, it will
take a /long time/ and the filesystem will be stalled until that
completes.
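
(To put rough numbers on that - purely a back-of-envelope sketch, where
the object count, the per-write RMW cost and the sustained IOPS figure
are all assumptions, not measurements from this system:)

  # say the log tail is pinned by ~300,000 dirty metadata buffers
  objects=300000
  # a small partial-stripe write on RAID6 is roughly a 6-IO
  # read-modify-write: read old data + P + Q, write new data + P + Q
  rmw_ios=6
  # assume ~20,000 sustained backend IOPS once the cache is full and
  # the controller is recalculating parity for every write
  iops=20000
  echo "$((objects * rmw_ios)) backend IOs, ~$((objects * rmw_ios / iops))s to drain"
  # -> 1800000 backend IOs, ~90s to drain

(Even with those fairly generous assumptions the flush sits squarely in
hung-task-warning territory; halve the IOPS or double the object count
and the stall doubles.)
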
I've seen spinning disk RAID setups like this stall for hours while
large RAID caches flush. SSDs should be a lot faster, but we're still
potentially talking about having to do millions of IOs, and the RAID
controller will be CPU bound on parity recalculations.

So this /may/ just be a result of the storage configuration and have
nothing to do with XFS at all. Which leads me to ask: what is your
RAID cache setup - write-thru, write-back, etc?

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
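
(A practical note on checking that last question: the ASR8805 is
managed with Adaptec's arcconf utility, and the logical-device cache
mode shows up in its GETCONFIG output. Field names differ between
arcconf versions, so treat the following as an assumption to adapt
rather than a recipe:)

  # show controller and logical device configuration, including the
  # read-cache and write-cache mode of each logical device
  arcconf GETCONFIG 1 LD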