Re: Need some assistance/direction in determining a system hang during heavy IO (resolved)

From: Cheyenne Wills @ 2017-10-26 21:48 UTC
  To: Roman Mamedov; +Cc: linux-btrfs

On Thu, Oct 26, 2017 at 11:41 AM, Roman Mamedov <rm@romanrm.net> wrote:
> On Thu, 26 Oct 2017 09:40:19 -0600
> Cheyenne Wills <cheyenne.wills@gmail.com> wrote:
>
>> Briefly when I upgraded a system from 4.0.5 kernel to 4.9.5 (and
>> later) I'm seeing a blocked task timeout with heavy IO against a
>> multi-lun btrfs filesystem.  I've tried a 4.12.12 kernel and am still
>> getting the hang.
>
> There is now 4.9.58 (fifty three versions later!) and 4.12 series is long
> abandoned and gone from the charts altogether. So just in case, did you check
> with the latest kernels?
>
> Also, keep in mind the 120 second warnings are just that, and not an error
> condition by themselves. You can disable them or increase the maximum timeout
> in sysctl settings. And it is not clear from your reports if you only get
> warnings and after the load subsides everything is back to normal, or the FS
> locks out "for good", i.e. with all access attempts hanging indefinitely and
> no way to unmount the FS or otherwise recover.
>
> --
> With respect,
> Roman

Just tried a 4.13 kernel and it appears to have fixed the problem (at
least the scrub hasn't locked up).

Because the system didn't lock up, I was able to obtain some additional
information, and it appears that the core problem was a shortage of Xen
grant table frames.  After increasing the gnttab_max_frames value on the
Xen host, I could no longer trigger a system hang, even with some of the
prior kernels (well, at least with a 4.12.12 kernel).

I ended up closing the above-mentioned issue.  I included some of the
information I found in it, so that other folks who hit the same problem
have some discussion of a possible solution.

With the 4.13 kernel the system did not hang, but I did get an error
message about the grant tables.  Searching on that message, I was able
to find a discussion on

"I/O to LUNs hang / stall under high load when using xen-blkfront"

It turns out that the number of grant table frames needed is related to
the number of devices attached to a Xen guest.
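
To illustrate why the device count matters, here is a back-of-the-envelope
sketch in Python.  It is not taken from the discussion mentioned above, and
every constant in it is an assumption about a typical xen-blkfront setup:
512 grant entries per 4 KiB frame, 32 requests per ring, 11 segments per
request, and one extra grant for the ring page itself.  The function names
are made up for the example, and it ignores indirect descriptors and
network devices, so treat the result as a rough lower bound at best.

```python
import math

# All constants below are assumptions for illustration, not measured values.
GRANTS_PER_FRAME = 512      # assumed: 4096-byte frame / 8-byte grant entry
RING_REQUESTS = 32          # assumed: requests in one single-page blkif ring
SEGMENTS_PER_REQUEST = 11   # assumed: segments per request, no indirect descriptors


def grants_per_device(queues=1):
    """Grants one block device may hold if every ring slot is in use.

    Each queue has its own ring; one extra grant per queue covers the
    shared ring page itself.
    """
    return queues * (RING_REQUESTS * SEGMENTS_PER_REQUEST + 1)


def frames_needed(devices, queues=1):
    """Grant table frames needed to back all grants of `devices` devices."""
    total_grants = devices * grants_per_device(queues)
    return math.ceil(total_grants / GRANTS_PER_FRAME)


if __name__ == "__main__":
    # Compare a few guest sizes, assuming 4 queues per device.
    for devices in (4, 16, 48):
        print(devices, "devices ->", frames_needed(devices, queues=4), "frames")
```

Under these assumptions the frame count grows linearly with the number of
attached devices, so a guest with many LUNs can exceed a small
gnttab_max_frames limit under heavy IO even though a guest with a few
devices never would.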

Thanks for the assistance :)

Cheyenne Wills
