Re: rsync: page allocation stalls in kernel 4.9.10 to a VessRAID NAS

From: Michal Hocko <mhocko@kernel.org>
To: Robert Kudyba <rkudyba@fordham.edu>
Cc: linux-kernel@vger.kernel.org
Subject: Re: rsync: page allocation stalls in kernel 4.9.10 to a VessRAID NAS
Date: Wed, 1 Mar 2017 18:36:26 +0100	[thread overview]
Message-ID: <20170301173625.GA20360@dhcp22.suse.cz> (raw)
In-Reply-To: <F77DA4E6-EF9B-427D-8FE9-9FB940A9B009@fordham.edu>

On Wed 01-03-17 10:55:33, Robert Kudyba wrote:
> 
> > On Mar 1, 2017, at 3:06 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Tue 28-02-17 14:32:18, Robert Kudyba wrote:
> >> 
> >>> On Feb 28, 2017, at 11:56 AM, Michal Hocko <mhocko@kernel.org <mailto:mhocko@kernel.org>> wrote:
> > [...]
> >>>> Will do here’s a perf report:
> >>> 
> >>> this will not tell us much. Tracepoints have much better chance to tell
> >>> us how reclaim is progressing.
> >> 
> >> I have SystemTap configured are there any scripts in the
> >> SystemTap_Beginners_Guide.pdf that I can run to help? Sorry I’m
> >> brand new to tracepoints.
> >> 
> > 
> > I am not familiar with systemtap much. What I meant was to
> > mount -t tracefs none /trace
> > echo 1 > /trace/events/vmscan/enable
> 
> OK I did this is there another step?

Yeah, you have to read the actual tracing data. Sorry for not beaing
clear enough

cat /trace/trace_pipe > output

> >> I do see these “vmscan” from this command:
> >> stap -L 'kernel.trace("*")'|sort
> >> 
> >> kernel.trace("vmscan:mm_shrink_slab_end") $shr:struct shrinker* $nid:int $shrinker_retval:int $unused_scan_cnt:long int $new_scan_cnt:long int $total_scan:long int
> >> kernel.trace("vmscan:mm_shrink_slab_start") $shr:struct shrinker* $sc:struct shrink_control* $nr_objects_to_shrink:long int $pgs_scanned:long unsigned int $lru_pgs:long unsigned int $cache_items:long unsigned int $delta:long long unsigned int $total_scan:long unsigned int
> >> kernel.trace("vmscan:mm_vmscan_direct_reclaim_begin") $order:int $may_writepage:int $gfp_flags:gfp_t $classzone_idx:int
> >> kernel.trace("vmscan:mm_vmscan_direct_reclaim_end") $nr_reclaimed:long unsigned int
> >> kernel.trace("vmscan:mm_vmscan_kswapd_sleep") $nid:int
> >> kernel.trace("vmscan:mm_vmscan_kswapd_wake") $nid:int $zid:int $order:int
> >> kernel.trace("vmscan:mm_vmscan_lru_isolate") $classzone_idx:int $order:int $nr_requested:long unsigned int $nr_scanned:long unsigned int $nr_taken:long unsigned int $isolate_mode:isolate_mode_t $file:int
> >> kernel.trace("vmscan:mm_vmscan_lru_shrink_inactive") $nid:int $nr_scanned:long unsigned int $nr_reclaimed:long unsigned int $priority:int $file:int
> >> kernel.trace("vmscan:mm_vmscan_memcg_isolate") $classzone_idx:int $order:int $nr_requested:long unsigned int $nr_scanned:long unsigned int $nr_taken:long unsigned int $isolate_mode:isolate_mode_t $file:int
> >> kernel.trace("vmscan:mm_vmscan_memcg_reclaim_begin") $order:int $may_writepage:int $gfp_flags:gfp_t $classzone_idx:int
> >> kernel.trace("vmscan:mm_vmscan_memcg_reclaim_end") $nr_reclaimed:long unsigned int
> >> kernel.trace("vmscan:mm_vmscan_memcg_softlimit_reclaim_begin") $order:int $may_writepage:int $gfp_flags:gfp_t $classzone_idx:int
> >> kernel.trace("vmscan:mm_vmscan_memcg_softlimit_reclaim_end") $nr_reclaimed:long unsigned int
> >> kernel.trace("vmscan:mm_vmscan_wakeup_kswapd") $nid:int $zid:int $order:int
> >> kernel.trace("vmscan:mm_vmscan_writepage") $page:struct page*
> > 
> > this looks like it would achieve the same
> 
> Is there anything else I can provide?

I am not familiar with filesystems and their tracepoints which might
tell us more

[...]
> Mar  1 06:30:59 curie kernel: kworker/u16:1: 
> Mar  1 06:30:59 curie kernel: kthreadd: page allocation stalls for 10197ms, order:1
> Mar  1 06:30:59 curie kernel: page allocation stalls for 11274ms, order:1

OK, so unlike in the previous situation, this is higher order request
(aka 2 physically contiguous pages).

> Mar  1 06:31:02 curie kernel: Normal free:130224kB min:128484kB low:160604kB high:192724kB active_anon:0kB inactive_anon:20kB active_file:308296kB inactive_file:3864kB unevictable:0kB writepending:0kB present:892920kB managed:791152kB mlocked:0kB slab_reclaimable:271984kB slab_unreclaimable:35880kB kernel_stack:1808kB pagetables:0kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB

In this case there is a lot of page cache to be reaclaimed. Most of it
on the active LRU list. There was a bug which could result in this fixed
by b4536f0c829c ("mm, memcg: fix the active list aging for lowmem
requests when memcg is enabled") which made it into the stable tree
4.9.5 but you have said that you are using 4.9.12 so you should already
have it. So it seems that this pagecache is indeed activate based on the
usage pattern.

The rest of the report is rather messed up but I assume that you simply
do not have contiguous memory in the lowmem. This surely sounds like a
32b specific problem which is not reasonably fixable.
-- 
Michal Hocko
SUSE Labs