From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751485AbdB1Q5C (ORCPT <rfc822;w@1wt.eu>);
        Tue, 28 Feb 2017 11:57:02 -0500
Received: from mx2.suse.de ([195.135.220.15]:50064 "EHLO mx2.suse.de"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751356AbdB1Q5A (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Tue, 28 Feb 2017 11:57:00 -0500
Date: Tue, 28 Feb 2017 17:56:39 +0100
From: Michal Hocko <mhocko@kernel.org>
To: Robert Kudyba <rkudyba@fordham.edu>
Cc: linux-kernel@vger.kernel.org
Subject: Re: rsync: page allocation stalls in kernel 4.9.10 to a VessRAID NAS
Message-ID: <20170228165638.GA27726@dhcp22.suse.cz>
References: <C16ACE34-A2F0-4A9F-BFBD-E369733A214F@fordham.edu>
 <20170228141520.GA28139@dhcp22.suse.cz>
 <40F07E96-7468-4355-B8EA-4B42F575ACAB@fordham.edu>
 <20170228144045.GD26792@dhcp22.suse.cz>
 <3E4C7821-A93D-4956-A0E0-730BEC67C9F0@fordham.edu>
 <20170228151535.GE26792@dhcp22.suse.cz>
 <63A3D887-EEDA-46D2-AB59-D5955FC3D23D@fordham.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <63A3D887-EEDA-46D2-AB59-D5955FC3D23D@fordham.edu>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 28-02-17 11:19:33, Robert Kudyba wrote:
> 
> > On Feb 28, 2017, at 10:15 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Tue 28-02-17 09:59:35, Robert Kudyba wrote:
> >> 
> >>> On Feb 28, 2017, at 9:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>> 
> >>> On Tue 28-02-17 09:33:49, Robert Kudyba wrote:
> >>>> 
> >>>>> On Feb 28, 2017, at 9:15 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>>>> and this one is hitting the min watermark while there is not really
> >>>>> much to reclaim. Only the page cache which might be pinned and not
> >>>>> reclaimable from this context because this is GFP_NOFS request. It is
> >>>>> not all that surprising the reclaim context fights to get some memory.
> >>>>> There is a huge amount of reclaimable slab, which probably just
> >>>>> makes slow progress.
> >>>>> 
> >>>>> That is not something completely surprising on a 32b system I am afraid.
> >>>>> 
> >>>>> Btw. is the stall repeating with the increased time or it gets resolved
> >>>>> eventually?
> >>>> 
> >>>> Yes and if you mean by repeating it’s not only affecting rsync but
> >>>> you can see just now automount and NetworkManager get these page
> >>>> allocation stalls and kswapd0 is getting heavy CPU load, are there any
> >>>> other settings I can adjust?
> >>> 
> >>> None that I am aware of. You might want to talk to FS guys, maybe they
> >>> can figure out who is pinning file pages so that they cannot be
> >>> reclaimed. They do not seem to be dirty or under writeback. It would be
> >>> also interesting to see whether that is a regression. The warning is
> >>> relatively new so you might have had this problem before just haven't
> >>> noticed it.
> >> 
> >> We have been getting out of memory errors for a while but those seem
> >> to have gone away.
> > 
> > this sounds suspicious. Are you really sure that this is a new problem?
> > Btw. is there any reason to use 32b kernel at all? It will always suffer
> > from a really small lowmem…
> 
> No this has been a problem for a while. Not sure if this server can
> handle 64b it’s a bit old.

Ok, this is unfortunate. I am afraid there is usually not much interest
in fixing 32b issues that are inherent to the memory model in use and
are not fixable regressions.
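
Whether the box can run a 64b kernel at all is easy to check. A minimal
sketch, assuming an x86 machine where /proc/cpuinfo has a "flags" line:
the "lm" (long mode) flag there means the CPU is 64b capable. The
function name is made up for illustration.

```python
def cpu_supports_64bit(cpuinfo_text):
    """Return True if a 'flags' line lists 'lm' (x86 long mode).

    cpuinfo_text is the content of /proc/cpuinfo. Assumes the x86
    layout, where each CPU has a line of the form 'flags : fpu vme ...'.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the exact 'flags' key and compare whole flag names only,
        # so that e.g. 'lm' inside another word does not count.
        if sep and key.strip() == "flags" and "lm" in value.split():
            return True
    return False
```

To use it, read /proc/cpuinfo on the server and pass the text in, e.g.
cpu_supports_64bit(open("/proc/cpuinfo").read()).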

> >> We did just replace the controller in the VessRAID
> >> as there were some timeouts observed and multiple login/logout
> >> attempts.
> >> 
> >> By FS guys do you mean the linux-fsdevel or linux-fsf list?
> > 
> > yeah linux-fsdevel. No idea what linux-fsf is. It would be great if you
> > could collect some tracepoints before reporting the issue. At least
> > those in events/vmscan/*.
> 
> Will do here’s a perf report:

This will not tell us much. Tracepoints have a much better chance of
telling us how reclaim is progressing.
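
Something along these lines should do for collecting events/vmscan/*.
It is only a sketch under a few assumptions: tracefs is mounted at one
of the two usual places, the script runs as root, and the helper names
and the output file name are placeholders of mine.

```python
import os

# Usual tracefs locations; older setups expose it only under debugfs.
TRACEFS_CANDIDATES = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def find_tracefs(candidates=TRACEFS_CANDIDATES):
    """Return the first candidate that has events/vmscan, or None."""
    for path in candidates:
        if os.path.isdir(os.path.join(path, "events", "vmscan")):
            return path
    return None


def set_vmscan_events(tracefs, enabled):
    """Write 1 or 0 to events/vmscan/enable to toggle the whole group."""
    path = os.path.join(tracefs, "events", "vmscan", "enable")
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path


def capture(tracefs, out_path, max_lines):
    """Copy up to max_lines lines from trace_pipe into out_path.

    Reading trace_pipe blocks until events arrive, so this returns only
    once max_lines lines have been seen or the pipe is closed.
    """
    count = 0
    with open(os.path.join(tracefs, "trace_pipe")) as src, \
            open(out_path, "w") as dst:
        for line in src:
            dst.write(line)
            count += 1
            if count >= max_lines:
                break
    return count


def main(out_path="vmscan-trace.txt", max_lines=100000):
    """Enable the vmscan events, capture them, then disable them again."""
    tracefs = find_tracefs()
    if tracefs is None:
        raise RuntimeError("no tracefs with events/vmscan found")
    set_vmscan_events(tracefs, True)
    try:
        return capture(tracefs, out_path, max_lines)
    finally:
        set_vmscan_events(tracefs, False)
```

Call main() while the rsync is running and the stalls are showing up,
then attach the resulting file to the report.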
-- 
Michal Hocko
SUSE Labs