From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 20 Aug 2016 13:16:43 +0100
From: Mel Gorman
To: Dave Chinner
Cc: Linus Torvalds, Michal Hocko, Minchan Kim, Vladimir Davydov,
	Johannes Weiner, Vlastimil Babka, Andrew Morton, Bob Peterson,
	"Kirill A. Shutemov", "Huang, Ying", Christoph Hellwig,
	Wu Fengguang, LKP, Tejun Heo, LKML
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Message-ID: <20160820121643.GR8119@techsingularity.net>
References: <20160817154907.GI8119@techsingularity.net>
 <20160818004517.GJ8119@techsingularity.net>
 <20160818071111.GD22388@dastard>
 <20160818132414.GK8119@techsingularity.net>
 <20160818211949.GE22388@dastard>
 <20160819104946.GL8119@techsingularity.net>
 <20160819234839.GG22388@dastard>
In-Reply-To: <20160819234839.GG22388@dastard>
User-Agent: Mutt/1.5.23 (2014-03-12)

On Sat, Aug 20, 2016 at 09:48:39AM +1000, Dave Chinner wrote:
> On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote:
> > On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote:
> > > It *could* be as simple/stupid as just saying "let's allocate the page
> > > cache for new pages from the current node" - and if the process that
> > > dirties pages just stays around on one single node, that might already
> > > be sufficient.
> > >
> > > So just for testing purposes, you could try changing that
> > >
> > >         return alloc_pages(gfp, 0);
> > >
> > > in __page_cache_alloc() into something like
> > >
> > >         return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0);
> > >
> > > or something.
> > >
> >
> > The test would be interesting but I believe that keeping heavy writers
> > on one node will force them to stall early on dirty balancing even if
> > there is plenty of free memory on other nodes.
>
> Well, it depends on the speed of the storage. The higher the speed
> of the storage, the less we care about stalling on dirty pages
> during reclaim. i.e. faster storage == shorter stalls. We really
> should stop thinking we need to optimise reclaim purely for the
> benefit of slow disks. 500MB/s write speed with latencies of
> under a couple of milliseconds is common hardware these days. pcie
> based storage (e.g. m2, nvme) is rapidly becoming commonplace and
> they can easily do 1-2GB/s write speeds.
>

I partially agree. I've long been of the opinion that a dirty_time limit
would be desirable: cap the amount of dirty data by the estimated number
of microseconds required to sync it, with a default of something like 5
seconds. It's non-trivial, as the write speed of every BDI would have to
be estimated, and on rotational storage that estimate would be unreliable.

A short-term practical idea would be to distribute pages for writing
across nodes only when the dirty limit on a given node is almost reached.
For fast storage, that distribution may never happen. Neither idea would
actually impact the current problem, though, unless it were combined with
discarding clean cache aggressively when the underlying storage is fast.
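To make the dirty_time idea a little more concrete, a minimal sketch of the
limit calculation, reusing the smoothed per-wb write bandwidth estimate that
writeback already maintains. The helper name and the dirty_time_msecs knob
are invented purely for illustration; none of this is an existing interface:

#include <linux/backing-dev.h>
#include <linux/math64.h>
#include <linux/time64.h>

/*
 * Hypothetical sketch only: express the dirty limit as an amount of
 * time rather than an amount of memory.  Allow roughly
 * dirty_time_msecs worth of dirty pages, based on the smoothed
 * write bandwidth of this wb (tracked in pages per second).
 */
static unsigned long wb_dirty_time_limit(struct bdi_writeback *wb,
					 unsigned int dirty_time_msecs)
{
	unsigned long bw = READ_ONCE(wb->avg_write_bandwidth);

	/* pages/sec * msecs / 1000 == pages allowed to be dirty */
	return div_u64((u64)bw * dirty_time_msecs, MSEC_PER_SEC);
}

The difficulty is the one already mentioned: on rotational storage
avg_write_bandwidth swings with the access pattern, so the resulting limit
would be unstable.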
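For anyone who wants to run Linus's experiment, this is how I read the
suggested change, with the stray closing parenthesis moved to where it was
presumably intended. A test-only sketch of the CONFIG_NUMA variant in
mm/filemap.c, not a proposed fix:

#include <linux/gfp.h>
#include <linux/pagemap.h>
#include <linux/smp.h>
#include <linux/topology.h>

/*
 * Test hack: always allocate page cache pages on the node the
 * allocating task is currently running on.
 */
struct page *__page_cache_alloc(gfp_t gfp)
{
	return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0);
}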
Hence, it would still be nice if the contention problem could be mitigated.
Did that last patch help any?

-- 
Mel Gorman
SUSE Labs