From: Linus Torvalds
Date: Mon, 15 Aug 2016 16:48:36 -0700
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
To: Dave Chinner, Mel Gorman, Johannes Weiner, Vlastimil Babka, Andrew Morton
Cc: Bob Peterson, "Kirill A. Shutemov", "Huang, Ying", Christoph Hellwig,
 Wu Fengguang, LKP, Tejun Heo, LKML

On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds wrote:
>
> None of this code is all that new, which is annoying. This must have
> gone on forever, ...

ooh. Wait, I take that back.

We actually have some very recent changes that I didn't even think
about that went into this very merge window. In particular, I wonder
if it's all (or at least partly) due to the new per-node LRU lists.

So in shrink_page_list(), when kswapd is encountering a page that is
under page writeback due to page reclaim, it does:

        if (current_is_kswapd() &&
            PageReclaim(page) &&
            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
                nr_immediate++;
                goto keep_locked;

which basically ignores that page and puts it back on the LRU list.

But that "is this node under writeback" is new - it now does that per
node, and it *used* to do it per zone (so it _used_ to test "is this
zone under writeback").

All the mapping pages used to be in the same zone, so I think it
effectively single-threaded the kswapd reclaim for one mapping under
reclaim writeback. But in your cases, you have multiple nodes...

Ok, that's a lot of hand-wavy new-age crystal healing thinking.
Really, I haven't looked at it more than "this is one thing that has
changed recently, I wonder if it changes the patterns and could
explain much higher spin_lock contention on the mapping->tree_lock".

I'm adding Mel Gorman and his band of miscreants to the cc, so that
they can tell me that I'm full of shit, and completely missed on what
that zone->node change actually ends up meaning. Mel?

The issue is that Dave Chinner is seeing some nasty spinlock
contention on "mapping->tree_lock":

>  31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath

and one of the main paths is this:

>    - 30.29% kswapd
>       - 30.23% shrink_node
>          - 30.07% shrink_node_memcg.isra.75
>             - 30.15% shrink_inactive_list
>                - 29.49% shrink_page_list
>                   - 22.79% __remove_mapping
>                      - 22.27% _raw_spin_lock_irqsave
>                           __pv_queued_spin_lock_slowpath

so there's something ridiculously bad going on with a fairly simple
benchmark.

Dave's benchmark is literally just a "write a new 48GB file in
single-page chunks on a 4-node machine". Nothing odd - not rewriting
files, not seeking around, no nothing. You can probably recreate it
with a silly

    dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile

although Dave actually had something rather fancier, I think.
                   Linus
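
For context on the zone -> node point above: before the per-node LRU
change, the same kswapd test in shrink_page_list() was keyed off the
zone rather than the node. Roughly, it looked like the fragment below;
this is a sketch reconstructed from the per-zone naming (a ZONE_WRITEBACK
bit in the zone's flags), not a verbatim quote of any particular tree:

        if (current_is_kswapd() &&
            PageReclaim(page) &&
            test_bit(ZONE_WRITEBACK, &zone->flags)) {
                nr_immediate++;
                goto keep_locked;

The difference Linus is pointing at is only which structure carries the
"under writeback" bit: previously a single zone held it for all of a
mapping's pages, whereas now it lives on the node (pgdat) - and Dave's
machine has four of those.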
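For the record, the dd numbers do add up: 4096-byte blocks times
12*1024*1024 blocks is 51,539,607,552 bytes, i.e. exactly 48 GiB written
one page-sized chunk at a time, matching the "write a new 48GB file in
single-page chunks" description above.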