From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=CsDi=3F=vger.kernel.org=linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-8.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS,
	UNPARSEABLE_RELAY,USER_AGENT_GIT autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id DEB17C32771
	for <linux-kernel@archiver.kernel.org>; Thu, 16 Jan 2020 03:05:15 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id BB84B214AF
	for <linux-kernel@archiver.kernel.org>; Thu, 16 Jan 2020 03:05:15 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1730410AbgAPDFP (ORCPT
        <rfc822;linux-kernel@archiver.kernel.org>);
        Wed, 15 Jan 2020 22:05:15 -0500
Received: from out30-42.freemail.mail.aliyun.com ([115.124.30.42]:47068 "EHLO
        out30-42.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1730244AbgAPDFO (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 15 Jan 2020 22:05:14 -0500
X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R461e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01f04455;MF=alex.shi@linux.alibaba.com;NM=1;PH=DS;RN=13;SR=0;TI=SMTPD_---0TnrEHFy_1579143910;
Received: from localhost(mailfrom:alex.shi@linux.alibaba.com fp:SMTPD_---0TnrEHFy_1579143910)
          by smtp.aliyun-inc.com(127.0.0.1);
          Thu, 16 Jan 2020 11:05:11 +0800
From:   Alex Shi <alex.shi@linux.alibaba.com>
To:     cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org, akpm@linux-foundation.org,
        mgorman@techsingularity.net, tj@kernel.org, hughd@google.com,
        khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com,
        yang.shi@linux.alibaba.com, willy@infradead.org,
        shakeelb@google.com, hannes@cmpxchg.org
Subject: [PATCH v8 00/10] per lruvec lru_lock for memcg
Date:   Thu, 16 Jan 2020 11:04:59 +0800
Message-Id: <1579143909-156105-1-git-send-email-alex.shi@linux.alibaba.com>
X-Mailer: git-send-email 1.8.3.1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi all,

This patchset move lru_lock into lruvec, give a lru_lock for each of
lruvec, thus bring a lru_lock for each of memcg per node. So on a large
node machine, each of memcg don't need suffer from per node pgdat->lru_lock
waiting. They could go fast with their self lru_lock.

We introduce function lock_page_lruvec, which will lock the page's
memcg and then memcg's lruvec->lru_lock(Thanks Johannes Weiner,
Hugh Dickins and Konstantin Khlebnikov suggestion/reminder) to replace
old pgdat->lru_lock.

Following Daniel Jordan's suggestion, I run 208 'dd' with on 104
containers on a 2s * 26cores * HT box with a modefied case:
  https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, the readtwice performance increased about 80%
with containers. And no performance drops w/o container.

Another way to guard move_account is by lru_lock instead of move_lock 
Considering the memcg move task path:
   mem_cgroup_move_task:
     mem_cgroup_move_charge:
	lru_add_drain_all();
	atomic_inc(&mc.from->moving_account); //ask lruvec's move_lock
	synchronize_rcu();
	walk_parge_range: do charge_walk_ops(mem_cgroup_move_charge_pte_range):
	   isolate_lru_page();
	   mem_cgroup_move_account(page,)
		spin_lock(&from->move_lock) 
		page->mem_cgroup = to;
		spin_unlock(&from->move_lock) 
	   putback_lru_page(page)

to guard 'page->mem_cgroup = to' by to_vec->lru_lock has the similar effect with
move_lock. So for performance reason, both solutions are same.

Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought the same idea
8 years ago.

Thanks all the comments from Hugh Dickins, Konstantin Khlebnikov, Daniel Jordan, 
Johannes Weiner, Mel Gorman, Shakeel Butt, Rong Chen, Fengguang Wu, Yun Wang etc.
and some testing support from Intel 0days!

v8,
  a, redo lock_page_lru cleanup as Konstantin Khlebnikov suggested.
  b, fix a bug in lruvec_memcg_debug, reported by Hugh Dickins

v7,
  a, rebase on v5.5-rc3, 
  b, move the lock_page_lru() clean up before lock replace.

v6, 
  a, rebase on v5.5-rc2, and redo performance testing.
  b, pick up Johanness' comments change and a lock_page_lru cleanup.

v5,
  a, locking page's memcg according JohannesW suggestion
  b, using macro for non memcg, according to Metthew's suggestion.

v4: 
  a, fix the page->mem_cgroup dereferencing issue, thanks Johannes Weiner
  b, remove the irqsave flags changes, thanks Metthew Wilcox
  c, merge/split patches for better understanding and bisection purpose

v3: rebase on linux-next, and fold the relock fix patch into introducing patch

v2: bypass a performance regression bug and fix some function issues

v1: initial version, aim testing show 5% performance increase on a 16 threads box.


Alex Shi (9):
  mm/vmscan: remove unnecessary lruvec adding
  mm/memcg: fold lock_page_lru into commit_charge
  mm/lru: replace pgdat lru_lock with lruvec lock
  mm/lru: introduce the relock_page_lruvec function
  mm/mlock: optimize munlock_pagevec by relocking
  mm/swap: only change the lru_lock iff page's lruvec is different
  mm/pgdat: remove pgdat lru_lock
  mm/lru: add debug checking for page memcg moving
  mm/memcg: add debug checking in lock_page_memcg

Hugh Dickins (1):
  mm/lru: revise the comments of lru_lock

 Documentation/admin-guide/cgroup-v1/memcg_test.rst |  15 +--
 Documentation/admin-guide/cgroup-v1/memory.rst     |   6 +-
 Documentation/trace/events-kmem.rst                |   2 +-
 Documentation/vm/unevictable-lru.rst               |  22 ++--
 include/linux/memcontrol.h                         |  68 ++++++++++++
 include/linux/mm_types.h                           |   2 +-
 include/linux/mmzone.h                             |   5 +-
 mm/compaction.c                                    |  57 ++++++----
 mm/filemap.c                                       |   4 +-
 mm/huge_memory.c                                   |  18 ++--
 mm/memcontrol.c                                    | 115 ++++++++++++++-------
 mm/mlock.c                                         |  28 ++---
 mm/mmzone.c                                        |   1 +
 mm/page_alloc.c                                    |   1 -
 mm/page_idle.c                                     |   7 +-
 mm/rmap.c                                          |   2 +-
 mm/swap.c                                          |  75 ++++++--------
 mm/vmscan.c                                        | 115 ++++++++++++---------
 18 files changed, 326 insertions(+), 217 deletions(-)

-- 
1.8.3.1