Date: Wed, 25 Oct 2017 13:54:24 -0400
From: Johannes Weiner
To: "Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)"
Cc: Taras Kondratiuk, Michal Hocko, linux-mm@kvack.org,
    xe-linux-external@cisco.com, linux-kernel@vger.kernel.org
Subject: Re: Detecting page cache trashing state
Message-ID: <20171025175424.GA14039@cmpxchg.org>
References: <150543458765.3781.10192373650821598320@takondra-t460s>
 <20170915143619.2ifgex2jxck2xt5u@dhcp22.suse.cz>
 <150549651001.4512.15084374619358055097@takondra-t460s>
 <20170918163434.GA11236@cmpxchg.org>

Hi Ruslan,

Sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
>
> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
> version supported by us right now) and test it a bit.
>
> The overall idea definitely looks promising, although I have one
> question on usage. Will it be able to account for the time which
> processes spend handling major page faults (including fs and iowait
> time) of refaulting pages?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where the iowaits
happen on a cache miss.
If those are refaults, they'll be counted.

> As we have one big application whose code occupies a large amount of
> space in the page cache, when the system reclaims some of it under
> heavy memory usage, the application starts thrashing constantly.
> Since its code is placed on squashfs, it spends the whole CPU time
> decompressing the pages, and the memdelay counters do not seem to
> detect this situation. Here are some counters to indicate this:
>
> 19:02:44    CPU    %user  %nice  %system  %iowait  %steal  %idle
> 19:02:45    all     0.00   0.00   100.00     0.00    0.00   0.00
>
> 19:02:44  pgpgin/s  pgpgout/s  fault/s  majflt/s  pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> 19:02:45  15284.00       0.00   428.00    352.00  19990.00       0.00       0.00   15802.00    0.00
>
> And as nobody is actively allocating memory anymore, it looks like
> the memdelay counters are not actively incremented:
>
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? It only counts thrashing time of refaults it can actively
detect.

Btw, how many CPUs does this system have? There is a bug in this
version in how idle time is aggregated across multiple CPUs. The
error compounds with the number of CPUs in the system.

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.

[Attachment: 0001-mm-memdelay-fix-task-flags-race-condition.patch]

From 7318c963a582833d4556c51fc2e1658e00c14e3e Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 5 Oct 2017 12:32:47 -0400
Subject: [PATCH 1/3] mm: memdelay: fix task flags race condition

WARNING: CPU: 35 PID: 2263 at ../include/linux/memcontrol.h:466

This is the memcg warning that current->memcg_may_oom is set when it
doesn't expect it to be. The warning came in new with the memdelay
patches.
They add another task flag in the same int as memcg_may_oom, but
modify it from try_to_wake_up(), i.e. from a task that isn't current.
This isn't safe.

Move the flag to the other int holding task flags, whose modifications
are serialized through the scheduler locks.

Signed-off-by: Johannes Weiner
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de15e3c8c43a..d1aa8f4c19ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_memdelay_requeue:1;
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
@@ -651,7 +652,6 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
-	unsigned			memdelay_migrate_enqueue:1;

 	unsigned long			atomic_flags; /* Flags requiring atomic access. */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf105c870da6..b4fa806bf153 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -760,10 +760,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);

-	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
-	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->sched_memdelay_requeue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->sched_memdelay_requeue) {
 		memdelay_add_runnable(p);
-		p->memdelay_migrate_enqueue = 0;
+		p->sched_memdelay_requeue = 0;
 	} else {
 		memdelay_wakeup(p);
 	}
@@ -2065,8 +2065,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		rq = __task_rq_lock(p, &rf);
 		memdelay_del_sleeping(p);
+		p->sched_memdelay_requeue = 1;
 		__task_rq_unlock(rq, &rf);
-		p->memdelay_migrate_enqueue = 1;

 		set_task_cpu(p, cpu);
 	}
-- 
2.14.2

[Attachment: 0002-mm-memdelay-idle-time-is-not-productive-time.patch]

From 7157c70aed93990f59942d39d1c0d8948164cfe2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 5 Oct 2017 12:34:49 -0400
Subject: [PATCH 2/3] mm: memdelay: idle time is not productive time

There is an error in the multi-core logic, where memory delay numbers
drop as the number of CPU cores increases. Idle CPUs, even though they
don't host delayed processes, shouldn't contribute to the "not
delayed" bucket, because that bucket is the baseline for productivity,
and idle CPUs aren't productive.

Do not consider idle CPU time in the productivity baseline.
Signed-off-by: Johannes Weiner
---
 include/linux/memdelay.h | 3 ++-
 mm/memdelay.c            | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
index d85d01692610..c49f65338c1f 100644
--- a/include/linux/memdelay.h
+++ b/include/linux/memdelay.h
@@ -24,7 +24,8 @@ enum memdelay_task_state {
  * productivity states of all tasks inside the domain.
  */
 enum memdelay_domain_state {
-	MDS_NONE,		/* No delayed tasks */
+	MDS_IDLE,		/* No tasks */
+	MDS_NONE,		/* Working tasks */
 	MDS_PART,		/* Delayed tasks, working tasks */
 	MDS_FULL,		/* Delayed tasks, no working tasks */
 	NR_MEMDELAY_DOMAIN_STATES,

diff --git a/mm/memdelay.c b/mm/memdelay.c
index c7c32dbb67ac..ea5ede79f044 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -118,8 +118,10 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	else if (mdc->tasks[MTS_DELAYED])
 		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
 			MDS_PART : MDS_FULL;
-	else
+	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
 		state = MDS_NONE;
+	else
+		state = MDS_IDLE;

 	if (mdc->state == state)
 		return;
-- 
2.14.2

[Attachment: 0003-mm-memdelay-drop-IO-as-productive-time.patch]

From ea663e42b24871d370b6ddbfbf47c1775a2f9f09 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 28 Sep 2017 10:36:39 -0700
Subject: [PATCH 3/3] mm: memdelay: drop IO as productive time

Counting IO as productive time distorts the sense of pressure with
workloads that are naturally IO-bound. Memory pressure can cause IO,
and thus cause "productive" IO to slow down - yet we don't attribute
this slowdown properly to a shortage of memory.

Disregard IO time altogether, and use CPU time alone as the baseline.
Signed-off-by: Johannes Weiner
---
 mm/memdelay.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memdelay.c b/mm/memdelay.c
index ea5ede79f044..fbce1d4ba142 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -113,12 +113,11 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	 * one running the workload, the domain is considered fully
 	 * blocked 50% of the time.
 	 */
-	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+	if (mdc->tasks[MTS_DELAYED_ACTIVE])
 		state = MDS_FULL;
 	else if (mdc->tasks[MTS_DELAYED])
-		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
-			MDS_PART : MDS_FULL;
-	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
+		state = mdc->tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
+	else if (mdc->tasks[MTS_RUNNABLE])
 		state = MDS_NONE;
 	else
 		state = MDS_IDLE;
-- 
2.14.2