LKML Archive on lore.kernel.org
From: Johannes Weiner <hannes@cmpxchg.org>
To: "Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)" 
	<rruslich@cisco.com>
Cc: Taras Kondratiuk <takondra@cisco.com>,
	Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, xe-linux-external@cisco.com,
	linux-kernel@vger.kernel.org
Subject: Re: Detecting page cache thrashing state
Date: Wed, 25 Oct 2017 13:54:24 -0400
Message-ID: <20171025175424.GA14039@cmpxchg.org>
In-Reply-To: <acbf4417-4ded-fa03-7b8d-34dc0803027c@cisco.com>

[-- Attachment #1: Type: text/plain, Size: 2201 bytes --]

Hi Ruslan,

sorry about the delayed response, I missed the new activity in this
older thread.

On Thu, Sep 28, 2017 at 06:49:07PM +0300, Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco) wrote:
> Hi Johannes,
> 
> Hopefully I was able to rebase the patch on top of v4.9.26 (the latest
> version we currently support) and test it a bit.
> The overall idea definitely looks promising, although I have one
> question on usage.
> Will it be able to account for the time which processes spend handling
> major page faults (including fs and iowait time) of refaulting pages?

That's the main thing it should measure! :)

The lock_page() and wait_on_page_locked() calls are where iowaits
happen on a cache miss. If those are refaults, they'll be counted.
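
To illustrate, this is roughly what the accounting around such a wait
looks like. A sketch only: memdelay_enter()/memdelay_leave() are the
helpers from this series, but wait_for_page() is a made-up wrapper and
the real call site in mm/filemap.c looks a bit different, especially
after your rebase:

	/*
	 * Mark the task delayed while it waits for a refaulting page,
	 * so the fs and iowait time of the refault is captured.
	 */
	static void wait_for_page(struct page *page, bool refault)
	{
		unsigned long mdflags;

		if (refault)
			memdelay_enter(&mdflags);
		/* sleeps through readpage and, e.g., squashfs decompression */
		wait_on_page_locked(page);
		if (refault)
			memdelay_leave(&mdflags);
	}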

> As we have one big application whose code occupies a large amount of
> page cache, when the system comes under heavy memory usage and
> reclaims some of it, the application starts thrashing constantly.
> Since its code is stored on squashfs, it spends all of its CPU time
> decompressing pages, and the memdelay counters do not seem to detect
> this situation.
> Here are some counters to illustrate this:
> 
> 19:02:44        CPU     %user     %nice   %system   %iowait    %steal     %idle
> 19:02:45        all      0.00      0.00    100.00      0.00      0.00      0.00
>
> 19:02:44     pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 19:02:45     15284.00      0.00    428.00    352.00  19990.00      0.00      0.00  15802.00      0.00
> 
> And as nobody is actively allocating memory anymore, it looks like the
> memdelay counters are not being incremented:
> 
> [:~]$ cat /proc/memdelay
> 268035776
> 6.13 5.43 3.58
> 1.90 1.89 1.26

How does it correlate with /proc/vmstat::workingset_activate during
that time? Memdelay only counts the thrashing time of refaults it can
actually detect.
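
If you want to watch both at once, a quick userspace hack like this
can print per-second deltas of the workingset counters next to your
/proc/memdelay readings (not part of the series, just for
correlation):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Return the current value of a /proc/vmstat counter, 0 if absent. */
	static unsigned long vmstat(const char *key)
	{
		char name[64];
		unsigned long v, val = 0;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 0;
		while (fscanf(f, "%63s %lu", name, &v) == 2) {
			if (!strcmp(name, key)) {
				val = v;
				break;
			}
		}
		fclose(f);
		return val;
	}

	int main(void)
	{
		unsigned long r0 = vmstat("workingset_refault");
		unsigned long a0 = vmstat("workingset_activate");

		for (;;) {
			unsigned long r, a;

			sleep(1);
			r = vmstat("workingset_refault");
			a = vmstat("workingset_activate");
			printf("refault/s %5lu  activate/s %5lu\n",
			       r - r0, a - a0);
			r0 = r;
			a0 = a;
		}
	}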

Btw, how many CPUs does this system have? There is a bug in this
version in how idle time is aggregated across multiple CPUs. The error
compounds with the number of CPUs in the system.

I'm attaching 3 bugfixes that go on top of what you have. There might
be some conflicts, but they should be minor variable naming issues.


[-- Attachment #2: 0001-mm-memdelay-fix-task-flags-race-condition.patch --]
[-- Type: text/x-diff, Size: 2548 bytes --]

From 7318c963a582833d4556c51fc2e1658e00c14e3e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 5 Oct 2017 12:32:47 -0400
Subject: [PATCH 1/3] mm: memdelay: fix task flags race condition

WARNING: CPU: 35 PID: 2263 at ../include/linux/memcontrol.h:466

This is the memcg code warning that current->memcg_may_oom is set when
it doesn't expect it to be.

The warning is new with the memdelay patches. They add another task
flag in the same int as memcg_may_oom, but modify it from
try_to_wake_up(), i.e. from a task that isn't current. Since the
bitfields in that int are updated with non-atomic read-modify-write
cycles, this isn't safe.

Move the flag to the other int holding task flags, whose modifications
are serialized through the scheduler locks.
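
The bug is easy to demonstrate in userspace, since bitfields sharing
one word are updated with plain read-modify-write cycles and a writer
on another thread can wipe out a concurrent update. Illustration only,
not part of the patch; the lost-update count depends on timing and may
need a few runs to show up:

	#include <pthread.h>
	#include <stdio.h>

	/* Two one-bit flags sharing a single word, like the task flags. */
	static volatile struct {
		unsigned a:1;	/* written by "current" (main thread) */
		unsigned b:1;	/* written by a remote waker (second thread) */
	} flags;

	static volatile int stop;

	static void *waker(void *arg)
	{
		while (!stop) {
			flags.b = 1;	/* word-sized read-modify-write */
			flags.b = 0;
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		long lost = 0;

		pthread_create(&t, NULL, waker, NULL);
		for (long i = 0; i < 100000000; i++) {
			flags.a = 1;
			if (!flags.a)	/* the waker's RMW clobbered our bit */
				lost++;
			flags.a = 0;
		}
		stop = 1;
		pthread_join(t, NULL);
		printf("lost flag updates: %ld\n", lost);
		return 0;
	}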

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched.h | 2 +-
 kernel/sched/core.c   | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de15e3c8c43a..d1aa8f4c19ab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,7 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			sched_memdelay_requeue:1;
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
 
@@ -651,7 +652,6 @@ struct task_struct {
 	/* disallow userland-initiated cgroup migration */
 	unsigned			no_cgroup_migration:1;
 #endif
-	unsigned			memdelay_migrate_enqueue:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf105c870da6..b4fa806bf153 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -760,10 +760,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_RESTORE))
 		sched_info_queued(rq, p);
 
-	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->memdelay_migrate_enqueue);
-	if (!(flags & ENQUEUE_WAKEUP) || p->memdelay_migrate_enqueue) {
+	WARN_ON_ONCE(!(flags & ENQUEUE_WAKEUP) && p->sched_memdelay_requeue);
+	if (!(flags & ENQUEUE_WAKEUP) || p->sched_memdelay_requeue) {
 		memdelay_add_runnable(p);
-		p->memdelay_migrate_enqueue = 0;
+		p->sched_memdelay_requeue = 0;
 	} else {
 		memdelay_wakeup(p);
 	}
@@ -2065,8 +2065,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 		rq = __task_rq_lock(p, &rf);
 		memdelay_del_sleeping(p);
+		p->sched_memdelay_requeue = 1;
 		__task_rq_unlock(rq, &rf);
-		p->memdelay_migrate_enqueue = 1;
 
 		set_task_cpu(p, cpu);
 	}
-- 
2.14.2


[-- Attachment #3: 0002-mm-memdelay-idle-time-is-not-productive-time.patch --]
[-- Type: text/x-diff, Size: 1796 bytes --]

From 7157c70aed93990f59942d39d1c0d8948164cfe2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 5 Oct 2017 12:34:49 -0400
Subject: [PATCH 2/3] mm: memdelay: idle time is not productive time

There is an error in the multi-core logic: memory delay numbers drop
as the number of CPU cores increases. Idle CPUs, even though they
don't host delayed processes, shouldn't contribute to the "not
delayed" bucket, because that bucket is the baseline for productivity
and idle CPUs aren't productive.

Do not consider idle CPU time in the productivity baseline.
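
To illustrate the compounding error, roughly: on a 32-CPU machine with
one CPU running a thrashing task and the other 31 idle, the delayed
CPU contributes one share of delayed time while the idle CPUs
contribute 31 shares of "productive" baseline time, so the pressure
reading comes out near 1/32 instead of the 100% that describes the
actual workload. With idle time excluded from the baseline, the same
situation reads as fully delayed.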

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memdelay.h | 3 ++-
 mm/memdelay.c            | 4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/memdelay.h b/include/linux/memdelay.h
index d85d01692610..c49f65338c1f 100644
--- a/include/linux/memdelay.h
+++ b/include/linux/memdelay.h
@@ -24,7 +24,8 @@ enum memdelay_task_state {
  * productivity states of all tasks inside the domain.
  */
 enum memdelay_domain_state {
-	MDS_NONE,		/* No delayed tasks */
+	MDS_IDLE,		/* No tasks */
+	MDS_NONE,		/* Working tasks */
 	MDS_PART,		/* Delayed tasks, working tasks */
 	MDS_FULL,		/* Delayed tasks, no working tasks */
 	NR_MEMDELAY_DOMAIN_STATES,
diff --git a/mm/memdelay.c b/mm/memdelay.c
index c7c32dbb67ac..ea5ede79f044 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -118,8 +118,10 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	else if (mdc->tasks[MTS_DELAYED])
 		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
 			MDS_PART : MDS_FULL;
-	else
+	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
 		state = MDS_NONE;
+	else
+		state = MDS_IDLE;
 
 	if (mdc->state == state)
 		return;
-- 
2.14.2


[-- Attachment #4: 0003-mm-memdelay-drop-IO-as-productive-time.patch --]
[-- Type: text/x-diff, Size: 1451 bytes --]

From ea663e42b24871d370b6ddbfbf47c1775a2f9f09 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <jweiner@fb.com>
Date: Thu, 28 Sep 2017 10:36:39 -0700
Subject: [PATCH 3/3] mm: memdelay: drop IO as productive time

Counting IO as productive time distorts the sense of pressure with
workloads that are naturally IO-bound. Memory pressure can cause IO,
and it can slow down IO that counts as "productive", yet we don't
attribute this slowdown properly to a shortage of memory.

Disregard IO time altogether, and use CPU time alone as the baseline.
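
Concretely, with one task blocked on a refaulting page while another
waits on regular, unrelated IO, the domain used to rate as only
partially delayed because the second task's iowait counted as
productive time; with IO out of the baseline, it now rates as fully
delayed, since no task is actually making progress on a CPU.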

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memdelay.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memdelay.c b/mm/memdelay.c
index ea5ede79f044..fbce1d4ba142 100644
--- a/mm/memdelay.c
+++ b/mm/memdelay.c
@@ -113,12 +113,11 @@ static void domain_cpu_update(struct memdelay_domain *md, int cpu,
 	 * one running the workload, the domain is considered fully
 	 * blocked 50% of the time.
 	 */
-	if (mdc->tasks[MTS_DELAYED_ACTIVE] && !mdc->tasks[MTS_IOWAIT])
+	if (mdc->tasks[MTS_DELAYED_ACTIVE])
 		state = MDS_FULL;
 	else if (mdc->tasks[MTS_DELAYED])
-		state = (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT]) ?
-			MDS_PART : MDS_FULL;
-	else if (mdc->tasks[MTS_RUNNABLE] || mdc->tasks[MTS_IOWAIT])
+		state = mdc->tasks[MTS_RUNNABLE] ? MDS_PART : MDS_FULL;
+	else if (mdc->tasks[MTS_RUNNABLE])
 		state = MDS_NONE;
 	else
 		state = MDS_IDLE;
-- 
2.14.2


Thread overview: 21+ messages
2017-09-15  0:16 Taras Kondratiuk
2017-09-15 11:55 ` Zdenek Kabelac
2017-09-15 14:22 ` Daniel Walker
2017-09-15 16:38   ` Taras Kondratiuk
2017-09-15 17:31     ` Daniel Walker
2017-09-15 14:36 ` Michal Hocko
2017-09-15 17:28   ` Taras Kondratiuk
2017-09-18 16:34     ` Johannes Weiner
2017-09-19 10:55       ` [PATCH 1/3] sched/loadavg: consolidate LOAD_INT, LOAD_FRAC macros kbuild test robot
2017-09-19 11:02       ` kbuild test robot
2017-09-28 15:49       ` Detecting page cache thrashing state Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-10-25 16:53         ` Daniel Walker
2017-10-25 17:54         ` Johannes Weiner [this message]
2017-10-27 20:19           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-20 19:40             ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-11-27  2:18               ` Minchan Kim
2017-10-26  3:53         ` vinayak menon
2017-10-27 20:29           ` Ruslan Ruslichenko -X (rruslich - GLOBALLOGIC INC at Cisco)
2017-09-15 21:20   ` vcaputo
2017-09-15 23:40     ` Taras Kondratiuk
2017-09-18  5:55     ` Michal Hocko
