From: Feng Tang <feng.tang@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Arjan van de Ven <arjan@linux.intel.com>,
	Jonathan Corbet <corbet@lwn.net>, Ingo Molnar <mingo@kernel.org>,
	Eric W Biederman <ebiederm@xmission.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Dmitry Vyukov <dvyukov@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andy Lutomirski <luto@kernel.org>,
	Ying Huang <ying.huang@intel.com>,
	linux-kernel@vger.kernel.org
Cc: Feng Tang <feng.tang@intel.com>
Subject: [RFC PATCH 0/3] latencytop lock usage improvement
Date: Mon, 29 Apr 2019 16:03:28 +0800
Message-ID: <1556525011-28022-1-git-send-email-feng.tang@intel.com>

Hi All,

latencytop is a very nice tool for tracing system latency hotspots, and
we heavily use it in our LKP test suites.

However, we found that in some benchmark tests there is very severe lock
contention which consumes 70%+ of CPU cycles in the perf profile,
especially for benchmarks involving massive process scheduling on
platforms with many CPUs, like running hackbench
  "hackbench -g 1408 --process --pipe -l 1875 -s 1024"
on a 2-socket Xeon E5-2699 machine (44 cores/88 CPUs) with 64GB RAM.

Due to that, we have to explicitly disable latencytop for those test
cases.

By checking the code, we found that latencytop uses one global spinlock
to protect the latency data updates for both the global records and the
per-task ones.
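
For reference, a simplified sketch of the current scheme in
kernel/latencytop.c (not verbatim; the per-task update is shown here as
a helper, account_task_latency(), for brevity, while in the real code it
is open-coded in the same function):

static DEFINE_RAW_SPINLOCK(latency_lock);
static struct latency_record latency_record[MAXLR];	/* global table */

void __sched
__account_scheduler_latency(struct task_struct *tsk, int usecs, int inter)
{
	struct latency_record lat;
	unsigned long flags;

	/* ... fill 'lat' (count, time, max, backtrace) outside the lock ... */

	raw_spin_lock_irqsave(&latency_lock, flags);

	/* One lock serializes updates to both shared structures. */
	account_global_scheduler_latency(tsk, &lat);	/* global table */
	account_task_latency(tsk, &lat);		/* tsk->latency_record[] */

	raw_spin_unlock_irqrestore(&latency_lock, flags);
}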

So initially we tried splitting the single lock into one global lock and
one per-task lock for better lock granularity, but the lock contention
was only reduced slightly (only a 1% drop in the perf profile). The
contention stays severe because the benchmark causes massive scheduling
across the 88 CPUs, and every schedule-in still has to acquire the
global lock to update the global data.
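
Roughly, the split-lock variant looks like this (the per-task lock field
name, latency_record_lock, and the helper names are illustrative, not
necessarily what patch 2/3 uses):

static DEFINE_RAW_SPINLOCK(latency_lock);	/* now covers only the global table */

void __sched
__account_scheduler_latency(struct task_struct *tsk, int usecs, int inter)
{
	struct latency_record lat;
	unsigned long flags;

	/* ... fill 'lat' as before ... */

	/* Still taken on every schedule-in, so the hot spot remains. */
	raw_spin_lock_irqsave(&latency_lock, flags);
	account_global_scheduler_latency(tsk, &lat);
	raw_spin_unlock_irqrestore(&latency_lock, flags);

	/* New finer-grained lock for tsk->latency_record[]. */
	raw_spin_lock_irqsave(&tsk->latency_record_lock, flags);
	account_task_latency(tsk, &lat);
	raw_spin_unlock_irqrestore(&tsk->latency_record_lock, flags);
}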

Then we tried to reduce the operations inside latency_lock's protection
(between the raw_spin_lock_irqsave/raw_spin_unlock_irqrestore pair),
but that also gave only a very small improvement, and the lock
contention stayed high.
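
Even with the preparation of the new sample moved out of the critical
section, what must stay under latency_lock is the merge into the shared
table, and that still runs on every schedule-in. A rough illustration of
that irreducible part (inside __account_scheduler_latency(); the
same_backtrace() helper is hypothetical, the real code compares the
backtrace entries inline):

	raw_spin_lock_irqsave(&latency_lock, flags);
	for (i = 0; i < MAXLR; i++) {
		/* Fold the sample into a record with the same backtrace. */
		if (same_backtrace(&latency_record[i], &lat)) {
			latency_record[i].count++;
			latency_record[i].time += lat.time;
			if (lat.time > latency_record[i].max)
				latency_record[i].max = lat.time;
			break;
		}
		/* ... otherwise claim a free slot for it ... */
	}
	raw_spin_unlock_irqrestore(&latency_lock, flags);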

At last, we tried adding an extra lazy mode which only updates the
global latency data when a task exits, while still updating the per-task
data in real time. This reduces the lock contention from 70%+ to less
than 5%, while boosting that hackbench case's throughput by 276%.
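
Conceptually, the lazy mode moves the global-table merge out of the
scheduler hot path and into the exit path. A rough sketch, with
illustrative names (not necessarily those used in patch 3/3):

/* Scheduler hot path: in lazy mode, touch only the task's own records,
 * so no global lock is taken on every schedule-in. */
void __sched
__account_scheduler_latency(struct task_struct *tsk, int usecs, int inter)
{
	struct latency_record lat;

	/* ... fill 'lat' (count, time, max, backtrace) ... */
	account_task_latency(tsk, &lat);	/* tsk->latency_record[] */
}

/* Exit path, e.g. hooked from do_exit(): fold the task's records into
 * the global table, so latency_lock is taken once per task lifetime
 * instead of once per schedule-in. */
void account_global_latency_on_exit(struct task_struct *tsk)
{
	unsigned long flags;
	int i;

	raw_spin_lock_irqsave(&latency_lock, flags);
	for (i = 0; i < tsk->latency_record_count; i++)
		/* illustrative helper: accumulate into latency_record[] */
		merge_global_latency_record(&tsk->latency_record[i]);
	raw_spin_unlock_irqrestore(&latency_lock, flags);
}

The implied tradeoff is that the global /proc/latency_stats view lags
behind still-running tasks until they exit, which is presumably why this
is an extra opt-in mode rather than a change to the default behaviour.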

Please help to review, thanks!

Patch 1/3 adds the missing sysctl description for "latencytop" and I
think it could be merged independently.

Patch 2/3 splits latency_lock into a global lock and a per-task lock.
Actually, a more aggressive thought is that the per-task lock may not be
needed at all, as the per-task data is only updated at enqueue time for
a task, which implies there is no race condition on it.

Patch 3/3 implements the lazy mode and updates the documentation.

Thanks,
Feng


Feng Tang (3):
  kernel/sysctl: add description for "latencytop"
  latencytop: split latency_lock to global lock and per task lock
  latencytop: add a lazy mode for updating global data

 Documentation/sysctl/kernel.txt | 23 ++++++++++++++++++++
 include/linux/latencytop.h      |  5 +++++
 include/linux/sched.h           |  1 +
 init/init_task.c                |  3 +++
 kernel/exit.c                   |  2 ++
 kernel/fork.c                   |  4 ++++
 kernel/latencytop.c             | 47 +++++++++++++++++++++++++++++++++++------
 7 files changed, 78 insertions(+), 7 deletions(-)

-- 
2.7.4


Thread overview: 8+ messages
2019-04-29  8:03 Feng Tang [this message]
2019-04-29  8:03 ` [RFC PATCH 1/3] kernel/sysctl: add description for "latencytop" Feng Tang
2019-04-29  8:03 ` [RFC PATCH 2/3] latencytop: split latency_lock to global lock and per task lock Feng Tang
2019-04-29  8:03 ` [RFC PATCH 3/3] latencytop: add a lazy mode for updating global latency data Feng Tang
2019-04-30  8:09 ` [RFC PATCH 0/3] latencytop lock usage improvement Peter Zijlstra
2019-04-30  8:35   ` Feng Tang
2019-04-30  9:10     ` Peter Zijlstra
2019-04-30  9:22       ` Feng Tang
