linux-mm.kvack.org archive mirror
* [RFC 0/1] mm: Track per-task tlb events
@ 2022-09-14  1:51 Joe Damato
  2022-09-14  1:51 ` [RFC 1/1] mm: Add per-task struct tlb counters Joe Damato
  0 siblings, 1 reply; 8+ messages in thread
From: Joe Damato @ 2022-09-14  1:51 UTC (permalink / raw)
  To: x86, linux-mm, Ben Segall, Daniel Bristot de Oliveira,
	Dietmar Eggemann, H. Peter Anvin, linux-fsdevel, linux-kernel,
	Mel Gorman, Steven Rostedt, Valentin Schneider
  Cc: Joe Damato

Greetings:

TLB shootdown events can be measured on a per-CPU basis by examining
/proc/interrupts. Further information about TLB events can be harvested
from /proc/vmstat if CONFIG_DEBUG_TLBFLUSH is enabled, but this information
is system-wide.

This information is useful, but on a busy system with many tasks it can be
difficult to disambiguate the source of the TLB shootdown events.

Having this information tracked per-task can enable developers to fix or
tweak userland allocators to reduce the number of IPIs and improve
application performance.

This change adds two new fields to task_struct and signal_struct to help
track TLB events:

	- ngtlbflush: number of tlb flushes generated
	- nrtlbflush: number of tlb flushes received

These stats are exported in /proc/[pid]/stat alongside similar metrics
(e.g. min_flt and maj_flt) for analysis.
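With a patched kernel, the counters can be read like any other stat field. A minimal sketch, assuming the two counters land as the final two fields of /proc/[pid]/stat (which is where this patch emits them; on an unpatched kernel the last two fields are unrelated):

```shell
# Print the per-task TLB flush counters for the current shell.
# Field positions are an assumption based on where the patch appends
# them; using $(NF-1)/$NF avoids hardcoding a field count.
pid=$$
awk '{print "tlb flushes generated:", $(NF-1), "received:", $NF}' "/proc/$pid/stat"
```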

I've gotten code into kernel networking / drivers before, but this is my
first time hacking on mm and mm-adjacent code. Please let me know if
there's a glaring issue and I'll be happy to tweak this code as necessary.

If this seems OK, I'll send an official v1.

Thanks!

Joe Damato (1):
  mm: Add per-task struct tlb counters

 arch/x86/mm/tlb.c            | 2 ++
 fs/proc/array.c              | 9 +++++++++
 include/linux/sched.h        | 6 ++++++
 include/linux/sched/signal.h | 1 +
 kernel/exit.c                | 6 ++++++
 kernel/fork.c                | 1 +
 6 files changed, 25 insertions(+)

-- 
2.7.4



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14  1:51 [RFC 0/1] mm: Track per-task tlb events Joe Damato
@ 2022-09-14  1:51 ` Joe Damato
  2022-09-14  7:40   ` Dave Hansen
  0 siblings, 1 reply; 8+ messages in thread
From: Joe Damato @ 2022-09-14  1:51 UTC (permalink / raw)
  To: x86, linux-mm, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel
  Cc: Joe Damato

TLB shootdowns are tracked globally, but on a busy system it can be
difficult to disambiguate the source of TLB shootdowns.

Add two counter fields:
	- nrtlbflush: number of tlb flush events received
	- ngtlbflush: number of tlb flush events generated

Expose those fields in /proc/[pid]/stat so that they can be analyzed
alongside similar metrics (e.g. min_flt and maj_flt).

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 arch/x86/mm/tlb.c            | 2 ++
 fs/proc/array.c              | 9 +++++++++
 include/linux/sched.h        | 6 ++++++
 include/linux/sched/signal.h | 1 +
 kernel/exit.c                | 6 ++++++
 kernel/fork.c                | 1 +
 6 files changed, 25 insertions(+)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c1e31e9..58f7c59 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -745,6 +745,7 @@ static void flush_tlb_func(void *info)
 	if (!local) {
 		inc_irq_stat(irq_tlb_count);
 		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+		current->nrtlbflush++;
 
 		/* Can only happen on remote CPUs */
 		if (f->mm && f->mm != loaded_mm)
@@ -895,6 +896,7 @@ STATIC_NOPV void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 * would not happen.
 	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	current->ngtlbflush++;
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
 	else
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b81..435afdc 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -469,6 +469,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long long start_time;
 	unsigned long cmin_flt = 0, cmaj_flt = 0;
 	unsigned long  min_flt = 0,  maj_flt = 0;
+	unsigned long ngtlbflush = 0, nrtlbflush = 0;
 	u64 cutime, cstime, utime, stime;
 	u64 cgtime, gtime;
 	unsigned long rsslim = 0;
@@ -530,11 +531,15 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 			do {
 				min_flt += t->min_flt;
 				maj_flt += t->maj_flt;
+				ngtlbflush += t->ngtlbflush;
+				nrtlbflush += t->nrtlbflush;
 				gtime += task_gtime(t);
 			} while_each_thread(task, t);
 
 			min_flt += sig->min_flt;
 			maj_flt += sig->maj_flt;
+			ngtlbflush += sig->ngtlbflush;
+			nrtlbflush += sig->nrtlbflush;
 			thread_group_cputime_adjusted(task, &utime, &stime);
 			gtime += sig->gtime;
 
@@ -554,6 +559,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	if (!whole) {
 		min_flt = task->min_flt;
 		maj_flt = task->maj_flt;
+		nrtlbflush = task->nrtlbflush;
+		ngtlbflush = task->ngtlbflush;
 		task_cputime_adjusted(task, &utime, &stime);
 		gtime = task_gtime(task);
 	}
@@ -643,6 +650,8 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
 	else
 		seq_puts(m, " 0");
 
+	seq_put_decimal_ull(m, " ", ngtlbflush);
+	seq_put_decimal_ull(m, " ", nrtlbflush);
 	seq_putc(m, '\n');
 	if (mm)
 		mmput(mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5cdf746..2a0d879 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,12 @@ struct task_struct {
 	unsigned long			min_flt;
 	unsigned long			maj_flt;
 
+	/* Number of TLB flushes generated by this task */
+	unsigned long			ngtlbflush;
+
+	/* Number of TLB flushes received by this task */
+	unsigned long			nrtlbflush;
+
 	/* Empty if CONFIG_POSIX_CPUTIMERS=n */
 	struct posix_cputimers		posix_cputimers;
 
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 2009926..4e0b09c 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -189,6 +189,7 @@ struct signal_struct {
 	struct prev_cputime prev_cputime;
 	unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
 	unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
+	unsigned long ngtlbflush, nrtlbflush;
 	unsigned long inblock, oublock, cinblock, coublock;
 	unsigned long maxrss, cmaxrss;
 	struct task_io_accounting ioac;
diff --git a/kernel/exit.c b/kernel/exit.c
index 35e0a31..5a72755 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -141,6 +141,8 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->gtime += task_gtime(tsk);
 	sig->min_flt += tsk->min_flt;
 	sig->maj_flt += tsk->maj_flt;
+	sig->ngtlbflush += tsk->ngtlbflush;
+	sig->nrtlbflush += tsk->nrtlbflush;
 	sig->nvcsw += tsk->nvcsw;
 	sig->nivcsw += tsk->nivcsw;
 	sig->inblock += task_io_get_inblock(tsk);
@@ -1095,6 +1097,10 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			p->min_flt + sig->min_flt + sig->cmin_flt;
 		psig->cmaj_flt +=
 			p->maj_flt + sig->maj_flt + sig->cmaj_flt;
+		psig->ngtlbflush +=
+			p->ngtlbflush + sig->ngtlbflush;
+		psig->nrtlbflush +=
+			p->nrtlbflush + sig->nrtlbflush;
 		psig->cnvcsw +=
 			p->nvcsw + sig->nvcsw + sig->cnvcsw;
 		psig->cnivcsw +=
diff --git a/kernel/fork.c b/kernel/fork.c
index b339918..5fa9f64 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1555,6 +1555,7 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	struct mm_struct *mm, *oldmm;
 
 	tsk->min_flt = tsk->maj_flt = 0;
+	tsk->ngtlbflush = tsk->nrtlbflush = 0;
 	tsk->nvcsw = tsk->nivcsw = 0;
 #ifdef CONFIG_DETECT_HUNG_TASK
 	tsk->last_switch_count = tsk->nvcsw + tsk->nivcsw;
-- 
2.7.4




* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14  1:51 ` [RFC 1/1] mm: Add per-task struct tlb counters Joe Damato
@ 2022-09-14  7:40   ` Dave Hansen
  2022-09-14 11:58     ` Peter Zijlstra
  2022-09-14 14:15     ` Joe Damato
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Hansen @ 2022-09-14  7:40 UTC (permalink / raw)
  To: Joe Damato, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, linux-kernel,
	linux-fsdevel

On 9/13/22 18:51, Joe Damato wrote:
> TLB shootdowns are tracked globally, but on a busy system it can be
> difficult to disambiguate the source of TLB shootdowns.
> 
> Add two counter fields:
> 	- nrtlbflush: number of tlb flush events received
> 	- ngtlbflush: number of tlb flush events generated
> 
> Expose those fields in /proc/[pid]/stat so that they can be analyzed
> alongside similar metrics (e.g. min_flt and maj_flt).

On x86 at least, we already have two other ways to count flushes.  You
even quoted them with your patch:

>  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> +	current->ngtlbflush++;
>  	if (info->end == TLB_FLUSH_ALL)
>  		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);

Granted, the count_vm_tlb...() one is debugging only.  But, did you try
to use those other mechanisms?  For instance, could you patch
count_vm_tlb_event()?  Why didn't the tracepoints work for you?

Can this be done in a more arch-generic way?  It's a shame to
unconditionally add counters to the task struct and only use them on
x86.  If someone wanted to generalize the x86 tracepoints, or make them
available to other architectures, I think that would be fine even if
they have to change a bit (queue the inevitable argument about
tracepoint ABI).

P.S. I'm not a fan of the structure member naming.



* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14  7:40   ` Dave Hansen
@ 2022-09-14 11:58     ` Peter Zijlstra
  2022-09-14 14:23       ` Joe Damato
  2022-09-14 14:15     ` Joe Damato
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-14 11:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Joe Damato, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel

On Wed, Sep 14, 2022 at 12:40:55AM -0700, Dave Hansen wrote:
>  Why didn't the tracepoints work for you?

This; perf should be able to get you per-task slices of those events.
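For example (a sketch, assuming the x86 tlb:tlb_flush tracepoint is enabled in the kernel config, perf has the needed privileges, and $PID is the task of interest):

```shell
# Count TLB flush tracepoint hits for one task over a 10-second window.
perf stat -e tlb:tlb_flush -p "$PID" -- sleep 10

# Or record them for later per-task analysis:
perf record -e tlb:tlb_flush -p "$PID" -- sleep 10
perf report
```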



* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14  7:40   ` Dave Hansen
  2022-09-14 11:58     ` Peter Zijlstra
@ 2022-09-14 14:15     ` Joe Damato
  2022-09-14 14:25       ` Joe Damato
  2022-09-15  8:50       ` Peter Zijlstra
  1 sibling, 2 replies; 8+ messages in thread
From: Joe Damato @ 2022-09-14 14:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel

On Wed, Sep 14, 2022 at 12:40:55AM -0700, Dave Hansen wrote:
> On 9/13/22 18:51, Joe Damato wrote:
> > TLB shootdowns are tracked globally, but on a busy system it can be
> > difficult to disambiguate the source of TLB shootdowns.
> > 
> > Add two counter fields:
> > 	- nrtlbflush: number of tlb flush events received
> > 	- ngtlbflush: number of tlb flush events generated
> > 
> > Expose those fields in /proc/[pid]/stat so that they can be analyzed
> > alongside similar metrics (e.g. min_flt and maj_flt).
> 
> On x86 at least, we already have two other ways to count flushes.  You
> even quoted them with your patch:
> 
> >  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> > +	current->ngtlbflush++;
> >  	if (info->end == TLB_FLUSH_ALL)
> >  		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> 
> Granted, the count_vm_tlb...() one is debugging only.  But, did you try
> to use those other mechanisms?  For instance, could you patch
> count_vm_tlb_event()? 

I tried to address this in my cover letter[1], but the count_vm_tlb_event
counters are system-wide, AFAICT. This is useful, certainly, but it's
difficult to know how many TLB shootdowns are being generated by which
tasks without finer granularity. The goal was to account these events on a
per-task basis.

I could patch count_vm_tlb... to account on a per-task basis. That seems
reasonable to me... assuming you and others are convinced that it's a
better approach than tracepoints ;)

> Why didn't the tracepoints work for you?

Tracepoints do work, but IMHO the trouble with them in this case is:

- You need to actually be running perf to gather the data at the right
  time; if you stop running perf too soon, or if the TLB shootdown storm is
  caused by some anomalous event when you weren't running perf... you are
  out of luck.
- On heavily loaded systems with O(10,000) or O(100,000) tasks, perf
  tracepoint data is hard to analyze, events can be dropped, and
  significant resources can be consumed.

In addition to this, there is existing tooling on Linux for scraping
/proc/[pid]/stat for graphing/analysis/etc.

IMO, an easier way to debug large TLB shootdown volumes on a system might
be (using a form of this patch):

1. Examine /proc/[pid]/stat to see which process or processes are
responsible for the majority of the shootdowns. Perhaps you have a script
scraping this data at various intervals and recording deltas.

2. Now that you know the timeline of the events, which processes are
responsible, and the magnitude of the deltas... perf tracepoints can help
you determine when and where exactly they occur.
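The script mentioned in step 1 could be as simple as the following sketch (the field positions and the 5-second interval are assumptions; a kernel carrying this patch puts the counters in the last two fields of /proc/[pid]/stat):

```shell
#!/bin/sh
# Hypothetical: sample the per-task TLB flush counters twice and print
# the delta. Assumes a patched kernel; $1 is the pid to watch.
pid="$1"
s0=$(awk '{print $(NF-1), $NF}' "/proc/$pid/stat")
sleep 5
s1=$(awk '{print $(NF-1), $NF}' "/proc/$pid/stat")
set -- $s0; g0=$1; r0=$2
set -- $s1; g1=$1; r1=$2
echo "pid $pid: generated +$((g1 - g0)) received +$((r1 - r0))"
```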

What do you think?

> Can this be done in a more arch-generic way?  It's a shame to
> unconditionally add counters to the task struct and only use them on
> x86.  If someone wanted to generalize the x86 tracepoints, or make them
> available to other architectures, I think that would be fine even if
> they have to change a bit (queue the inevitable argument about
> tracepoint ABI).

I'm not sure; if I tweaked count_vm_tlb, then any architectures beyond
x86 that gain count_vm_tlb support in the future would automatically get
support for this as well.

> P.S. I'm not a fan of the structure member naming.

Fair enough; I was inspired by nvcsw and nivcsw :) but if you think this
is worth pursuing, I'll use clearer names in the future.

Thanks for taking a look!



* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14 11:58     ` Peter Zijlstra
@ 2022-09-14 14:23       ` Joe Damato
  0 siblings, 0 replies; 8+ messages in thread
From: Joe Damato @ 2022-09-14 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel

On Wed, Sep 14, 2022 at 01:58:27PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 14, 2022 at 12:40:55AM -0700, Dave Hansen wrote:
> >  Why didn't the tracepoints work for you?
> 
> This; perf should be able to get you per-task slices of those events.

Thanks for taking a look; I replied to Dave with a longer form response,
but IMHO, tracepoints are helpful in specific circumstances.

On a heavily loaded system with O(10,000) or O(100,000) tasks, tracepoints
can be difficult to use... especially if the TLB shootdown events are
anomalous events that happen in large bursts at unknown intervals and are
difficult to reproduce.

IMHO, being able to periodically scrape /proc and see that a particular
process has a large TLB shootdown storm can then tell you when to apply
perf (and to which specific tasks) in order to debug the issue.



* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14 14:15     ` Joe Damato
@ 2022-09-14 14:25       ` Joe Damato
  2022-09-15  8:50       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Joe Damato @ 2022-09-14 14:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel

On Wed, Sep 14, 2022 at 07:15:07AM -0700, Joe Damato wrote:
> On Wed, Sep 14, 2022 at 12:40:55AM -0700, Dave Hansen wrote:
> > On 9/13/22 18:51, Joe Damato wrote:
> > > TLB shootdowns are tracked globally, but on a busy system it can be
> > > difficult to disambiguate the source of TLB shootdowns.
> > > 
> > > Add two counter fields:
> > > 	- nrtlbflush: number of tlb flush events received
> > > 	- ngtlbflush: number of tlb flush events generated
> > > 
> > > Expose those fields in /proc/[pid]/stat so that they can be analyzed
> > > alongside similar metrics (e.g. min_flt and maj_flt).
> > 
> > On x86 at least, we already have two other ways to count flushes.  You
> > even quoted them with your patch:
> > 
> > >  	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
> > > +	current->ngtlbflush++;
> > >  	if (info->end == TLB_FLUSH_ALL)
> > >  		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> > 
> > Granted, the count_vm_tlb...() one is debugging only.  But, did you try
> > to use those other mechanisms?  For instance, could you patch
> > count_vm_tlb_event()? 
> 
> I tried to address this in my cover letter[1]...

Err, I forgot the [1]:
https://lore.kernel.org/linux-mm/1663120270-2673-1-git-send-email-jdamato@fastly.com/



* Re: [RFC 1/1] mm: Add per-task struct tlb counters
  2022-09-14 14:15     ` Joe Damato
  2022-09-14 14:25       ` Joe Damato
@ 2022-09-15  8:50       ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2022-09-15  8:50 UTC (permalink / raw)
  To: Joe Damato
  Cc: Dave Hansen, x86, linux-mm, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, linux-kernel, linux-fsdevel

On Wed, Sep 14, 2022 at 07:15:08AM -0700, Joe Damato wrote:

> I could patch count_vm_tlb... to account on a per-task basis. That seems
> reasonable to me... assuming you and others are convinced that it's a
> better approach than tracepoints ;)

Well, we *could* do a lot of things, but we can all spend our cycles
only once. Doing endless variations of statistics contributes to
death-by-a-thousand-cuts.

If you really think you need this, write yourself an eBPF program and
attach it to these tracepoints. Then you get fewer cycles for useful
work, but the rest of us aren't bothered by it.
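A sketch of that suggestion using bpftrace (assuming bpftrace is installed, the tracepoint is available, and the command is run with root privileges):

```shell
# Per-task counts of TLB flush tracepoint hits, printed on Ctrl-C.
bpftrace -e 'tracepoint:tlb:tlb_flush { @flushes[comm, pid] = count(); }'
```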



