[v3,rcu-dev] rcuperf: Measure memory footprint during kfree_rcu() test

Message ID 20191219211349.235877-1-joel@joelfernandes.org
State New

Commit Message

Joel Fernandes Dec. 19, 2019, 9:13 p.m. UTC
During changes to the kfree_rcu() code, we often check the amount of free
memory.  As an alternative to checking this manually, this commit adds the
measurement to the test itself.  It samples the available memory four
times during the test, digitally filters these samples into a running
average with a weight of 0.5, and compares this filtered value with the
amount of available memory at the beginning of the test.

The digital filter is applied only once the test is more than 25%
complete. At the 25% mark, we just read the available memory without
filtering. This prevents the first sample from skewing the results,
since memory readings taken before the test's memory was allocated are
not considered.
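As a rough illustration, the filtering scheme above can be sketched as a
userspace model in plain C (the function name and sample values here are
hypothetical; the real code samples si_mem_available() inside the test loop):

```c
#include <stddef.h>

/* Userspace sketch of the patch's running-average filter: the first
 * sample (taken at the 25% mark) is used as-is, and every later
 * sample is averaged into the running value with a weight of 0.5. */
static long long filter_mem_samples(const long long *samples, size_t n)
{
	long long mem_during = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == 0)
			mem_during = samples[i];	/* raw read, no filtering */
		else
			mem_during = (mem_during + samples[i]) / 2;
	}
	return mem_during;
}
```

With samples 400, 380, 360, 420 the running value goes 400, 390, 375, 397,
so later samples dominate while earlier ones decay geometrically.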

A sample run shows something like:

Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>


---
v1->v2: Minor corrections
v2->v3: Use long long to prevent overflow on 32-bit systems.
	Handle the case where some threads start later than others.
	Start measuring only once 25% into the test, for slightly better accuracy.

Cc: bristot@redhat.com
Cc: frextrite@gmail.com
Cc: madhuparnabhowmik04@gmail.com
Cc: urezki@gmail.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

 kernel/rcu/rcuperf.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

Comments

Paul E. McKenney Dec. 21, 2019, 12:07 a.m. UTC | #1
On Thu, Dec 19, 2019 at 04:13:49PM -0500, Joel Fernandes (Google) wrote:
> During changes to kfree_rcu() code, we often check the amount of free
> memory.  As an alternative to checking this manually, this commit adds a
> measurement in the test itself.  It measures four times during the test
> for available memory, digitally filters these measurements to produce a
> running average with a weight of 0.5, and compares this digitally
> filtered value with the amount of available memory at the beginning of
> the test.
> 
> We apply the digital filter only once we are more than 25% into the
> test. At the 25% mark, we just read available memory and don't apply any
> filtering. This prevents the first sample from skewing the results
> as we would not consider memory readings that were before memory was
> allocated.
> 
> A sample run shows something like:
> 
> Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Much better!  A few comments below.

							Thanx, Paul

> ---
> v1->v2: Minor corrections
> v1->v3: Use long long to prevent 32-bit system's overflow
> 	Handle case where some threads start later than others.
> 	Start measuring only once 25% into the test. Slightly more accurate.
> 
> Cc: bristot@redhat.com
> Cc: frextrite@gmail.com
> Cc: madhuparnabhowmik04@gmail.com
> Cc: urezki@gmail.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
>  kernel/rcu/rcuperf.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index da94b89cd531..67e0f804ea97 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -12,6 +12,7 @@
>  #include <linux/types.h>
>  #include <linux/kernel.h>
>  #include <linux/init.h>
> +#include <linux/mm.h>
>  #include <linux/module.h>
>  #include <linux/kthread.h>
>  #include <linux/err.h>
> @@ -604,6 +605,8 @@ struct kfree_obj {
>  	struct rcu_head rh;
>  };
>  
> +long long mem_begin;
> +
>  static int
>  kfree_perf_thread(void *arg)
>  {
> @@ -611,6 +614,7 @@ kfree_perf_thread(void *arg)
>  	long me = (long)arg;
>  	struct kfree_obj *alloc_ptr;
>  	u64 start_time, end_time;
> +	long long mem_during = si_mem_available();

You initialize here, which makes quite a bit of sense...

>  	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
>  	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> @@ -626,6 +630,15 @@ kfree_perf_thread(void *arg)
>  	}
>  
>  	do {
> +		// Moving average of memory availability measurements.
> +		// Start measuring only from when we are at least 25% into the test.
> +		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
> +			if (loop == (kfree_loops / 4))
> +				mem_during = si_mem_available();

But then you reinitialize here.  Perhaps to avoid the compiler being
confused into complaining about uninitialized variables?  (But if so,
please comment it.)

The thing is that by the fourth measurement, the initial influence has
been diluted by a factor of something like 16 or 32, correct?  I don't
believe that your measurements are any more accurate than that, given the
bursty nature of the RCU reclamation process.  So why not just initialize
it and average it at each sample point?

If you want more accuracy, you could increase the number of sample
points, while changing the filter constants to more heavily weight
past measurements.

Actually, I strongly suggest recording frequent raw data, and applying
various filtering techniques offline to see what works best.  I might
be mistaken, but it feels like you are shooting in the dark here.
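For a rough sense of the dilution discussed here: with a 0.5 weight, each
averaging step halves the influence of everything before it, so after k
steps the first sample retains a weight of 2^-k; with four sample points
(one raw read plus three averaging steps) that is 1/8. A hypothetical
sketch (names invented for illustration) of that decay and of the residual
skew a ~200MB initial delta would leave:

```c
/* Weight remaining on the very first sample after a number of
 * 0.5-weight averaging steps: each step halves it, so the weight
 * is 2^-steps.  With four sample points (one raw read plus three
 * averaging steps) the first sample still carries 1/8. */
static double first_sample_weight(int averaging_steps)
{
	double w = 1.0;

	while (averaging_steps-- > 0)
		w *= 0.5;
	return w;
}

/* Residual skew (in MB) that an initial delta would contribute
 * after being diluted by the filter. */
static double residual_skew_mb(double delta_mb, int averaging_steps)
{
	return delta_mb * first_sample_weight(averaging_steps);
}
```

Under these assumptions a 200MB starting delta still contributes 25MB of
skew after three averaging steps, which is the tension in the exchange below.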

> +			else
> +				mem_during = (mem_during + si_mem_available()) / 2;
> +		}
> +
>  		for (i = 0; i < kfree_alloc_num; i++) {
>  			alloc_ptr = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL);
>  			if (!alloc_ptr)
> @@ -645,9 +658,13 @@ kfree_perf_thread(void *arg)
>  		else
>  			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
>  
> -		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> +		// The "memory footprint" field represents how much in-flight
> +		// memory is allocated during the test and waiting to be freed.

This added comment is much better, thank you!

> +		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld, memory footprint: %lldMB\n",
>  		       (unsigned long long)(end_time - start_time), kfree_loops,
> -		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> +		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started),
> +		       (mem_begin - mem_during) >> (20 - PAGE_SHIFT));
> +
>  		if (shutdown) {
>  			smp_mb(); /* Assign before wake. */
>  			wake_up(&shutdown_wq);
> @@ -719,6 +736,8 @@ kfree_perf_init(void)
>  		goto unwind;
>  	}
>  
> +	mem_begin = si_mem_available();
> +
>  	for (i = 0; i < kfree_nrealthreads; i++) {
>  		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
>  						  kfree_reader_tasks[i]);
> -- 
> 2.24.1.735.g03f4e72817-goog
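As an aside on the `>> (20 - PAGE_SHIFT)` expression in the patch above: it
converts a page count to megabytes, since a page is 2^PAGE_SHIFT bytes and
a megabyte is 2^20 bytes. A userspace sketch, assuming the common
PAGE_SHIFT of 12 (4KB pages; the kernel macro is per-architecture):

```c
/* Hypothetical stand-in for the kernel's PAGE_SHIFT (4KB pages). */
#define SKETCH_PAGE_SHIFT 12

/* Convert a count of pages to whole megabytes, mirroring the patch's
 * (mem_begin - mem_during) >> (20 - PAGE_SHIFT): shifting right by
 * (20 - PAGE_SHIFT) divides by the 2^(20 - PAGE_SHIFT) pages that
 * make up one megabyte. */
static long long pages_to_mb(long long pages)
{
	return pages >> (20 - SKETCH_PAGE_SHIFT);
}
```

For example, 256 four-kilobyte pages shift down to exactly 1MB.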
Joel Fernandes Dec. 21, 2019, 3:37 a.m. UTC | #2
On Fri, Dec 20, 2019 at 04:07:29PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 19, 2019 at 04:13:49PM -0500, Joel Fernandes (Google) wrote:
[snip]
> > @@ -626,6 +630,15 @@ kfree_perf_thread(void *arg)
> >  	}
> >  
> >  	do {
> > +		// Moving average of memory availability measurements.
> > +		// Start measuring only from when we are at least 25% into the test.
> > +		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
> > +			if (loop == (kfree_loops / 4))
> > +				mem_during = si_mem_available();
> 
> But then you reinitialize here.  Perhaps to avoid the compiler being
> confused into complaining about uninitialized variables?  (But if so,
> please comment it.)

It is reinitialized here because, if kfree_loops is < 4, mem_during still
needs to hold a sane value to keep (mem_begin - mem_during) from falsely
appearing quite large. That is why I initialized it at the beginning. If
kfree_loops is >= 4, then yes, it will be initialized twice, but that is OK.

I can add a comment to the earlier initialization if you like.

If that is the only remaining comment on the patch, I would appreciate it
if you could change:
 + long long mem_during = si_mem_available();
to
 + long long mem_during = si_mem_available(); // for kfree_loops < 4 case

> The thing is that by the fourth measurement, the initial influence has
> been diluted by a factor of something like 16 or 32, correct?  I don't
> believe that your measurements are any more accurate than that, given the
> bursty nature of the RCU reclamation process.  So why not just initialize
> it and average it at each sample point?

Yes, but a 200MB delta diluted by a factor of 16 is still high and causes a skew.

> 
> If you want more accuracy, you could increase the number of sample
> points, while changing the filter constants to more heavily weight
> past measurements.

At the moment the results are showing consistent numbers in my tests and
I'm quite happy with them. And it is better to have some measurement than
none at all.

We can certainly refine it further, but at this time I am thinking of
spending my time reviewing Lai's patches and catching up on some other RCU
things. If you hate this patch too much, we can also defer its review for
a bit and I can carry it in my tree for now, since it only touches test
code. But honestly, in its current form I am sort of happy with it.

> Actually, I strongly suggest recording frequent raw data, and applying
> various filtering techniques offline to see what works best.  I might
> be mistaken, but it feels like you are shooting in the dark here.

The number I am aiming for is close to a constant. For example, in one run
I see the available memory go down from 460000 pages to ~400000 pages,
which is around ~240MB of memory. This deviates by +-20MB. I would rather
take many samples than just one, and apply some form of reduction to
combine them into a single value.  My current filtering works well in my
testing and I could do more complex things, but then again, at the moment
I am happy with the test. I am open to any other filtering-related
suggestions though!

Sorry if I missed something!

thanks,

 - Joel


[snip]
Paul E. McKenney Jan. 6, 2020, 7:52 p.m. UTC | #3
On Fri, Dec 20, 2019 at 10:37:14PM -0500, Joel Fernandes wrote:
> On Fri, Dec 20, 2019 at 04:07:29PM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 19, 2019 at 04:13:49PM -0500, Joel Fernandes (Google) wrote:
[snip]
> > > @@ -626,6 +630,15 @@ kfree_perf_thread(void *arg)
> > >  	}
> > >  
> > >  	do {
> > > +		// Moving average of memory availability measurements.
> > > +		// Start measuring only from when we are at least 25% into the test.
> > > +		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
> > > +			if (loop == (kfree_loops / 4))
> > > +				mem_during = si_mem_available();
> > 
> > But then you reinitialize here.  Perhaps to avoid the compiler being
> > confused into complaining about uninitialized variables?  (But if so,
> > please comment it.)
> 
> It is reinitialized here like that, because if kfree_loops is < 4, then
> mem_during needs to hold some value to avoid the (mem_begin - mem_during) to
> falsely appear quite large. That's why I initialized it in the beginning. If
> kfree_loops is >= 4, then yes it will be initialized twice but that's Ok.
> 
> I can add a comment to the earlier initialization if you like.

Could we just force kfree_loops >= 4?  Complain if not, set it to 4?

> If that is the only comment on the patch left, I appreciate if you can
> change:
>  + long long mem_during = si_mem_available();
>  to 
>  + long long mem_during = si_mem_available(); // for kfree_loops < 4 case

Let's see what we end up with.

> > The thing is that by the fourth measurement, the initial influence has
> > been diluted by a factor of something like 16 or 32, correct?  I don't
> > believe that your measurements are any more accurate than that, given the
> > bursty nature of the RCU reclamation process.  So why not just initialize
> > it and average it at each sample point?
> 
> Yes but diluting 200MB of delta by 16 is still high and causes a skew.

You get similar errors by only sampling four times, though.  Assuming a
reasonably long test run compared to the typical grace-period duration,
each of the four samples has a 50% chance of being above the median,
thus a 1/16 chance of all four samples being above the median.
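The 1/16 figure follows directly: if each of n independent samples lands
above the median with probability 1/2, then all of them do with
probability (1/2)^n, which is 1/16 for n = 4 (and likewise 1/16 for all
four landing below). A small sketch of that arithmetic:

```c
/* Probability that all n independent samples land on one given
 * side of the median, each side being equally likely: (1/2)^n.
 * For n = 4 this is 1/16 = 0.0625, the chance that a four-sample
 * average is biased entirely high. */
static double prob_all_one_side(int n)
{
	double p = 1.0;

	while (n-- > 0)
		p *= 0.5;
	return p;
}
```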

> > If you want more accuracy, you could increase the number of sample
> > points, while changing the filter constants to more heavily weight
> > past measurements.
> 
> At the moment the results are showing me consistent numbers in my tests and
> I'm quite happy with them. And it is better to have some measurement, than
> not having anything.

No argument in general, but consistent numbers do not rule out the
possibility of a deterministic bias.

> We can certainly refine it further but at this time I am thinking of spending
> my time reviewing Lai's patches and learning some other RCU things I need to
> catch up on. If you hate this patch too much, we can also defer this patch
> review for a bit and I can carry it in my tree for now as it is only a patch
> to test code. But honestly, in its current form I am sort of happy with it.

OK, I will keep it as is for now and let's look again later on.  It is not
in the bucket for the upcoming merge window in any case, so we do have
quite a bit of time.

It is not that I hate it, but rather that I want to be able to give
good answers to questions that might come up.  And given that I have
occasionally given certain people a hard time about their statistics,
it is only reasonable to expect them to return the favor.  I wouldn't
want you to be caught in the crossfire.  ;-)

> > Actually, I strongly suggest recording frequent raw data, and applying
> > various filtering techniques offline to see what works best.  I might
> > be mistaken, but it feels like you are shooting in the dark here.
> 
> The number I am aiming for is close to a constant. For example in a run, I
> see the available memory goes down from 460000 pages to ~ 400000 pages. That
> is around ~240MB of memory. This deviates by +- 20MB. I would rather take
> many samples than just 1 sample and apply some form of reduction to combine
> them into a single sample.  My current filtering works well in my testing and
> I could do more complex things, but then again at the moment I am happy with
> the test. I am open to any other filtering related suggestions though!
> 
> Sorry if I missed something!

My thoughts are to use an explicit flag to catch the first-time case,
take something like 30 samples as the default, and restructure the loop
a bit to reduce at least some of the indentation.

But let's keep it on the back burner for a bit while we get other things
sorted out.

							Thanx, Paul
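A hypothetical sketch of the restructuring floated above: an explicit
first-sample flag instead of the loop-count special case, plus a
configurable sample count (30 as the suggested default). All names are
invented for illustration; mem_filter_update() would be fed
si_mem_available() readings in the real code:

```c
#include <stdbool.h>

/* Suggested default number of sample points. */
#define MEM_SAMPLES_DEFAULT 30

/* Running-average state with an explicit first-time flag, so the
 * caller no longer needs nested loop-count checks to detect the
 * first sample. */
struct mem_filter {
	long long value;
	bool primed;
};

static void mem_filter_init(struct mem_filter *f)
{
	f->value = 0;
	f->primed = false;
}

static void mem_filter_update(struct mem_filter *f, long long sample)
{
	if (!f->primed) {		/* first sample: take it raw */
		f->value = sample;
		f->primed = true;
		return;
	}
	f->value = (f->value + sample) / 2;
}
```

The flag replaces the `loop == (kfree_loops / 4)` comparison, and the
sampling cadence becomes a separate, tunable decision in the caller.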
Joel Fernandes Jan. 15, 2020, 10:03 p.m. UTC | #4
On Mon, Jan 06, 2020 at 11:52:00AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 20, 2019 at 10:37:14PM -0500, Joel Fernandes wrote:
> > On Fri, Dec 20, 2019 at 04:07:29PM -0800, Paul E. McKenney wrote:
> > > On Thu, Dec 19, 2019 at 04:13:49PM -0500, Joel Fernandes (Google) wrote:
> > > > During changes to kfree_rcu() code, we often check the amount of free
> > > > memory.  As an alternative to checking this manually, this commit adds a
> > > > measurement in the test itself.  It measures four times during the test
> > > > for available memory, digitally filters these measurements to produce a
> > > > running average with a weight of 0.5, and compares this digitally
> > > > filtered value with the amount of available memory at the beginning of
> > > > the test.
> > > > 
> > > > We apply the digital filter only once we are more than 25% into the
> > > > test. At the 25% mark, we just read available memory and don't apply any
> > > > filtering. This prevents the first sample from skewing the results
> > > > as we would not consider memory readings that were before memory was
> > > > allocated.
> > > > 
> > > > A sample run shows something like:
> > > > 
> > > > Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > Much better!  A few comments below.
> > > 
> > > 							Thanx, Paul
> > > 
> > > > ---
> > > > v1->v2: Minor corrections
> > > > v1->v3: Use long long to prevent 32-bit system's overflow
> > > > 	Handle case where some threads start later than others.
> > > > 	Start measuring only once 25% into the test. Slightly more accurate.
> > > > 
> > > > Cc: bristot@redhat.com
> > > > Cc: frextrite@gmail.com
> > > > Cc: madhuparnabhowmik04@gmail.com
> > > > Cc: urezki@gmail.com
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > 
> > > >  kernel/rcu/rcuperf.c | 23 +++++++++++++++++++++--
> > > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> > > > index da94b89cd531..67e0f804ea97 100644
> > > > --- a/kernel/rcu/rcuperf.c
> > > > +++ b/kernel/rcu/rcuperf.c
> > > > @@ -12,6 +12,7 @@
> > > >  #include <linux/types.h>
> > > >  #include <linux/kernel.h>
> > > >  #include <linux/init.h>
> > > > +#include <linux/mm.h>
> > > >  #include <linux/module.h>
> > > >  #include <linux/kthread.h>
> > > >  #include <linux/err.h>
> > > > @@ -604,6 +605,8 @@ struct kfree_obj {
> > > >  	struct rcu_head rh;
> > > >  };
> > > >  
> > > > +long long mem_begin;
> > > > +
> > > >  static int
> > > >  kfree_perf_thread(void *arg)
> > > >  {
> > > > @@ -611,6 +614,7 @@ kfree_perf_thread(void *arg)
> > > >  	long me = (long)arg;
> > > >  	struct kfree_obj *alloc_ptr;
> > > >  	u64 start_time, end_time;
> > > > +	long long mem_during = si_mem_available();
> > > 
> > > You initialize here, which makes quite a bit of sense...
> > > 
> > > >  	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > > >  	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > > > @@ -626,6 +630,15 @@ kfree_perf_thread(void *arg)
> > > >  	}
> > > >  
> > > >  	do {
> > > > +		// Moving average of memory availability measurements.
> > > > +		// Start measuring only from when we are at least 25% into the test.
> > > > +		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
> > > > +			if (loop == (kfree_loops / 4))
> > > > +				mem_during = si_mem_available();
> > > 
> > > But then you reinitialize here.  Perhaps to avoid the compiler being
> > > confused into complaining about uninitialized variables?  (But if so,
> > > please comment it.)
> > 
> > It is reinitialized here like that, because if kfree_loops is < 4, then
> > mem_during needs to hold some value to avoid the (mem_begin - mem_during) to
> > falsely appear quite large. That's why I initialized it in the beginning. If
> > kfree_loops is >= 4, then yes it will be initialized twice but that's Ok.
> > 
> > I can add a comment to the earlier initialization if you like.
> 
> Could we just force kfree_loops >= 4?  Complain if not, set it to 4?

Sure.

> > > The thing is that by the fourth measurement, the initial influence has
> > > been diluted by a factor of something like 16 or 32, correct?  I don't
> > > believe that your measurements are any more accurate than that, given the
> > > bursty nature of the RCU reclamation process.  So why not just initialize
> > > it and average it at each sample point?
> > 
> > Yes but diluting 200MB of delta by 16 is still high and causes a skew.
> 
> You get similar errors by only sampling four times, though.  Assuming a
> reasonably long test run compared to the typical grace-period duration,
> each of the four samples has a 50% chance of being above the median,
> thus a 1/16 chance of all four samples being above the median.

[snip]
> > We can certainly refine it further but at this time I am thinking of spending
> > my time reviewing Lai's patches and learning some other RCU things I need to
> > catch up on. If you hate this patch too much, we can also defer this patch
> > review for a bit and I can carry it in my tree for now as it is only a patch
> > to test code. But honestly, in its current form I am sort of happy with it.
> 
> OK, I will keep it as is for now and let's look again later on.  It is not
> in the bucket for the upcoming merge window in any case, so we do have
> quite a bit of time.
> 
> It is not that I hate it, but rather that I want to be able to give
> good answers to questions that might come up.  And given that I have
> occasionally given certain people a hard time about their statistics,
> it is only reasonable to expect them to return the favor.  I wouldn't
> want you to be caught in the crossfire.  ;-)

Since the weights were concerning, I was thinking of just using a weight of
(1 / N) where N is the number of samples. Essentially taking the average.
That could be simple enough and does not cause your concerns with weight
tuning. I tested it and looks good, I'll post it shortly.

thanks,

 - Joel
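The 1/N weighting described above is just an incremental arithmetic mean,
which can be updated without storing the samples (function name
hypothetical):

```c
/* Fold one more sample into a running arithmetic mean.  n_prev is
 * the number of samples already folded into avg; after the call the
 * result is the plain average of all n_prev + 1 samples, i.e. a
 * running filter whose weight on each new sample is 1 / N. */
static double running_mean_update(double avg, long long n_prev, double sample)
{
	return avg + (sample - avg) / (double)(n_prev + 1);
}
```

For samples 10, 20, 30 the running value goes 10, 15, 20, matching the
plain average at every step, so there is no weight constant left to tune.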
Paul E. McKenney Jan. 15, 2020, 10:42 p.m. UTC | #5
On Wed, Jan 15, 2020 at 05:03:00PM -0500, Joel Fernandes wrote:
> On Mon, Jan 06, 2020 at 11:52:00AM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 20, 2019 at 10:37:14PM -0500, Joel Fernandes wrote:
> > > On Fri, Dec 20, 2019 at 04:07:29PM -0800, Paul E. McKenney wrote:
> > > > On Thu, Dec 19, 2019 at 04:13:49PM -0500, Joel Fernandes (Google) wrote:
[snip]
> > > > > @@ -626,6 +630,15 @@ kfree_perf_thread(void *arg)
> > > > >  	}
> > > > >  
> > > > >  	do {
> > > > > +		// Moving average of memory availability measurements.
> > > > > +		// Start measuring only from when we are at least 25% into the test.
> > > > > +		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
> > > > > +			if (loop == (kfree_loops / 4))
> > > > > +				mem_during = si_mem_available();
> > > > 
> > > > But then you reinitialize here.  Perhaps to avoid the compiler being
> > > > confused into complaining about uninitialized variables?  (But if so,
> > > > please comment it.)
> > > 
> > > It is reinitialized here like that because, if kfree_loops is < 4,
> > > mem_during still needs to hold some value to keep (mem_begin - mem_during)
> > > from falsely appearing quite large. That's why I initialized it at the
> > > beginning. If kfree_loops is >= 4, then yes, it will be initialized twice,
> > > but that's OK.
> > > 
> > > I can add a comment to the earlier initialization if you like.
> > 
> > Could we just force kfree_loops >= 4?  Complain if not, set it to 4?
> 
> Sure.
> 
> > > > The thing is that by the fourth measurement, the initial influence has
> > > > been diluted by a factor of something like 16 or 32, correct?  I don't
> > > > believe that your measurements are any more accurate than that, given the
> > > > bursty nature of the RCU reclamation process.  So why not just initialize
> > > > it and average it at each sample point?
> > > 
> > > Yes but diluting 200MB of delta by 16 is still high and causes a skew.
> > 
> > You get similar errors by only sampling four times, though.  Assuming a
> > reasonably long test run compared to the typical grace-period duration,
> > each of the four samples has a 50% chance of being above the median,
> > thus a 1/16 chance of all four samples being above the median.
> 
> [snip]
> > > We can certainly refine it further but at this time I am thinking of spending
> > > my time reviewing Lai's patches and learning some other RCU things I need to
> > > catch up on. If you hate this patch too much, we can also defer this patch
> > > review for a bit and I can carry it in my tree for now as it is only a patch
> > > to test code. But honestly, in its current form I am sort of happy with it.
> > 
> > OK, I will keep it as is for now and let's look again later on.  It is not
> > in the bucket for the upcoming merge window in any case, so we do have
> > quite a bit of time.
> > 
> > It is not that I hate it, but rather that I want to be able to give
> > good answers to questions that might come up.  And given that I have
> > occasionally given certain people a hard time about their statistics,
> > it is only reasonable to expect them to return the favor.  I wouldn't
> > want you to be caught in the crossfire.  ;-)
> 
> Since the weights were concerning, I was thinking of just using a weight of
> (1 / N) where N is the number of samples. Essentially taking the average.
> That could be simple enough and does not cause your concerns with weight
> tuning. I tested it and looks good, I'll post it shortly.

YES!!!  ;-)

Snapshot mem_begin before entering the loop.  For the mean value to
be solid, you need at least 20-30 samples, which might mean upping the
default for kfree_loops.  Have an "unsigned long long" to accumulate the
sum, which should avoid any possibility of overflow for current systems
and for all systems for kfree_loops less than PAGE_SIZE.  At which point,
forget the "%" stuff and just sum up the si_mem_available() on each pass
through the loop.

Do the division on exit from the loop, preferably checking for divide
by zero.

Straightforward, fast, reasonably reliable, and easy to defend.

							Thanx, Paul
Joel Fernandes Jan. 15, 2020, 10:45 p.m. UTC | #6
On Wed, Jan 15, 2020 at 02:42:51PM -0800, Paul E. McKenney wrote:
> > [snip]
> > > [snip]
> > 
> > Since the weights were concerning, I was thinking of just using a weight of
> > (1 / N) where N is the number of samples. Essentially taking the average.
> > That could be simple enough and does not cause your concerns with weight
> > tuning. I tested it and looks good, I'll post it shortly.
> 
> YES!!!  ;-)
> 
> Snapshot mem_begin before entering the loop.  For the mean value to
> be solid, you need at least 20-30 samples, which might mean upping the
> default for kfree_loops.  Have an "unsigned long long" to accumulate the
> sum, which should avoid any possibility of overflow for current systems
> and for all systems for kfree_loops less than PAGE_SIZE.  At which point,
> forget the "%" stuff and just sum up the si_mem_available() on each pass
> through the loop.
> 
> Do the division on exit from the loop, preferably checking for divide
> by zero.
> 
> Straightforward, fast, reasonably reliable, and easy to defend.

I mostly did it along these lines. Hopefully the latest posting is reasonable
enough ;-) I sent it twice because I messed up the authorship (sorry).

thanks,

 - Joel
Paul E. McKenney Jan. 16, 2020, 12:01 a.m. UTC | #7
On Wed, Jan 15, 2020 at 05:45:42PM -0500, Joel Fernandes wrote:
> On Wed, Jan 15, 2020 at 02:42:51PM -0800, Paul E. McKenney wrote:
> > [snip]
> 
> I mostly did it along these lines. Hopefully the latest posting is reasonable
> enough ;-) I sent it twice because I messed up the authorship (sorry).

No problem with the authorship-fix resend!

But let's get this patch consistent with basic statistics!

You really do need 20-30 samples for an average to mean much.

Of course, right now you default kfree_loops to 10.  You are doing
8000 kmalloc()/kfree_rcu() loops on each pass.  This is large enough
that just dropping the "% 4" should be just fine from the viewpoint of
si_mem_available() overhead.  But 8000 allocations and frees should get
done in way less than one second, so kicking the default kfree_loops up
to 30 should be a non-problem.

Then the patch would be both simpler and statistically valid.

So could you please stop using me as the middleman in your fight with
the laws of mathematics and get this patch to a defensible state?  ;-)

							Thanx, Paul
Joel Fernandes Jan. 16, 2020, 4:05 a.m. UTC | #8
On Wed, Jan 15, 2020 at 04:01:04PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 15, 2020 at 05:45:42PM -0500, Joel Fernandes wrote:
> > On Wed, Jan 15, 2020 at 02:42:51PM -0800, Paul E. McKenney wrote:
> > > [snip]
> > 
> > I mostly did it along these lines. Hopefully the latest posting is reasonable
> > enough ;-) I sent it twice because I messed up the authorship (sorry).
> 
> No problem with the authorship-fix resend!
> 
> But let's get this patch consistent with basic statistics!
> 
> You really do need 20-30 samples for an average to mean much.
> 
> Of course, right now you default kfree_loops to 10.  You are doing
> 8000 kmalloc()/kfree_rcu() loops on each pass.  This is large enough
> that just dropping the "% 4" should be just fine from the viewpoint of
> si_mem_available() overhead.  But 8000 allocations and frees should get
> done in way less than one second, so kicking the default kfree_loops up
> to 30 should be a non-problem.
> 
> Then the patch would be both simpler and statistically valid.
> 
> So could you please stop using me as the middleman in your fight with
> the laws of mathematics and get this patch to a defensible state?  ;-)

The thing is the signal doesn't vary much. I could very well just take one
out of the 4 samples and report that. But I still took the average since
there are 4 samples. I don't see much point in taking more samples here since
I am not concerned that the signal will fluctuate much (and if it really
does, then I can easily catch that kind of variation with multiple rcuperf
runs).

But if you really want though, I can increase the sampling to 20 samples or a
number like that and resend it.

thanks,

 - Joel
Joel Fernandes Jan. 16, 2020, 7:25 p.m. UTC | #9
On Wed, Jan 15, 2020 at 11:05:58PM -0500, Joel Fernandes wrote:
> On Wed, Jan 15, 2020 at 04:01:04PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 15, 2020 at 05:45:42PM -0500, Joel Fernandes wrote:
> > > On Wed, Jan 15, 2020 at 02:42:51PM -0800, Paul E. McKenney wrote:
> > > > [snip]
> > > 
> > > I mostly did it along these lines. Hopefully the latest posting is reasonable
> > > enough ;-) I sent it twice because I messed up the authorship (sorry).
> > 
> > No problem with the authorship-fix resend!
> > 
> > But let's get this patch consistent with basic statistics!
> > 
> > You really do need 20-30 samples for an average to mean much.
> > 
> > Of course, right now you default kfree_loops to 10.  You are doing
> > 8000 kmalloc()/kfree_rcu() loops on each pass.  This is large enough
> > that just dropping the "% 4" should be just fine from the viewpoint of
> > si_mem_available() overhead.  But 8000 allocations and frees should get
> > done in way less than one second, so kicking the default kfree_loops up
> > to 30 should be a non-problem.
> > 
> > Then the patch would be both simpler and statistically valid.
> > 
> > So could you please stop using me as the middleman in your fight with
> > the laws of mathematics and get this patch to a defensible state?  ;-)
> 
> The thing is the signal doesn't vary much. I could very well just take one
> out of the 4 samples and report that. But I still took the average since
> there are 4 samples. I don't see much point in taking more samples here since
> I am not concerned that the signal will fluctuate much (and if it really
> does, then I can easily catch that kind of variation with multiple rcuperf
> runs).
> 
> But if you really want though, I can increase the sampling to 20 samples or a
> number like that and resend it.

Changed it to around 20 samples and the tests look fine. Updated patch below (v5):

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] rcuperf: Measure memory footprint during kfree_rcu() test (v5)

During changes to kfree_rcu() code, we often check the amount of free
memory.  As an alternative to checking this manually, this commit adds a
measurement in the test itself.  It samples the amount of available
memory roughly twenty times during the test, takes a simple average of
these samples, and compares that average with the amount of available
memory at the beginning of the test.

Sampling starts only once the test is under way, so that readings taken
before any memory was allocated do not skew the result.

A sample run shows something like:

Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

---
v1->v2: Minor corrections
v1->v3: Use long long to prevent 32-bit system's overflow
	Handle case where some threads start later than others.
	Start measuring only once 25% into the test. Slightly more accurate.
v3->v4: Simplified test more. Using simple average.
v4->v5: Collect more memory samples than the previous revision (20 to 21 samples, up from 4 to 5).

 kernel/rcu/rcuperf.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 1fd0cc72022e..284625af4c79 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/err.h>
@@ -593,7 +594,7 @@ rcu_perf_shutdown(void *arg)
 
 torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
 torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
-torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
+torture_param(int, kfree_loops, 20, "Number of loops doing kfree_alloc_num allocations and frees.");
 
 static struct task_struct **kfree_reader_tasks;
 static int kfree_nrealthreads;
@@ -616,14 +617,17 @@ DEFINE_KFREE_OBJ(32); // goes on kmalloc-64 slab
 DEFINE_KFREE_OBJ(64); // goes on kmalloc-96 slab
 DEFINE_KFREE_OBJ(96); // goes on kmalloc-128 slab
 
+long long mem_begin;
+
 static int
 kfree_perf_thread(void *arg)
 {
 	int i, loop = 0;
 	long me = (long)arg;
 	void *alloc_ptr;
-
 	u64 start_time, end_time;
+	long mem_samples = 0;
+	long long mem_during = 0;
 
 	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
 	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
@@ -638,7 +642,17 @@ kfree_perf_thread(void *arg)
 			b_rcu_gp_test_started = cur_ops->get_gp_seq();
 	}
 
+	// Prevent "% 0" error below.
+	if (kfree_loops < 20)
+		kfree_loops = 20;
+
 	do {
+		// Start measuring only from when we are a little into the test.
+		if ((loop != 0) && (loop % (kfree_loops / 20) == 0)) {
+			mem_during = mem_during + si_mem_available();
+			mem_samples++;
+		}
+
 		for (i = 0; i < kfree_alloc_num; i++) {
 			int kfree_type = i % 4;
 
@@ -671,6 +685,8 @@ kfree_perf_thread(void *arg)
 		cond_resched();
 	} while (!torture_must_stop() && ++loop < kfree_loops);
 
+	mem_during = (mem_during / mem_samples);
+
 	if (atomic_inc_return(&n_kfree_perf_thread_ended) >= kfree_nrealthreads) {
 		end_time = ktime_get_mono_fast_ns();
 
@@ -679,9 +695,13 @@ kfree_perf_thread(void *arg)
 		else
 			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
 
-		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
+		// The "memory footprint" field represents how much in-flight
+		// memory is allocated during the test and waiting to be freed.
+		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld, memory footprint: %lldMB\n",
 		       (unsigned long long)(end_time - start_time), kfree_loops,
-		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
+		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started),
+		       (mem_begin - mem_during) >> (20 - PAGE_SHIFT));
+
 		if (shutdown) {
 			smp_mb(); /* Assign before wake. */
 			wake_up(&shutdown_wq);
@@ -753,6 +773,8 @@ kfree_perf_init(void)
 		goto unwind;
 	}
 
+	mem_begin = si_mem_available();
+
 	for (i = 0; i < kfree_nrealthreads; i++) {
 		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
 						  kfree_reader_tasks[i]);

Patch
diff mbox series

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index da94b89cd531..67e0f804ea97 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -12,6 +12,7 @@ 
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/err.h>
@@ -604,6 +605,8 @@  struct kfree_obj {
 	struct rcu_head rh;
 };
 
+long long mem_begin;
+
 static int
 kfree_perf_thread(void *arg)
 {
@@ -611,6 +614,7 @@  kfree_perf_thread(void *arg)
 	long me = (long)arg;
 	struct kfree_obj *alloc_ptr;
 	u64 start_time, end_time;
+	long long mem_during = si_mem_available();
 
 	VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
 	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
@@ -626,6 +630,15 @@  kfree_perf_thread(void *arg)
 	}
 
 	do {
+		// Moving average of memory availability measurements.
+		// Start measuring only from when we are at least 25% into the test.
+		if (loop && kfree_loops > 3 && (loop % (kfree_loops / 4) == 0)) {
+			if (loop == (kfree_loops / 4))
+				mem_during = si_mem_available();
+			else
+				mem_during = (mem_during + si_mem_available()) / 2;
+		}
+
 		for (i = 0; i < kfree_alloc_num; i++) {
 			alloc_ptr = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL);
 			if (!alloc_ptr)
@@ -645,9 +658,13 @@  kfree_perf_thread(void *arg)
 		else
 			b_rcu_gp_test_finished = cur_ops->get_gp_seq();
 
-		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
+		// The "memory footprint" field represents how much in-flight
+		// memory is allocated during the test and waiting to be freed.
+		pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld, memory footprint: %lldMB\n",
 		       (unsigned long long)(end_time - start_time), kfree_loops,
-		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
+		       rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started),
+		       (mem_begin - mem_during) >> (20 - PAGE_SHIFT));
+
 		if (shutdown) {
 			smp_mb(); /* Assign before wake. */
 			wake_up(&shutdown_wq);
@@ -719,6 +736,8 @@  kfree_perf_init(void)
 		goto unwind;
 	}
 
+	mem_begin = si_mem_available();
+
 	for (i = 0; i < kfree_nrealthreads; i++) {
 		firsterr = torture_create_kthread(kfree_perf_thread, (void *)i,
 						  kfree_reader_tasks[i]);