linux-mm.kvack.org archive mirror
From: SeongJae Park <sjpark@amazon.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: SeongJae Park <sjpark@amazon.com>,
	<alexander.shishkin@linux.intel.com>, <linux-mm@kvack.org>,
	<akpm@linux-foundation.org>, SeongJae Park <sjpark@amazon.de>,
	<aarcange@redhat.com>, <acme@kernel.org>, <amit@kernel.org>,
	<brendan.d.gregg@gmail.com>, <brendanhiggins@google.com>,
	<cai@lca.pw>, <colin.king@canonical.com>, <corbet@lwn.net>,
	<dwmw@amazon.com>, <jolsa@redhat.com>, <kirill@shutemov.name>,
	<mark.rutland@arm.com>, <mgorman@suse.de>, <minchan@kernel.org>,
	<mingo@redhat.com>, <namhyung@kernel.org>, <peterz@infradead.org>,
	<rdunlap@infradead.org>, <riel@surriel.com>,
	<rientjes@google.com>, <rostedt@goodmis.org>,
	<shakeelb@google.com>, <shuah@kernel.org>, <sj38.park@gmail.com>,
	<vbabka@suse.cz>, <vdavydov.dev@gmail.com>,
	<yang.shi@linux.alibaba.com>, <ying.huang@intel.com>,
	<linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: Re: [RFC v5 0/7] Implement Data Access Monitoring-based Memory Operation Schemes
Date: Wed, 1 Apr 2020 10:21:50 +0200	[thread overview]
Message-ID: <20200401082150.21124-1-sjpark@amazon.com> (raw)
In-Reply-To: <20200331173908.0000696f@Huawei.com> (raw)

On Tue, 31 Mar 2020 17:39:08 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Tue, 31 Mar 2020 18:18:19 +0200
> SeongJae Park <sjpark@amazon.com> wrote:
> 
> > On Tue, 31 Mar 2020 16:51:55 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > 
> > > On Mon, 30 Mar 2020 13:50:35 +0200
> > > SeongJae Park <sjpark@amazon.com> wrote:
> > >   
> > > > From: SeongJae Park <sjpark@amazon.de>
> > > > 
> > > > DAMON[1] can be used as a primitive for data access aware memory management
> > > > optimizations.  That said, users who want such optimizations must run DAMON,
> > > > read the monitoring results, analyze them, plan a new memory management
> > > > scheme, and apply the new scheme by themselves.  Such efforts are inevitable
> > > > for some complicated optimizations.
> > > > 
> > > > However, in many other cases, users may simply want the system to apply a
> > > > memory management action to memory regions of a specific size having a
> > > > specific access frequency for a specific time.  For example, "page out a
> > > > memory region larger than 100 MiB keeping only rare accesses for more than
> > > > 2 minutes", or "do not use THP for a memory region larger than 2 MiB rarely
> > > > accessed for more than 1 second".
> > > > 
> > > > This RFC patchset makes DAMON handle such data access monitoring-based
> > > > operation schemes.  With this change, users can do data access aware
> > > > optimizations by simply specifying their schemes to DAMON.  
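
The scheme concept described above pairs an access-pattern condition (region size, access frequency, and how long the pattern has lasted) with a management action.  A minimal illustrative sketch in Python, purely to make the two quoted example schemes concrete — the field names and types here are hypothetical, and the patchset's real interface is a kernel debugfs file, not Python:

```python
# Hypothetical sketch of a data access monitoring-based operation scheme.
# Field names are illustrative; the real interface is a kernel debugfs file.
from dataclasses import dataclass

MiB = 1 << 20
US_PER_SEC = 10**6
NO_LIMIT = 2**63 - 1  # stands for "no upper bound"

@dataclass
class Scheme:
    min_size: int      # region size bounds, in bytes
    max_size: int
    min_freq: int      # observed accesses per monitoring snapshot
    max_freq: int
    min_age_us: int    # how long the pattern has persisted, microseconds
    max_age_us: int
    action: str        # e.g. "pageout", "hugepage", "nohugepage"

    def matches(self, size, freq, age_us):
        # A region gets the action when it falls inside all three ranges.
        return (self.min_size <= size <= self.max_size
                and self.min_freq <= freq <= self.max_freq
                and self.min_age_us <= age_us <= self.max_age_us)

# "Page out a memory region larger than 100 MiB keeping only rare
#  accesses for more than 2 minutes"
pageout_cold = Scheme(100 * MiB, NO_LIMIT, 0, 1,
                      2 * 60 * US_PER_SEC, NO_LIMIT, "pageout")

# "Do not use THP for a memory region larger than 2 MiB rarely
#  accessed for more than 1 second"
no_thp_cold = Scheme(2 * MiB, NO_LIMIT, 0, 1,
                     1 * US_PER_SEC, NO_LIMIT, "nohugepage")

print(pageout_cold.matches(200 * MiB, 0, 3 * 60 * US_PER_SEC))  # True
```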
[...]
> > > > 
> > > > Efficient THP
> > > > ~~~~~~~~~~~~~
> > > > 
> > > > The THP 'always' enabled policy achieves a 5.57% speedup but incurs 7.29%
> > > > memory overhead.  It achieves a 41.62% speedup in the best case, but 79.98%
> > > > memory overhead in the worst case.  Interestingly, both the best and worst
> > > > cases are with 'splash2x/ocean_ncp'.  
> > > 
> > > The results above don't seem to support this any more? 
> > >   
> > > > runtime                 orig     rec      (overhead) thp      (overhead) ethp     (overhead) prcl     (overhead)
> > > > splash2x/ocean_ncp      86.927   87.065   (0.16)     50.747   (-41.62)   86.855   (-0.08)    199.553  (129.57)   
> > 
> > Hmm... But I don't get your point...  In the data, the 'thp' column means
> > the THP 'always' enabled policy, and the following column shows its overhead
> > compared to 'orig', in percent.  Thus, the data says the kernel with THP
> > 'always' enabled consumes 50.747 seconds to finish splash2x/ocean_ncp,
> > while the THP-disabled original kernel consumes 86.927 seconds.
> 
> ah. I got myself confused. 
> 
> However, I was expecting to see a significant performance advantage
> for ethp in this particular case, as we did in the previous version.
> 
> In the previous version (posted in reply to v6 of DAMON), ethp had a significant gain:
> 
> runtime                 orig     rec      (overhead) thp      (overhead) ethp     (overhead) prcl     (overhead)
> splash2x/ocean_ncp      81.360   81.434   (0.09)     51.157   (-37.12)   66.711   (-18.00)   91.611   (12.60) 
> 
> So, in ethp we got roughly half the performance back (at the cost of some of the memory)
> 
> That was a result I had been trying to replicate, hence it was at the front of my mind!
> 
> Any idea why that changed so much? 

Ah, I forgot to explain this change.  Thank you for letting me know.

Overall, ETHP in the DAMON-based Operation Schemes RFC v5 shows lower peak
performance gains.

For example, splash2x/fft shows the best-case speedup with ETHP in both this
version and the previous one, but the speedup dropped from 19% to 12%.  In the
case of splash2x/ocean_ncp, it dropped from 18% to only 0.08%.

That said, the total performance speedup has improved: it was 1.83% before and
is 2.21% now.  Several workloads also show better speedup.  In the case of
parsec3/canneal, the speedup rose from 3.86% to 6.34%.

Also note that ETHP's memory savings for the workloads showing less speedup are
much improved.  For example, ETHP's memory overhead for splash2x/ocean_ncp was
24.4% before, but is only 3.5% now.
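
For reference, every '(overhead)' percentage in these tables is the relative
change of a variant against 'orig'; a quick sketch of the arithmetic, using
the splash2x/ocean_ncp and total runtimes quoted in this thread:

```python
# Overhead as reported in the tables: percentage change relative to 'orig'.
# A negative overhead is a speedup.
def overhead_pct(orig, variant):
    return (variant - orig) / orig * 100

# splash2x/ocean_ncp runtimes (seconds): orig vs THP 'always'
print(round(overhead_pct(86.927, 50.747), 2))      # -41.62, i.e. 41.62% speedup

# total runtimes (seconds): orig vs THP 'always' -> the 5.57% total speedup
print(round(overhead_pct(3020.570, 2852.190), 2))  # -5.57
```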

This is because I didn't update the schemes for the updated DAMON.  Basically,
the effect of a scheme is access pattern dependent.  Because DAMON has changed
in ways that can make it report access patterns different from those of the
previous version, the schemes should also be modified to get the best
performance.  However, I didn't update them, because they are only proofs of
concept, not for production.

Fortunately, the change in the reported patterns was not huge.  I also
confirmed this by visually comparing the visualized access patterns of the two
versions.  The overall trend (better performance, less memory overhead) also
changed only subtly.  However, some individual workloads saw remarkable
changes.


Thanks,
SeongJae Park

> 
> Thanks,
> 
> Jonathan
> 
> 
> > Thus, the overhead is ``(50.747 - 86.927) / 86.927 = -0.4162``, in other
> > words a 41.62% speedup.
> > 
> > Also, the 5.57% speedup and 7.29% memory overhead are for the _total_.
> > This data shows it.
> > 
> > > > runtime                 orig     rec      (overhead) thp      (overhead) ethp     (overhead) prcl     (overhead)
> > > > total                   3020.570 3028.080 (0.25)     2852.190 (-5.57)    2953.960 (-2.21)    3276.550 (8.47)      
> > 
> > Maybe I confused you by saying this ambiguously; sorry if so.  Or, if I'm
> > still misunderstanding your point, please let me know.
> 
> 
> > 
> > 
> > Thanks,
> > SeongJae Park
> >  
> > [...]
> 



Thread overview: 12+ messages
2020-03-30 11:50 [RFC v5 0/7] Implement Data Access Monitoring-based Memory Operation Schemes SeongJae Park
2020-03-30 11:50 ` [RFC v5 1/7] mm/madvise: Export do_madvise() to external GPL modules SeongJae Park
2020-03-30 11:50 ` [RFC v5 2/7] mm/damon: Account age of target regions SeongJae Park
2020-03-30 11:50 ` [RFC v5 3/7] mm/damon: Implement data access monitoring-based operation schemes SeongJae Park
2020-03-30 11:50 ` [RFC v5 4/7] mm/damon/schemes: Implement a debugfs interface SeongJae Park
2020-03-30 11:50 ` [RFC v5 5/7] mm/damon-test: Add kunit test case for regions age accounting SeongJae Park
2020-03-30 11:50 ` [RFC v5 6/7] mm/damon/selftests: Add 'schemes' debugfs tests SeongJae Park
2020-03-30 11:50 ` [RFC v5 7/7] damon/tools: Support more human friendly 'schemes' control SeongJae Park
2020-03-31 15:51 ` [RFC v5 0/7] Implement Data Access Monitoring-based Memory Operation Schemes Jonathan Cameron
2020-03-31 16:18   ` SeongJae Park
2020-03-31 16:39     ` Jonathan Cameron
2020-04-01  8:21       ` SeongJae Park [this message]
