linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: SeongJae Park <sjpark@amazon.com>
To: SeongJae Park <sj38.park@gmail.com>
Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
	SeongJae Park <sjpark@amazon.com>, <akpm@linux-foundation.org>,
	SeongJae Park <sjpark@amazon.de>, <aarcange@redhat.com>,
	<acme@kernel.org>, <alexander.shishkin@linux.intel.com>,
	<amit@kernel.org>, <brendan.d.gregg@gmail.com>,
	<brendanhiggins@google.com>, <cai@lca.pw>,
	<colin.king@canonical.com>, <corbet@lwn.net>, <dwmw@amazon.com>,
	<jolsa@redhat.com>, <kirill@shutemov.name>,
	<mark.rutland@arm.com>, <mgorman@suse.de>, <minchan@kernel.org>,
	<mingo@redhat.com>, <namhyung@kernel.org>, <peterz@infradead.org>,
	<rdunlap@infradead.org>, <riel@surriel.com>,
	<rientjes@google.com>, <rostedt@goodmis.org>, <shuah@kernel.org>,
	<vbabka@suse.cz>, <vdavydov.dev@gmail.com>,
	<yang.shi@linux.alibaba.com>, <ying.huang@intel.com>,
	<linux-mm@kvack.org>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: Re: Re: [PATCH v7 04/15] mm/damon: Implement region based sampling
Date: Fri, 3 Apr 2020 12:30:59 +0200	[thread overview]
Message-ID: <20200403103059.12762-1-sjpark@amazon.com> (raw)
In-Reply-To: <20200402180022.32671-1-sj38.park@gmail.com> (raw)

On Thu,  2 Apr 2020 20:00:22 +0200 SeongJae Park <sj38.park@gmail.com> wrote:

> On Thu, 2 Apr 2020 18:24:01 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> > On Thu, 2 Apr 2020 15:59:59 +0200
> > SeongJae Park <sjpark@amazon.com> wrote:
> > 
> > > On Wed, 1 Apr 2020 10:22:22 +0200 SeongJae Park <sjpark@amazon.com> wrote:
> > > 
> > > > On Tue, 31 Mar 2020 17:02:33 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > > >   
> > > > > On Wed, 18 Mar 2020 12:27:11 +0100
> > > > > SeongJae Park <sjpark@amazon.com> wrote:
> > > > >   
> > > > > > From: SeongJae Park <sjpark@amazon.de>
[...]
> > > > > > +static void damon_pte_pmd_mkold(pte_t *pte, pmd_t *pmd)
> > > > > > +{
> > > > > > +	if (pte) {
> > > > > > +		if (pte_young(*pte)) {
> > > > > > +			clear_page_idle(pte_page(*pte));
> > > > > > +			set_page_young(pte_page(*pte));
> > > > > > +		}
> > > > > > +		*pte = pte_mkold(*pte);
> > > > > > +		return;
> > > > > > +	}
> > > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > > +	if (pmd) {
> > > > > > +		if (pmd_young(*pmd)) {
> > > > > > +			clear_page_idle(pmd_page(*pmd));
> > > > > > +			set_page_young(pmd_page(*pmd));
> > > > > > +		}
> > > > > > +		*pmd = pmd_mkold(*pmd);
> > > > > > +	}
> > > > > > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */  
> > > > > 
> > > > > No need to flush the TLBs?  
> > > > 
> > > > Good point!
> > > > 
> > > > I have intentionally skipped TLB flushing here to minimize the performance
> > > > effect to the target workload.  I also thought this might not degrade the
> > > > monitoring accuracy so much because we are targetting for the DRAM level
> > > > accesses of memory-intensive workloads, which might make TLB flood frequently.
> > > > 
> > > > However, your comment makes me thinking differently now.  By flushing the TLB
> > > > here, we will increase up to `number_of_regions` TLB misses for sampling
> > > > interval.  This might be not a huge overhead.  Also, improving the monitoring
> > > > accuracy makes no harm at all.  I even didn't measured the overhead.
> > > > 
> > > > I will test the overhead and if it is not significant, I will make this code to
> > > > flush TLB, in the next spin.  
> > > 
> > > Hmm, it seems like 'page_idle.c' is also modifying the Accessed bit but doesn't
> > > flush related TLB entries.  If I'm not missing something here, I would like to
> > > leave this part as is to make the behavior consistent.
> > 
> > Interesting.  In that usecase, the risk is that the MMU believes
> > the page still has the accessed bit set when we have cleared it and hence
> > the accessed bit is not written out to the table in memory.
> > 
> > That will give them a wrong decision so not great and would lead to them
> > thinking more pages are idle than are.
> > 
> > Here we could have a particular TLB entry for a huge page in which
> > a region lies entirely.  Because we don't flush the TLB each time
> > we could end with a count of 0 accesses when it should be the maximum.
> > A very frequently accessed page might well sit in the TLB for a very
> > long time (particularly if the TLB is running a clever eviction
> > strategy).
> > 
> > I think we would want to be test this and see if we get that
> > pathological case sometimes.  Also worth benchmarking if it actually
> > costs us very much to do the flushes.
> 
> Agreed, it wouldn't be late to make a decision after measuring the real cost.
> I will share the measurement results soon.  Meanwhile, it would be helpful if
> anyone could confirm whether page_idle.c is skipping TLB flushing and explain
> why such decision has made.

I just finished implementing TLB flushing in straightforward way (the diff for
this change is at the bottom of this mail) on the version I applied your
recommended changes and my one bug fix (setting 'last_accessed' false).

It shows monitoring accuracy improvement as expected, though it is not so big.
I compared the visualized access patterns of each test workload to check this.
There is no big difference, but the visualized pattern of TLB flushing version
seems a little bit more clear to my human eye.

However, the overhead is clear and significant to some workloads including
parsec3/streamcluster, splash2x/barnes, splash2x/lu_ncb and splash2x/volrend.
The CPU utilization of kdamond, the deamon monitoring the access pattern, never
exceeds 2% of single CPU time for the 4 workloads before the change is applied,
but it frequently exceeds 30% of single CPU time with the TLB flushing.  The
runtimes of the monitoring target workloads also increased.  In case of
parsec3/streamcluster, the TLB flushing changed its runtime from 320 seconds to
470 seconds.

So, it seems the benefit of TLB flushing is smaller than the cost in some
cases.  Thus, I think enabling TLB flushes by default wouldn't be a good
decision.  Rather than that, it could be better to allow users to manually do
TLB flushing for their specific workloads.  This will be easily available by
using the DAMON's sampling callback functions.  Also, I am planning to let
users to configure DAMON with their own access check / sampling setup functions
in future, to support not only virtual memory but also physical memory and some
other special cases.

If I'm missing something or you have other thinking, please let me know.


Thanks,
SeongJae Park

=============================== >8 ============================================

Below is the essential diff for the TLB flushing I implemented.  To double
check the overhead is really from the TLB flushing, I also measured the
overhead of the additional works including the vma searching by commenting out
the 'flush_tlb_range()' call.  After commenting out it, the CPU utilization of
kdamond and runtime of target workloads has restored back.  So, the overhead
seems really came from the TLB flushing.

@@ -408,6 +411,9 @@ static void kdamond_prepare_sampling(struct damon_ctx *ctx,
        pte_t *pte = NULL;
        pmd_t *pmd = NULL;
        spinlock_t *ptl;
+       struct vm_area_struct *vma;
+       unsigned long tlb_start;
+       unsigned long tlb_size = PAGE_SIZE;

        r->sampling_addr = damon_rand(ctx, r->vm_start, r->vm_end);

@@ -420,18 +426,29 @@ static void kdamond_prepare_sampling(struct damon_ctx *ctx,
                        set_page_young(pte_page(*pte));
                }
                *pte = pte_mkold(*pte);
-               return;
        }
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-       if (pmd) {
+       else if (pmd) {
                if (pmd_young(*pmd)) {
                        clear_page_idle(pmd_page(*pmd));
                        set_page_young(pmd_page(*pmd));
                }
                *pmd = pmd_mkold(*pmd);
+               tlb_size = ((1UL) << HPAGE_PMD_SHIFT);
        }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
        spin_unlock(ptl);
+
+       tlb_start = ALIGN(r->sampling_addr, tlb_size);
+
+       down_read(&mm->mmap_sem);
+       vma = find_vma(mm, r->sampling_addr);
+       if (!vma || (r->sampling_addr < vma->vm_start))
+               goto out;
+       flush_tlb_range(vma, tlb_start, tlb_start + tlb_size);
+
+out:
+       up_read(&mm->mmap_sem);
 }


Thanks,
SeongJae Park


  reply	other threads:[~2020-04-03 10:31 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-18 11:27 [PATCH v7 00/15] Introduce Data Access MONitor (DAMON) SeongJae Park
2020-03-18 11:27 ` [PATCH v7 01/15] scripts/spelling: Add a few more typos SeongJae Park
2020-03-18 11:27 ` [PATCH v7 02/15] mm/page_ext: Export lookup_page_ext() to GPL modules SeongJae Park
2020-03-18 11:27 ` [PATCH v7 03/15] mm: Introduce Data Access MONitor (DAMON) SeongJae Park
2020-03-18 11:27 ` [PATCH v7 04/15] mm/damon: Implement region based sampling SeongJae Park
2020-03-31 16:02   ` Jonathan Cameron
2020-04-01  8:22     ` SeongJae Park
2020-04-01 14:24       ` Jonathan Cameron
2020-04-01 14:46         ` SeongJae Park
2020-04-02 13:59       ` SeongJae Park
2020-04-02 17:24         ` Jonathan Cameron
2020-04-02 18:00           ` SeongJae Park
2020-04-03 10:30             ` SeongJae Park [this message]
2020-04-02 15:07   ` SeongJae Park
2020-03-18 11:27 ` [PATCH v7 05/15] mm/damon: Adaptively adjust regions SeongJae Park
2020-03-31 16:08   ` Jonathan Cameron
2020-04-01  8:22     ` SeongJae Park
2020-03-18 11:27 ` [PATCH v7 06/15] mm/damon: Apply dynamic memory mapping changes SeongJae Park
2020-03-18 11:27 ` [PATCH v7 07/15] mm/damon: Implement callbacks SeongJae Park
2020-03-18 11:27 ` [PATCH v7 08/15] mm/damon: Implement access pattern recording SeongJae Park
2020-03-18 11:27 ` [PATCH v7 09/15] mm/damon: Add debugfs interface SeongJae Park
2020-03-18 11:27 ` [PATCH v7 10/15] mm/damon: Add tracepoints SeongJae Park
2020-03-18 11:27 ` [PATCH v7 11/15] tools: Add a minimal user-space tool for DAMON SeongJae Park
2020-03-18 11:27 ` [PATCH v7 12/15] Documentation/admin-guide/mm: Add a document " SeongJae Park
2020-03-18 11:27 ` [PATCH v7 13/15] mm/damon: Add kunit tests SeongJae Park
2020-03-18 11:38 ` [PATCH v7 14/15] mm/damon: Add user space selftests SeongJae Park
2020-03-18 11:39 ` [PATCH v7 15/15] MAINTAINERS: Update for DAMON SeongJae Park
2020-03-30 10:53 ` [PATCH v7 00/15] Introduce Data Access MONitor (DAMON) SeongJae Park

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200403103059.12762-1-sjpark@amazon.com \
    --to=sjpark@amazon.com \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=aarcange@redhat.com \
    --cc=acme@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=amit@kernel.org \
    --cc=brendan.d.gregg@gmail.com \
    --cc=brendanhiggins@google.com \
    --cc=cai@lca.pw \
    --cc=colin.king@canonical.com \
    --cc=corbet@lwn.net \
    --cc=dwmw@amazon.com \
    --cc=jolsa@redhat.com \
    --cc=kirill@shutemov.name \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mark.rutland@arm.com \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=rdunlap@infradead.org \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=rostedt@goodmis.org \
    --cc=shuah@kernel.org \
    --cc=sj38.park@gmail.com \
    --cc=sjpark@amazon.de \
    --cc=vbabka@suse.cz \
    --cc=vdavydov.dev@gmail.com \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).