linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@kernel.org>
To: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Johannes Weiner <hannes@cmpxchg.org>,
	Tim Murray <timmurray@google.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Daniel Colascione <dancol@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Sonny Rao <sonnyrao@google.com>,
	Brian Geffon <bgeffon@google.com>,
	jannh@google.com, oleg@redhat.com, christian@brauner.io,
	oleksandr@redhat.com, hdanton@sina.com, lizeb@google.com
Subject: Re: [PATCH v2 1/5] mm: introduce MADV_COLD
Date: Wed, 19 Jun 2019 14:56:12 +0200	[thread overview]
Message-ID: <20190619125611.GO2968@dhcp22.suse.cz> (raw)
In-Reply-To: <20190610111252.239156-2-minchan@kernel.org>

On Mon 10-06-19 20:12:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inacdtive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> file LRU's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. It makes sense for
> implmentaion point of view, too because it's not swapbacked memory any
> longer until it would be re-dirtied. Even, it could give a bonus to make
> them be reclaimed on swapless system. However, MADV_COLD doesn't mean
> garbage so reclaiming them requires swap-out/in in the end so it's bigger
> cost. Since we have designed VM LRU aging based on cost-model, anonymous
> cold pages would be better to position inactive anon's LRU list, not file
> LRU. Furthermore, it would help to avoid unnecessary scanning if system
> doesn't have a swap device. Let's start simpler way without adding
> complexity at this moment.

I would only add that it is a caveat that workloads with a lot of page
cache are likely to ignore MADV_COLD on anonymous memory because we
rarely age anonymous LRU lists.

[...]
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{

This is duplicating a large part of madvise_free_pte_range with some
subtle differences which are not explained anywhere (e.g. why does
madvise_free_huge_pmd need try_lock on a page while not here? etc.).

Why cannot we reuse a large part of that code and differ essentially on
the reclaim target check and action? Have you considered to consolidate
the code to share as much as possible? Maybe that is easier said than
done because the devil is always in details...

I would definitely feel much more comfortable to review the code without
thinking about all those subtle details that have been already solved
before. Especially all the THP ones.

Other than that the patch looks sane to me.

> +	struct mmu_gather *tlb = walk->private;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		pmd_t orig_pmd;
> +
> +		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		orig_pmd = *pmd;
> +		if (is_huge_zero_pmd(orig_pmd))
> +			goto huge_unlock;
> +
> +		if (unlikely(!pmd_present(orig_pmd))) {
> +			VM_BUG_ON(thp_migration_supported() &&
> +					!is_pmd_migration_entry(orig_pmd));
> +			goto huge_unlock;
> +		}
> +
> +		page = pmd_page(orig_pmd);
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			if (page_mapcount(page) != 1)
> +				goto huge_unlock;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (pmd_young(orig_pmd)) {
> +			pmdp_invalidate(vma, addr, pmd);
> +			orig_pmd = pmd_mkold(orig_pmd);
> +
> +			set_pmd_at(mm, addr, pmd, orig_pmd);
> +			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> +		}
> +
> +		test_and_clear_page_young(page);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:
> +	tlb_change_page_size(tlb, PAGE_SIZE);
> +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	flush_tlb_batched_pending(mm);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (pte_young(ptent)) {
> +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> +							tlb->fullmm);
> +			ptent = pte_mkold(ptent);
> +			set_pte_at(mm, addr, pte, ptent);
> +			tlb_remove_tlb_entry(tlb, pte, addr);
> +		}
> +
> +		/*
> +		 * We are deactivating a page for accelerating reclaiming.
> +		 * VM couldn't reclaim the page unless we clear PG_young.
> +		 * As a side effect, it makes confuse idle-page tracking
> +		 * because they will miss recent referenced history.
> +		 */
> +		test_and_clear_page_young(page);
> +		deactivate_page(page);
> +	}
> +
> +	arch_enter_lazy_mmu_mode();
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
-- 
Michal Hocko
SUSE Labs

  reply	other threads:[~2019-06-19 12:56 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-10 11:12 [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT Minchan Kim
2019-06-10 11:12 ` [PATCH v2 1/5] mm: introduce MADV_COLD Minchan Kim
2019-06-19 12:56   ` Michal Hocko [this message]
2019-06-20  0:06     ` Minchan Kim
2019-06-20  7:08       ` Michal Hocko
2019-06-20  8:44         ` Minchan Kim
2019-06-10 11:12 ` [PATCH v2 2/5] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
2019-06-19 13:09   ` Michal Hocko
2019-06-10 11:12 ` [PATCH v2 3/5] mm: account nr_isolated_xxx in [isolate|putback]_lru_page Minchan Kim
2019-06-10 11:12 ` [PATCH v2 4/5] mm: introduce MADV_PAGEOUT Minchan Kim
2019-06-19 13:24   ` Michal Hocko
2019-06-20  4:16     ` Minchan Kim
2019-06-20  7:04       ` Michal Hocko
2019-06-20  8:40         ` Minchan Kim
2019-06-20  9:22           ` Michal Hocko
2019-06-20 10:32             ` Minchan Kim
2019-06-20 10:55               ` Michal Hocko
2019-06-10 11:12 ` [PATCH v2 5/5] mm: factor out pmd young/dirty bit handling and THP split Minchan Kim
2019-06-10 18:03 ` [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT Dave Hansen
2019-06-13  4:51   ` Minchan Kim
2019-06-12 10:59 ` Pavel Machek
2019-06-12 11:19   ` Oleksandr Natalenko
2019-06-12 11:37     ` Pavel Machek
2019-06-19 12:27 ` Michal Hocko
2019-06-19 23:42   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190619125611.GO2968@dhcp22.suse.cz \
    --to=mhocko@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bgeffon@google.com \
    --cc=christian@brauner.io \
    --cc=dancol@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hdanton@sina.com \
    --cc=jannh@google.com \
    --cc=joel@joelfernandes.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizeb@google.com \
    --cc=minchan@kernel.org \
    --cc=oleg@redhat.com \
    --cc=oleksandr@redhat.com \
    --cc=shakeelb@google.com \
    --cc=sonnyrao@google.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).