linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Michal Hocko <mhocko@suse.cz>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linux API <linux-api@vger.kernel.org>, Jason Evans <je@fb.com>,
	Shaohua Li <shli@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	yalin.wang2010@gmail.com, Daniel Micay <danielmicay@gmail.com>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
Date: Tue, 3 Nov 2015 19:41:35 -0800	[thread overview]
Message-ID: <CALCETrUuNs=26UQtkU88cKPomx_Bik9mbgUUF9q7Nmh1pQJ4qg@mail.gmail.com> (raw)
In-Reply-To: <1446600367-7976-2-git-send-email-minchan@kernel.org>

On Nov 3, 2015 5:30 PM, "Minchan Kim" <minchan@kernel.org> wrote:
>
> Linux doesn't have an ability to free pages lazy while other OS already
> have been supported that named by madvise(MADV_FREE).
>
> The gain is clear that kernel can discard freed pages rather than swapping
> out or OOM if memory pressure happens.
>
> Without memory pressure, freed pages would be reused by userspace without
> another additional overhead(ex, page fault + allocation + zeroing).
>

[...]

>
> How it works:
>
> When madvise syscall is called, VM clears dirty bit of ptes of the range.
> If memory pressure happens, VM checks dirty bit of page table and if it
> found still "clean", it means it's a "lazyfree pages" so VM could discard
> the page instead of swapping out.  Once there was store operation for the
> page before VM peek a page to reclaim, dirty bit is set so VM can swap out
> the page instead of discarding.

What happens if you MADV_FREE something that's MAP_SHARED or isn't
ordinary anonymous memory?  There's a long history of MADV_DONTNEED on
such mappings causing exploitable problems, and I think it would be
nice if MADV_FREE were obviously safe.

Does this set the write protect bit?

What happens on architectures without hardware dirty tracking?  For
that matter, even on architecture with hardware dirty tracking, what
happens in multithreaded processes that have the dirty TLB state
cached in a different CPU's TLB?

Using the dirty bit for these semantics scares me.  This API creates a
page that can have visible nonzero contents and then can
asynchronously and magically zero itself thereafter.  That makes me
nervous.  Could we use the accessed bit instead?  Then the observable
semantics would be equivalent to having MADV_FREE either zero the page
or do nothing, except that it doesn't make up its mind until the next
read.

> +                       ptent = pte_mkold(ptent);
> +                       ptent = pte_mkclean(ptent);
> +                       set_pte_at(mm, addr, pte, ptent);
> +                       tlb_remove_tlb_entry(tlb, pte, addr);

It looks like you are flushing the TLB.  In a multithreaded program,
that's rather expensive.  Potentially silly question: would it be
better to just zero the page immediately in a multithreaded program
and then, when swapping out, check the page is zeroed and, if so, skip
swapping it out?  That could be done without forcing an IPI.

> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> +                       unsigned long start_addr, unsigned long end_addr)
> +{
> +       unsigned long start, end;
> +       struct mm_struct *mm = vma->vm_mm;
> +       struct mmu_gather tlb;
> +
> +       if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> +               return -EINVAL;
> +
> +       /* MADV_FREE works for only anon vma at the moment */
> +       if (!vma_is_anonymous(vma))
> +               return -EINVAL;

Does anything weird happen if it's shared?

> +               if (!PageDirty(page) && (flags & TTU_FREE)) {
> +                       /* It's a freeable page by MADV_FREE */
> +                       dec_mm_counter(mm, MM_ANONPAGES);
> +                       goto discard;
> +               }

Does something clear TTU_FREE the next time the page gets marked clean?

--Andy

  parent reply	other threads:[~2015-11-04  3:41 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-11-04  1:25 [PATCH v2 00/13] MADV_FREE support Minchan Kim
2015-11-04  1:25 ` [PATCH v2 01/13] mm: support madvise(MADV_FREE) Minchan Kim
2015-11-04  2:16   ` Sergey Senozhatsky
2015-11-04 23:39     ` Minchan Kim
2015-11-05  3:41       ` Sergey Senozhatsky
2015-11-04  2:29   ` Sergey Senozhatsky
2015-11-04 23:40     ` Minchan Kim
2015-11-04  3:41   ` Andy Lutomirski [this message]
2015-11-04  5:50     ` Daniel Micay
2015-11-04  5:53       ` Daniel Micay
2015-11-04  6:04         ` Daniel Micay
2015-11-04 18:23       ` Andy Lutomirski
2015-11-04 22:05         ` Daniel Micay
2015-11-05 18:17           ` Shaohua Li
2015-11-05 20:13             ` Daniel Micay
2015-11-05 20:14               ` Daniel Micay
2015-11-05  0:13     ` Minchan Kim
2015-11-05  0:42       ` Andy Lutomirski
2015-11-05  0:56         ` Minchan Kim
2015-11-05  1:29           ` Andy Lutomirski
2015-11-05  1:48             ` Minchan Kim
2015-11-04 20:00   ` Shaohua Li
2015-11-04 21:16     ` Daniel Micay
2015-11-04 21:29       ` Daniel Micay
2015-11-04 21:43     ` Andy Lutomirski
2015-11-05  1:33     ` Minchan Kim
2015-11-05  1:37       ` Minchan Kim
2015-12-01 22:30     ` John Stultz
2015-11-04  1:25 ` [PATCH v2 02/13] mm: define MADV_FREE for some arches Minchan Kim
2015-11-04  1:25 ` [PATCH v2 03/13] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures Minchan Kim
2015-11-04  1:25 ` [PATCH v2 04/13] mm: free swp_entry in madvise_free Minchan Kim
2015-11-04  1:25 ` [PATCH v2 05/13] mm: move lazily freed pages to inactive list Minchan Kim
2015-11-04  1:26 ` [PATCH v2 06/13] mm: clear PG_dirty to mark page freeable Minchan Kim
2015-11-04  1:26 ` [PATCH v2 07/13] mm: mark stable page dirty in KSM Minchan Kim
2015-11-04  1:26 ` [PATCH v2 08/13] x86: add pmd_[dirty|mkclean] for THP Minchan Kim
2015-11-04  1:26 ` [PATCH v2 09/13] sparc: " Minchan Kim
2015-11-04  1:26 ` [PATCH v2 10/13] powerpc: " Minchan Kim
2015-11-04  1:26 ` [PATCH v2 11/13] arm: add pmd_mkclean " Minchan Kim
2015-11-04  1:26 ` [PATCH v2 12/13] arm64: " Minchan Kim
2015-11-04  1:26 ` [PATCH v2 13/13] mm: don't split THP page when syscall is called Minchan Kim
2015-12-05 11:10 ` [PATCH v2 00/13] MADV_FREE support Pavel Machek
2015-12-05 15:51   ` Daniel Micay

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALCETrUuNs=26UQtkU88cKPomx_Bik9mbgUUF9q7Nmh1pQJ4qg@mail.gmail.com' \
    --to=luto@amacapital.net \
    --cc=akpm@linux-foundation.org \
    --cc=danielmicay@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=je@fb.com \
    --cc=kirill@shutemov.name \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=mtk.manpages@gmail.com \
    --cc=riel@redhat.com \
    --cc=shli@kernel.org \
    --cc=yalin.wang2010@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).