From: Michal Hocko <mhocko@kernel.org>
To: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Tim Murray <timmurray@google.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Daniel Colascione <dancol@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Sonny Rao <sonnyrao@google.com>,
	Brian Geffon <bgeffon@google.com>,
	jannh@google.com, oleg@redhat.com, christian@brauner.io,
	oleksandr@redhat.com, hdanton@sina.com
Subject: Re: [RFCv2 1/6] mm: introduce MADV_COLD
Date: Tue, 4 Jun 2019 08:56:57 +0200
Message-ID: <20190604065657.GC4669@dhcp22.suse.cz>
In-Reply-To: <20190603230205.GA43390@google.com>

On Tue 04-06-19 08:02:05, Minchan Kim wrote:
> Hi Johannes,
>
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > > > > > happens but data should be preserved for future use. This could reduce
> > > > > > > > > > workingset eviction so it ends up increasing performance.
> > > > > > > > > >
> > > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > > > > > pages to evict early during memory pressure.
> > > > > > > > > >
> > > > > > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > > > > > head if the page is private because inactive list could be full of
> > > > > > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > > > > > active pages unless there is no access until the time.
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file and private pages could be deactivated from each active LRU
> > > > > > > > to each inactive LRU if the page has one map_count. In other words,
> > > > > > > >
> > > > > > > >     if (page_mapcount(page) <= 1)
> > > > > > > >         deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because the page table of another process sharing the page would still
> > > > > > have the access bit set, so in the end we couldn't reclaim the page. The
> > > > > > more processes share it, the more likely reclaim is to fail.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count.
> > > > > I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > >
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > >
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > >
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > >
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > >
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > a pagecache with a sufficient privileges sounds like a useful usecase to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> >
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly.
> > Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> >
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> >
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
>                      ^
>                      old?
>
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> You are confused by the name change. Here, MADV_COLD is deactivating,
> not paging out. Therefore, shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> what you wrote was a typo).
> I guess that's exactly what Michal wanted: just removing the page_mapcount
> check and deferring the decision to the normal page reclaim policy.
> If I didn't miss your intention, it seems you and Michal are on the same page.
> (Please correct me if you meant something else.)

Indeed.

> I could drop the page_mapcount check in the next revision.

Yes please.
--
Michal Hocko
SUSE Labs
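
For readers following the thread, the two policies under discussion can be sketched as a small userspace toy model. This is not kernel code: `struct toy_page`, `MAX_MAPPINGS`, `cold_with_mapcount_check()`, `cold_without_mapcount_check()` and `toy_reclaim_would_evict()` are hypothetical names invented for illustration, and the reclaim rule is deliberately simplified. It also assumes that both variants age the caller's pte and differ only in whether deactivation is gated on the map count; the thread does not spell out that ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MAPPINGS 8 /* hypothetical bound for the toy model */

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct toy_page {
    bool active;                  /* on the active LRU list? */
    size_t nr_mappings;           /* number of page tables mapping it */
    bool pte_young[MAX_MAPPINGS]; /* per-mapping accessed bit */
};

/*
 * Policy quoted from RFCv2: deactivate only single-mapped pages.
 * Returns true if the page was deactivated.
 */
static bool cold_with_mapcount_check(struct toy_page *page, size_t caller)
{
    assert(caller < page->nr_mappings);
    page->pte_young[caller] = false; /* age only the caller's pte */
    if (page->nr_mappings <= 1) {
        page->active = false;
        return true;
    }
    return false;
}

/*
 * Direction agreed in the thread: drop the map count check and leave
 * the handling of other references to normal reclaim.
 */
static void cold_without_mapcount_check(struct toy_page *page, size_t caller)
{
    assert(caller < page->nr_mappings);
    page->pte_young[caller] = false; /* age only the caller's pte */
    page->active = false;            /* deactivate regardless of sharing */
}

/*
 * Simplified reclaim decision: an inactive page is evicted only if no
 * mapping has its accessed bit set. Real reclaim is more involved.
 */
static bool toy_reclaim_would_evict(const struct toy_page *page)
{
    if (page->active)
        return false;
    for (size_t i = 0; i < page->nr_mappings; i++) {
        if (page->pte_young[i])
            return false;
    }
    return true;
}
```

In this model a page shared by two mappings is left on the active list by the first variant, while the second variant deactivates it but reclaim still declines to evict it as long as the other mapping's accessed bit is set. That matches the intent expressed above: the caller speaks only for its own reference, and the decision about other users stays with the reclaim policy.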