From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758020Ab2KVWBW (ORCPT ); Thu, 22 Nov 2012 17:01:22 -0500
Received: from e2.ny.us.ibm.com ([32.97.182.142]:32939 "EHLO e2.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754880Ab2KVWBP (ORCPT ); Thu, 22 Nov 2012 17:01:15 -0500
Message-ID: <50AD739A.30804@linaro.org>
Date: Wed, 21 Nov 2012 16:36:42 -0800
From: John Stultz
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121028 Thunderbird/16.0.2
MIME-Version: 1.0
To: Minchan Kim
CC: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Christoph Lameter, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dave Chinner, Neil Brown,
	Mike Hommey, Taras Glek, KOSAKI Motohiro, KAMEZAWA Hiroyuki
Subject: Re: [RFC v2] Support volatile range for anon vma
References: <1351560594-18366-1-git-send-email-minchan@kernel.org>
In-Reply-To: <1351560594-18366-1-git-send-email-minchan@kernel.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 12112200-5112-0000-0000-00000ED60120
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On 10/29/2012 06:29 PM, Minchan Kim wrote:
> This patch introduces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
>
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area.
> But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.
>
> Volatile concept of Robert Love could be very useful for reducing
> free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> munmap(2)(Of course, they need to manage volatile mmaped area to
> reduce shortage of address space and sometime ends up unmapping them).
> The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap operation because
>
> 1) it just marks the flag in VMA and
> 2) if memory pressure happens, VM can discard pages of volatile VMA
> instead of swapping out when volatile pages is selected as victim
> by normal VM aging policy.
> 3) freed mmaped area doesn't include any meaningful data so there
> is no point to swap them out.
>
> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.
>
> The downside is that we have to age anon lru list although we don't
> have swap because I don't want to discard volatile pages by top priority
> when memory pressure happens as volatile in this patch means "We don't
> need to swap out because user can handle the situation which data are
> disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> So I want to apply same aging rule of normal pages to them.
>
> Anon background aging of non-swap system would be a trade-off for
> getting good feature. Even, we had done it two years ago until merge
> [1] and I believe free(3) performance gain will beat loss of anon lru
> aging's overhead once all of allocator start to use madvise.
> (This patch doesn't include background aging in case of non-swap system
> but it's trivial if we decide)

Hey Minchan!

So I've been looking at your patch for a bit, and I'm still trying to
fully grok it and the rmap code.
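
For illustration, the allocator flow from the quoted description (mark a
freed chunk volatile instead of unmapping it, then mark it non-volatile
before handing it out again) could be sketched roughly as below. This is
a minimal sketch under assumptions, not code from the patch: the
MADV_VOLATILE and MADV_NOVOLATILE values are placeholders invented here
(the real numbers would come from the patched kernel's headers), and
chunk_release() and chunk_reuse() are hypothetical helper names. On a
kernel that rejects the advice value, the helper falls back to
munmap(2).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Placeholder advice values for this sketch only (assumption); a kernel
 * carrying the patch would define the real ones in its headers. */
#ifndef MADV_VOLATILE
#define MADV_VOLATILE   0x100
#endif
#ifndef MADV_NOVOLATILE
#define MADV_NOVOLATILE 0x101
#endif

enum release_result { RELEASE_VOLATILE, RELEASE_UNMAPPED, RELEASE_FAILED };

/* free() path (hypothetical helper): try to keep the mapping and let the
 * kernel discard its pages under memory pressure; if the advice is not
 * accepted, fall back to unmapping the range. */
static enum release_result chunk_release(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    if (madvise(addr, len, MADV_VOLATILE) == 0)
        return RELEASE_VOLATILE;
    if (munmap(addr, len) == 0)
        return RELEASE_UNMAPPED;
    return RELEASE_FAILED;
}

/* malloc() path (hypothetical helper): a volatile range has to be marked
 * non-volatile before it is handed back to the user. Returns madvise()'s
 * result: 0 on success, -1 with errno set on failure. */
static int chunk_reuse(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, MADV_NOVOLATILE);
}
```

A caller would only keep the chunk on its free list when chunk_release()
returns RELEASE_VOLATILE; after RELEASE_UNMAPPED the address range is
gone and must not be reused.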
Overall this approach looks pretty interesting, and while your patch
description focused on malloc/free behavior, I suspect your patch would
satisfy what the Mozilla folks are looking for, and while it's not quite
sufficient yet for Android, the interface semantics are very close to
what I've been wanting (my test cases were easily mapped over).

The two major issues for me are:

1) As you noted, this approach currently doesn't work on non-swap
systems, as we don't try to shrink the anonymous page lrus. This is a
big problem, as it makes it unusable for almost all Android systems. You
suggest we may want to change how the anonymous lru is aged, and I had a
patch earlier that tried to change some of the anonymous lru aging rules
for volatile pages, but it's not quite right for what you have here. So
I'd be interested in hearing how you think the anonymous lru aging
should happen with swapoff.

2) Being able to use this with tmpfs files. I'm currently trying to
better understand the rmap code, looking to see if there's a way to have
try_to_unmap_file() work similarly to try_to_unmap_anon(), to allow
users to madvise() on mmapped tmpfs files. This would provide an
interface very similar to what I've been proposing with
fadvise/fallocate, but just using process virtual addresses instead of
(fd, offset) pairs. The benefit with (fd, offset) pairs for Android is
that it's easier to manage shared volatile ranges between two processes
that are sharing data via an mmapped tmpfs file (although this actual
use case may be fairly rare). I believe we should still be able to
rework the ashmem internals to use madvise (which would provide legacy
support for existing Android apps), so then it's just a question of
whether we could eventually convince Android apps to use the madvise
interface directly, rather than the ashmem unpin ioctl.

The other concern with the madvise on mmapped files approach is that
there's no easy way I can see to limit it to tmpfs files.
I know some have been interested in having a fallocate(VOLATILE)
interface for non-tmpfs files, but I'm not sure I see the benefit there
yet. I have noted folks mixing the idea of volatile pages being purged
under memory pressure with the idea of volatile files, which might be
purged from disk under disk pressure. While I think the second idea is
interesting, I do think it's completely separate from the volatile
memory concept.

Anyway, I'd be interested in your thoughts on these two issues. Thanks
so much for sending out this patch; it's given me quite a bit to chew
on, and I too hope we can merge our different approaches together.

thanks
-john