Date: Thu, 17 Jan 2008 17:23:02 +0100
From: Andrea Arcangeli
To: Izik Eidus
Cc: Rik van Riel, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kvm-devel@lists.sourceforge.net, Avi Kivity, clameter@sgi.com,
	daniel.blueman@quadrics.com, holt@sgi.com, steiner@sgi.com,
	Andrew Morton, Hugh Dickins, Nick Piggin,
	Benjamin Herrenschmidt, andrea@qumranet.com
Subject: Re: [PATCH] mmu notifiers #v2
Message-ID: <20080117162302.GI7170@v2.random>
References: <20080113162418.GE8736@v2.random> <20080116124256.44033d48@bree.surriel.com> <478E4356.7030303@qumranet.com>
In-Reply-To: <478E4356.7030303@qumranet.com>

On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote:
> Rik van Riel wrote:
>> On Sun, 13 Jan 2008 17:24:18 +0100
>> Andrea Arcangeli wrote:
>>
>>> In my basic initial patch I only track the tlb flushes which should be
>>> the minimum required to have a nice linux-VM controlled swapping
>>> behavior of the KVM gphysical memory.
>>
>> I have a vaguely related question on KVM swapping.
>>
>> Do page accesses inside KVM guests get propagated to the host
>> OS, so Linux can choose a reasonable page for eviction, or is
>> the pageout of KVM guest pages essentially random?

Right, the selection of which guest OS pages to swap is partly random,
but only for the long-cached and hot spte entries; it's certainly not
entirely random. As the shadow cache is dynamic, every newly
instantiated spte will already refresh the PG_referenced bit in
follow_page (through the minor fault). Not-present faults on
swapped-out sptes can trigger minor faults from swapcache too, and
those will refresh the young bit of the regular ptes.

> right now when kvm removes a pte from the shadow cache, it marks as
> accessed the page that this pte pointed to.

Yes: the referenced bit isn't useful in the mmu-notifier invalidate
case, because there it's set right before freeing the page.

> it was a good solution until the mmu notifiers, because the pages
> were pinned and couldn't be swapped to disk

It probably still makes sense for sptes removed for other reasons
(i.e. not mmu notifier invalidates).

> so now it will have to do something more sophisticated, or at least
> mark as accessed every page pointed to by a pte that gets inserted
> into the shadow cache....

I think that should already be the case, see the mark_page_accessed
in follow_page (relevant snippet quoted below); FOLL_TOUCH is set
there, isn't it?

The only thing we clearly miss is logic that refreshes the
PG_referenced bitflag for "hot" sptes that remain instantiated and
cached for a long time. For regular linux ptes this is done by the
cpu through the young bitflag. But note that not all architectures
have young-bitflag support in hardware!
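If I read mm/memory.c right, get_user_pages passes FOLL_TOUCH down,
and follow_page then does (quoting from memory, so double-check the
exact code in your tree):

	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);
		mark_page_accessed(page);
	}

So every spte instantiation that goes through get_user_pages should
already refresh PG_referenced on the backing page at fault time; the
problem is only the sptes that stay cached afterwards.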
So I suppose the swapping of the KVM task is like the swapping of any
other task, just as if it was running on an alpha CPU (i.e. no young
bit updated by the hardware). It works well enough in practice, even
if we clearly have room for further optimizations in this area (as
there would be on archs without a hardware-updated young bit, too).

To refresh the PG_referenced bit for long-lived hot sptes, I think
the easiest solution is to chain the sptes in an lru, and to start
dropping them when memory pressure starts; see the sketch at the end
of this mail. We could drop one spte every X pages collected by the
VM. That way the "age" time factor depends on the VM velocity, and we
totally avoid useless shadow page faults when there's no VM pressure.
When VM pressure increases, the kvm non-present fault will then take
care of refreshing the PG_referenced bit. This should solve the aging
issue for long-lived and hot sptes.

This should improve the responsiveness of the guest OS during the
"initial" swap pressure (after the initial swap pressure, the working
set finds itself in ram again), so it should avoid some not-required
swapout/swapin jitter during the initial swap. I see this mostly as a
kvm-internal optimization, not strictly related to the mmu notifiers
though.
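To make the idea concrete, here's a rough sketch of what I mean
(completely untested; active_spte_lru, kvm_spte_lru_touch and
kvm_spte_lru_age are all invented names, and locking and list
initialization are omitted entirely):

	struct kvm_mmu_page {
		/* ... existing fields ... */
		struct list_head lru;	/* chains shadow pages, MRU at head;
					 * INIT_LIST_HEAD'd at allocation */
	};

	/* on every shadow fault that (re)instantiates an spte */
	static void kvm_spte_lru_touch(struct kvm *kvm,
				       struct kvm_mmu_page *sp)
	{
		list_move(&sp->lru, &kvm->arch.active_spte_lru);
	}

	/*
	 * Called by the VM every X pages collected (e.g. from a slab
	 * shrinker): zap the coldest shadow page, so the next guest
	 * access takes a kvm non-present fault, re-runs
	 * get_user_pages/follow_page and so refreshes PG_referenced.
	 */
	static void kvm_spte_lru_age(struct kvm *kvm)
	{
		struct kvm_mmu_page *sp;

		if (list_empty(&kvm->arch.active_spte_lru))
			return;
		sp = list_entry(kvm->arch.active_spte_lru.prev,
				struct kvm_mmu_page, lru);
		kvm_mmu_zap_page(kvm, sp); /* or zap a single spte */
	}

With something along those lines the aging cost is zero while there's
no VM pressure, and under pressure the zap rate (and so the spte
"age") automatically scales with the reclaim velocity.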