From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754439AbcK2D5U (ORCPT <rfc822;w@1wt.eu>);
        Mon, 28 Nov 2016 22:57:20 -0500
Received: from mail-pg0-f52.google.com ([74.125.83.52]:33904 "EHLO
        mail-pg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751064AbcK2D5K (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 28 Nov 2016 22:57:10 -0500
Date: Mon, 28 Nov 2016 19:56:48 -0800 (PST)
From: Hugh Dickins <hughd@google.com>
X-X-Sender: hugh@eggly.anvils
To: "Kirill A. Shutemov" <kirill@shutemov.name>
cc: Hugh Dickins <hughd@google.com>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
        Andrea Arcangeli <aarcange@redhat.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Andi Kleen <ak@linux.intel.com>, Dave Chinner <david@fromorbit.com>,
        Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCHv4] shmem: avoid huge pages for small files
In-Reply-To: <20161114140957.GA9950@node.shutemov.name>
Message-ID: <alpine.LSU.2.11.1611281840000.3299@eggly.anvils>
References: <20161021185103.117938-1-kirill.shutemov@linux.intel.com> <20161021224629.tnwuvruhblkg22qj@black.fi.intel.com> <alpine.LSU.2.11.1611071433340.1384@eggly.anvils> <20161110162540.GA12743@node.shutemov.name> <alpine.LSU.2.11.1611111247580.9200@eggly.anvils>
 <20161114140957.GA9950@node.shutemov.name>
User-Agent: Alpine 2.11 (LSU 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 14 Nov 2016, Kirill A. Shutemov wrote:
> On Fri, Nov 11, 2016 at 01:41:11PM -0800, Hugh Dickins wrote:
> > 
> > Certainly the new condition is easier to understand than the old condition:
> > which is a plus, even though it's hackish (I do dislike hobbling the first
> > extent, when it's an incomplete last extent which deserves to be hobbled -
> > easier said than implemented of course).
> 
> Well, it's just heuristic that I found useful. I don't see a reason to
> make more complex if it works.

You like it because it allocates huge pages to some extents,
but not to all extents.  I dislike it because it allocates
huge pages to the wrong extents.

You did much the same three or four years ago, in your THP-on-ramfs
series: I admired your resourcefulness, in getting the little files
to fit in memory; but it was not a solution I wanted to see again.

Consider copying a 2097153-byte file into such a filesystem: the first
2MB would be allocated with 4kB pages, the final byte with a 2MB page;
but it looks like I already pointed that out, and we just disagree.

This patch does not convince me at all: I expect you will come up with
some better strategy in a month or two, and I'd rather wait for that
than keep messing around with what we have.  But if you can persuade
the filesystem guys that this heuristic would be a sensible mount
option for them, then in the end I shall not want tmpfs to diverge.

> 
> > But isn't the new condition (with its ||) always weaker than the old
> > condition (with its &&)?  Whereas I thought you were trying to change
> > it to be less keen to allocate hugepages, not more.
> 
> I tried to make it less keen to allocate hugepages comparing to
> huge=always.
> 
> Current huge=within_size is fairly restrictive: we don't allocate huge
> pages to grow the file. For shmem, it means we would allocate huge pages
> if user did truncate(2) to set file size, before touching data in it
> (shared memory APIs do this). This policy would be more useful for
> filesystem with backing storage.
> 
> The patch relaxes condition: only require file size >= HPAGE_PMD_SIZE.
> 
> > What the condition ought to say, I don't know: I got too confused,
> > and depressed by my confusion, so I'm just handing it back to you.
> > 
> > And then there's the SHMEM_HUGE_WITHIN_SIZE case in shmem_huge_enabled()
> > (for khugepaged), which you have explicitly not changed in this patch:
> > looks strange to me, is it doing the right thing?
> 
> I missed that.
> 
> -----8<-----
> From b2158fdd8523e3e35a548857a1cb02fe6bcd1ea4 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCH] shmem: avoid huge pages for small files
> 
> Huge pages are detrimental for small file: they causes noticible
> overhead on both allocation performance and memory footprint.
> 
> This patch aimed to address this issue by avoiding huge pages until
> file grown to size of huge page if the filesystem mounted with
> huge=within_size option.
> 
> This would cover most of the cases where huge pages causes slowdown
> comparing to small pages.
> 
> Later we can consider huge=within_size as the default for tmpfs.

I'm sceptical of that, and I do not think this implementation will
make a sensible default.

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/transhuge.txt |  8 ++++++--
>  mm/shmem.c                     | 12 +++---------
>  2 files changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 2ec6adb5a4ce..7703e9c241ca 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -206,13 +206,17 @@ You can control hugepage allocation policy in tmpfs with mount option
>  "huge=". It can have following values:
>  
>    - "always":
> -    Attempt to allocate huge pages every time we need a new page;
> +    Attempt to allocate huge pages every time we need a new page.
> +    This option can lead to significant overhead if filesystem is used to
> +    store small files.

Good, yes, that part I fully agree with.

>  
>    - "never":
>      Do not allocate huge pages;
>  
>    - "within_size":
> -    Only allocate huge page if it will be fully within i_size.
> +    Only allocate huge page if size of the file more than size of huge
> +    page. This helps to avoid overhead for small files.
> +
>      Also respect fadvise()/madvise() hints;
>  
>    - "advise:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..ef8fdadd0626 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  			goto alloc_huge;
>  		switch (sbinfo->huge) {
>  			loff_t i_size;
> -			pgoff_t off;
>  		case SHMEM_HUGE_NEVER:
>  			goto alloc_nohuge;
>  		case SHMEM_HUGE_WITHIN_SIZE:
> -			off = round_up(index, HPAGE_PMD_NR);
> -			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -			if (i_size >= HPAGE_PMD_SIZE &&
> -					i_size >> PAGE_SHIFT >= off)

I certainly agree that the old test is obscure: I give up and cry each
time I try to work out exactly what it does.  I wanted so much to offer
a constructive alternative before responding: how about

			if (index < round_down(i_size_read(inode),
					HPAGE_PMD_SIZE) >> PAGE_SHIFT))

Of course that does not give you any huge pages while a file is being
copied in (without a preparatory ftruncate), but it seems a more
comprehensible within_size implementation to me.

> +			i_size = i_size_read(inode);
> +			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
>  				goto alloc_huge;
>  			/* fallthrough */
>  		case SHMEM_HUGE_ADVISE:
> @@ -3856,7 +3853,6 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  	struct inode *inode = file_inode(vma->vm_file);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	loff_t i_size;
> -	pgoff_t off;
>  
>  	if (shmem_huge == SHMEM_HUGE_FORCE)
>  		return true;
> @@ -3868,10 +3864,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  		case SHMEM_HUGE_ALWAYS:
>  			return true;
>  		case SHMEM_HUGE_WITHIN_SIZE:
> -			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
>  			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -			if (i_size >= HPAGE_PMD_SIZE &&
> -					i_size >> PAGE_SHIFT >= off)
> +			if (i_size >= HPAGE_PMD_SIZE)
>  				return true;

That's reasonable, given what you propose for shmem_getpage_gfp().
And given other conditions at the calling khugepaged end, it might
even be okay with my suggestion - I've not given it enough thought.
Or simply return true there, and let khugepaged work it out?
I am pretty sure the original condition was wrong.

>  		case SHMEM_HUGE_ADVISE:
>  			/* TODO: implement fadvise() hints */
> -- 
>  Kirill A. Shutemov