linux-kernel.vger.kernel.org archive mirror
* [PATCHv4] shmem: avoid huge pages for small files
@ 2016-10-21 18:51 Kirill A. Shutemov
  2016-10-21 22:46 ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-21 18:51 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Andi Kleen, Dave Chinner, Michal Hocko, linux-mm, linux-kernel,
	Kirill A. Shutemov

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address this issue by avoiding huge pages until the
file has grown to the size of a huge page. This would cover most of the
cases where huge pages cause performance regressions.

A couple of notes:

  - if shmem_enabled is set to 'force', the limit is ignored. We still
    want to generate as many pages as possible for functional testing.

  - the limit doesn't affect khugepaged behaviour: it still can collapse
    pages based on its settings;

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 3 +++
 mm/shmem.c                     | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..d1889c7c8c46 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,9 @@ values:
   - "force":
     Force the huge option on for all - very useful for testing;
 
+To avoid overhead for small files, we don't allocate huge pages for a file
+until it grows to the size of a huge page.
+
 == Need of application restart ==
 
 The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..c7b3cb5aecdc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1692,6 +1692,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 				goto alloc_huge;
 			/* TODO: implement fadvise() hints */
 			goto alloc_nohuge;
+		case SHEME_HUGE_ALWAYS:
+			i_size = i_size_read(inode);
+			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
+				goto alloc_nohuge;
+			break;
 		}
 
 alloc_huge:
-- 
2.9.3

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-10-21 18:51 [PATCHv4] shmem: avoid huge pages for small files Kirill A. Shutemov
@ 2016-10-21 22:46 ` Kirill A. Shutemov
  2016-10-24 12:43   ` Michal Hocko
  2016-11-07 23:17   ` Hugh Dickins
  0 siblings, 2 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-21 22:46 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Andi Kleen, Dave Chinner, Michal Hocko, linux-mm, linux-kernel

On Fri, Oct 21, 2016 at 09:51:03PM +0300, Kirill A. Shutemov wrote:
> +		case SHEME_HUGE_ALWAYS:

Oops. Forgot to commit the fixup :-/

From 79b0a3bf4503225d0e6ba553b8496f0c4d55514e Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCHv4] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address this issue by avoiding huge pages until the
file has grown to the size of a huge page. This would cover most of the
cases where huge pages cause performance regressions.

A couple of notes:

  - if shmem_enabled is set to 'force', the limit is ignored. We still
    want to generate as many pages as possible for functional testing.

  - the limit doesn't affect khugepaged behaviour: it still can collapse
    pages based on its settings;

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 3 +++
 mm/shmem.c                     | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..d1889c7c8c46 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,9 @@ values:
   - "force":
     Force the huge option on for all - very useful for testing;
 
+To avoid overhead for small files, we don't allocate huge pages for a file
+until it grows to the size of a huge page.
+
 == Need of application restart ==
 
 The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..49618d2d6330 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1692,6 +1692,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 				goto alloc_huge;
 			/* TODO: implement fadvise() hints */
 			goto alloc_nohuge;
+		case SHMEM_HUGE_ALWAYS:
+			i_size = i_size_read(inode);
+			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
+				goto alloc_nohuge;
+			break;
 		}
 
 alloc_huge:
-- 
 Kirill A. Shutemov

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-10-21 22:46 ` Kirill A. Shutemov
@ 2016-10-24 12:43   ` Michal Hocko
  2016-11-07 23:17   ` Hugh Dickins
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-10-24 12:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	Dave Chinner, linux-mm, linux-kernel

On Sat 22-10-16 01:46:29, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 09:51:03PM +0300, Kirill A. Shutemov wrote:
> > +		case SHEME_HUGE_ALWAYS:
> 
> Oops. Forgot to commit the fixup :-/
> 
> From 79b0a3bf4503225d0e6ba553b8496f0c4d55514e Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCHv4] shmem: avoid huge pages for small files
> 
> Huge pages are detrimental for small files: they cause noticeable
> overhead in both allocation performance and memory footprint.
> 
> This patch aims to address this issue by avoiding huge pages until the
> file has grown to the size of a huge page. This would cover most of the
> cases where huge pages cause performance regressions.
> 
> A couple of notes:
> 
>   - if shmem_enabled is set to 'force', the limit is ignored. We still
>     want to generate as many pages as possible for functional testing.
> 
>   - the limit doesn't affect khugepaged behaviour: it still can collapse
>     pages based on its settings;
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/vm/transhuge.txt | 3 +++
>  mm/shmem.c                     | 5 +++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 2ec6adb5a4ce..d1889c7c8c46 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -238,6 +238,9 @@ values:
>    - "force":
>      Force the huge option on for all - very useful for testing;
>  
> +To avoid overhead for small files, we don't allocate huge pages for a file
> +until it grows to the size of a huge page.
> +
>  == Need of application restart ==
>  
>  The transparent_hugepage/enabled values and tmpfs mount option only affect
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..49618d2d6330 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1692,6 +1692,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  				goto alloc_huge;
>  			/* TODO: implement fadvise() hints */
>  			goto alloc_nohuge;
> +		case SHMEM_HUGE_ALWAYS:
> +			i_size = i_size_read(inode);
> +			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
> +				goto alloc_nohuge;
> +			break;
>  		}
>  
>  alloc_huge:
> -- 
>  Kirill A. Shutemov

-- 
Michal Hocko
SUSE Labs

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-10-21 22:46 ` Kirill A. Shutemov
  2016-10-24 12:43   ` Michal Hocko
@ 2016-11-07 23:17   ` Hugh Dickins
  2016-11-10 16:25     ` Kirill A. Shutemov
  1 sibling, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2016-11-07 23:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	Dave Chinner, Michal Hocko, linux-mm, linux-kernel

On Sat, 22 Oct 2016, Kirill A. Shutemov wrote:
> 
> Huge pages are detrimental for small files: they cause noticeable
> overhead in both allocation performance and memory footprint.
> 
> This patch aims to address this issue by avoiding huge pages until the
> file has grown to the size of a huge page. This would cover most of the
> cases where huge pages cause performance regressions.
> 
> A couple of notes:
> 
>   - if shmem_enabled is set to 'force', the limit is ignored. We still
>     want to generate as many pages as possible for functional testing.
> 
>   - the limit doesn't affect khugepaged behaviour: it still can collapse
>     pages based on its settings;
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Sorry, but NAK.  I was expecting a patch to tune within_size behaviour.

> ---
>  Documentation/vm/transhuge.txt | 3 +++
>  mm/shmem.c                     | 5 +++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 2ec6adb5a4ce..d1889c7c8c46 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -238,6 +238,9 @@ values:
>    - "force":
>      Force the huge option on for all - very useful for testing;
>  
> +To avoid overhead for small files, we don't allocate huge pages for a file
> +until it grows to the size of a huge page.
> +
>  == Need of application restart ==
>  
>  The transparent_hugepage/enabled values and tmpfs mount option only affect
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..49618d2d6330 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1692,6 +1692,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  				goto alloc_huge;
>  			/* TODO: implement fadvise() hints */
>  			goto alloc_nohuge;
> +		case SHMEM_HUGE_ALWAYS:
> +			i_size = i_size_read(inode);
> +			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
> +				goto alloc_nohuge;
> +			break;
>  		}
>  
>  alloc_huge:

So (eliding the SHMEM_HUGE_ADVISE case in between) you now have:

		case SHMEM_HUGE_WITHIN_SIZE:
			off = round_up(index, HPAGE_PMD_NR);
			i_size = round_up(i_size_read(inode), PAGE_SIZE);
			if (i_size >= HPAGE_PMD_SIZE &&
					i_size >> PAGE_SHIFT >= off)
				goto alloc_huge;
			goto alloc_nohuge;
		case SHMEM_HUGE_ALWAYS:
			i_size = i_size_read(inode);
			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
				goto alloc_nohuge;
			goto alloc_huge;

I'll concede that those two conditions are not the same; but again you're
messing with huge=always to make it, not always, but conditional on size.

Please, keep huge=always as is: if I copy a 4MiB file into a huge tmpfs,
I got ShmemHugePages 4096 kB before, which is what I wanted.  Whereas
with this change I get only 2048 kB, just like with huge=within_size.
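
(To spell out the arithmetic with 4kB pages: while the copy fills the
first 512 page indexes, i_size is still below HPAGE_PMD_SIZE and index
is below HPAGE_PMD_NR, so each of those pages is allocated small; only
from index 512 on does the small-file test stop matching, leaving a
single 2MB huge page for the second half.)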

Treating the first extent differently is a hack, and does not respect
that this is a filesystem, on which size is likely to increase.

By all means refine the condition for huge=within_size, and by all means
warn in transhuge.txt that huge=always may tend to waste valuable huge
pages if the filesystem is used for small files without good reason
(but maybe the implementation needs to reclaim those more effectively).

Hugh

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-11-07 23:17   ` Hugh Dickins
@ 2016-11-10 16:25     ` Kirill A. Shutemov
  2016-11-10 17:42       ` [PATCH] " kbuild test robot
  2016-11-11 21:41       ` [PATCHv4] " Hugh Dickins
  0 siblings, 2 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-11-10 16:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	Dave Chinner, Michal Hocko, linux-mm, linux-kernel

On Mon, Nov 07, 2016 at 03:17:11PM -0800, Hugh Dickins wrote:
> On Sat, 22 Oct 2016, Kirill A. Shutemov wrote:
> > 
> > Huge pages are detrimental for small files: they cause noticeable
> > overhead in both allocation performance and memory footprint.
> > 
> > This patch aims to address this issue by avoiding huge pages until the
> > file has grown to the size of a huge page. This would cover most of the
> > cases where huge pages cause performance regressions.
> > 
> > A couple of notes:
> > 
> >   - if shmem_enabled is set to 'force', the limit is ignored. We still
> >     want to generate as many pages as possible for functional testing.
> > 
> >   - the limit doesn't affect khugepaged behaviour: it still can collapse
> >     pages based on its settings;
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Sorry, but NAK.  I was expecting a patch to tune within_size behaviour.
> 
> > ---
> >  Documentation/vm/transhuge.txt | 3 +++
> >  mm/shmem.c                     | 5 +++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> > index 2ec6adb5a4ce..d1889c7c8c46 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -238,6 +238,9 @@ values:
> >    - "force":
> >      Force the huge option on for all - very useful for testing;
> >  
> > +To avoid overhead for small files, we don't allocate huge pages for a file
> > +until it grows to the size of a huge page.
> > +
> >  == Need of application restart ==
> >  
> >  The transparent_hugepage/enabled values and tmpfs mount option only affect
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index ad7813d73ea7..49618d2d6330 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1692,6 +1692,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  				goto alloc_huge;
> >  			/* TODO: implement fadvise() hints */
> >  			goto alloc_nohuge;
> > +		case SHMEM_HUGE_ALWAYS:
> > +			i_size = i_size_read(inode);
> > +			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
> > +				goto alloc_nohuge;
> > +			break;
> >  		}
> >  
> >  alloc_huge:
> 
> So (eliding the SHMEM_HUGE_ADVISE case in between) you now have:
> 
> 		case SHMEM_HUGE_WITHIN_SIZE:
> 			off = round_up(index, HPAGE_PMD_NR);
> 			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> 			if (i_size >= HPAGE_PMD_SIZE &&
> 					i_size >> PAGE_SHIFT >= off)
> 				goto alloc_huge;
> 			goto alloc_nohuge;
> 		case SHMEM_HUGE_ALWAYS:
> 			i_size = i_size_read(inode);
> 			if (index < HPAGE_PMD_NR && i_size < HPAGE_PMD_SIZE)
> 				goto alloc_nohuge;
> 			goto alloc_huge;
> 
> I'll concede that those two conditions are not the same; but again you're
> messing with huge=always to make it, not always, but conditional on size.
> 
> Please, keep huge=always as is: if I copy a 4MiB file into a huge tmpfs,
> I got ShmemHugePages 4096 kB before, which is what I wanted.  Whereas
> with this change I get only 2048 kB, just like with huge=within_size.

I don't think it's a problem really. We don't have guarantees anyway.
And we can collapse the page later.

But okay.

> Treating the first extent differently is a hack, and does not respect
> that this is a filesystem, on which size is likely to increase.
> 
> By all means refine the condition for huge=within_size, and by all means
> warn in transhuge.txt that huge=always may tend to waste valuable huge
> pages if the filesystem is used for small files without good reason

Would it be okay if I just replaced the huge=within_size logic with what
I proposed here for huge=always?

That's not what I intended initially for this option, but...

> (but maybe the implementation needs to reclaim those more effectively).

It's more about cost of allocation than memory pressure.

-----8<-----

From 287ab05c09bfd49c7356ca74b6fea36d8131edaf Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address this issue by avoiding huge pages until the
file has grown to the size of a huge page, if the filesystem is mounted
with the huge=within_size option.

This would cover most of the cases where huge pages cause performance
regressions.

The limit doesn't affect khugepaged behaviour: it still can collapse
pages based on its settings.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 7 ++++++-
 mm/shmem.c                     | 6 ++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..14c911c56f4a 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
   - "always":
     Attempt to allocate huge pages every time we need a new page;
 
+    This option can lead to significant overhead if the filesystem is used
+    to store small files.
+
   - "never":
     Do not allocate huge pages;
 
   - "within_size":
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate a huge page if the size of the file is at least the size
+    of a huge page. This helps to avoid overhead for small files.
+
     Also respect fadvise()/madvise() hints;
 
   - "advise:
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..3589d36c7c63 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1681,10 +1681,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		case SHMEM_HUGE_NEVER:
 			goto alloc_nohuge;
 		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(index, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
+			i_size = i_size_read(inode);
+			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
 				goto alloc_huge;
 			/* fallthrough */
 		case SHMEM_HUGE_ADVISE:
-- 
 Kirill A. Shutemov

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-11-10 16:25     ` Kirill A. Shutemov
@ 2016-11-10 17:42       ` kbuild test robot
  2016-11-10 17:51         ` Kirill A. Shutemov
  2016-11-11 21:41       ` [PATCHv4] " Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: kbuild test robot @ 2016-11-10 17:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: kbuild-all, Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, Dave Chinner, Michal Hocko, linux-mm,
	linux-kernel

Hi Kirill,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.9-rc4 next-20161110]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/20161111-005428
config: i386-randconfig-s0-201645 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   mm/shmem.c: In function 'shmem_getpage_gfp':
>> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
       pgoff_t off;
               ^~~

vim +/off +1680 mm/shmem.c

66d2f4d2 Hugh Dickins       2014-07-02  1664  			mark_page_accessed(page);
66d2f4d2 Hugh Dickins       2014-07-02  1665  
54af6042 Hugh Dickins       2011-08-03  1666  		delete_from_swap_cache(page);
27ab7006 Hugh Dickins       2011-07-25  1667  		set_page_dirty(page);
27ab7006 Hugh Dickins       2011-07-25  1668  		swap_free(swap);
27ab7006 Hugh Dickins       2011-07-25  1669  
54af6042 Hugh Dickins       2011-08-03  1670  	} else {
800d8c63 Kirill A. Shutemov 2016-07-26  1671  		/* shmem_symlink() */
800d8c63 Kirill A. Shutemov 2016-07-26  1672  		if (mapping->a_ops != &shmem_aops)
800d8c63 Kirill A. Shutemov 2016-07-26  1673  			goto alloc_nohuge;
657e3038 Kirill A. Shutemov 2016-07-26  1674  		if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
800d8c63 Kirill A. Shutemov 2016-07-26  1675  			goto alloc_nohuge;
800d8c63 Kirill A. Shutemov 2016-07-26  1676  		if (shmem_huge == SHMEM_HUGE_FORCE)
800d8c63 Kirill A. Shutemov 2016-07-26  1677  			goto alloc_huge;
800d8c63 Kirill A. Shutemov 2016-07-26  1678  		switch (sbinfo->huge) {
800d8c63 Kirill A. Shutemov 2016-07-26  1679  			loff_t i_size;
800d8c63 Kirill A. Shutemov 2016-07-26 @1680  			pgoff_t off;
800d8c63 Kirill A. Shutemov 2016-07-26  1681  		case SHMEM_HUGE_NEVER:
800d8c63 Kirill A. Shutemov 2016-07-26  1682  			goto alloc_nohuge;
800d8c63 Kirill A. Shutemov 2016-07-26  1683  		case SHMEM_HUGE_WITHIN_SIZE:
bb89f249 Kirill A. Shutemov 2016-11-10  1684  			i_size = i_size_read(inode);
bb89f249 Kirill A. Shutemov 2016-11-10  1685  			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
800d8c63 Kirill A. Shutemov 2016-07-26  1686  				goto alloc_huge;
800d8c63 Kirill A. Shutemov 2016-07-26  1687  			/* fallthrough */
800d8c63 Kirill A. Shutemov 2016-07-26  1688  		case SHMEM_HUGE_ADVISE:

:::::: The code at line 1680 was first introduced by commit
:::::: 800d8c63b2e989c2e349632d1648119bf5862f01 shmem: add huge pages support

:::::: TO: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-11-10 17:42       ` [PATCH] " kbuild test robot
@ 2016-11-10 17:51         ` Kirill A. Shutemov
  0 siblings, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-11-10 17:51 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Kirill A. Shutemov, kbuild-all, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, Dave Chinner, Michal Hocko, linux-mm,
	linux-kernel

On Fri, Nov 11, 2016 at 01:42:47AM +0800, kbuild test robot wrote:
> Hi Kirill,
> 
> [auto build test WARNING on linus/master]
> [also build test WARNING on v4.9-rc4 next-20161110]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Kirill-A-Shutemov/shmem-avoid-huge-pages-for-small-files/20161111-005428
> config: i386-randconfig-s0-201645 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All warnings (new ones prefixed by >>):
> 
>    mm/shmem.c: In function 'shmem_getpage_gfp':
> >> mm/shmem.c:1680:12: warning: unused variable 'off' [-Wunused-variable]
>        pgoff_t off;


From f0a582888ac6dcb56c6134611c83edfb091bbcb6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address this issue by avoiding huge pages until the
file has grown to the size of a huge page, if the filesystem is mounted
with the huge=within_size option.

This would cover most of the cases where huge pages cause performance
regressions.

The limit doesn't affect khugepaged behaviour: it still can collapse
pages based on its settings.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 7 ++++++-
 mm/shmem.c                     | 7 ++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..14c911c56f4a 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
   - "always":
     Attempt to allocate huge pages every time we need a new page;
 
+    This option can lead to significant overhead if the filesystem is used
+    to store small files.
+
   - "never":
     Do not allocate huge pages;
 
   - "within_size":
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate a huge page if the size of the file is at least the size
+    of a huge page. This helps to avoid overhead for small files.
+
     Also respect fadvise()/madvise() hints;
 
   - "advise:
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..3e2c0912c587 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			goto alloc_huge;
 		switch (sbinfo->huge) {
 			loff_t i_size;
-			pgoff_t off;
 		case SHMEM_HUGE_NEVER:
 			goto alloc_nohuge;
 		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(index, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
+			i_size = i_size_read(inode);
+			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
 				goto alloc_huge;
 			/* fallthrough */
 		case SHMEM_HUGE_ADVISE:
-- 
2.9.3

-- 
 Kirill A. Shutemov

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-11-10 16:25     ` Kirill A. Shutemov
  2016-11-10 17:42       ` [PATCH] " kbuild test robot
@ 2016-11-11 21:41       ` Hugh Dickins
  2016-11-14 14:09         ` Kirill A. Shutemov
  1 sibling, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2016-11-11 21:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, Dave Chinner, Michal Hocko, linux-mm,
	linux-kernel

On Thu, 10 Nov 2016, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2016 at 03:17:11PM -0800, Hugh Dickins wrote:
> 
> > Treating the first extent differently is a hack, and does not respect
> > that this is a filesystem, on which size is likely to increase.
> > 
> > By all means refine the condition for huge=within_size, and by all means
> > warn in transhuge.txt that huge=always may tend to waste valuable huge
> > pages if the filesystem is used for small files without good reason
> 
> Would it be okay if I just replaced the huge=within_size logic with
> what I proposed here for huge=always?

In principle yes, that would be fine with me: I just don't care very
much about this option, since we do not force "huge=always" on anyone,
so everyone is free to use it where it's useful, and not where it's not.

But perhaps your aim is to have "huge=within_size" set by default on /tmp,
and so not behave badly there: I'd never aimed for that, and I'm a bit
sceptical about it, but if you can get good enough behaviour out of it
for that, I won't stand in your way.

> 
> That's not what I intended initially for this option, but...
> 
> > (but maybe the implementation needs to reclaim those more effectively).
> 
> It's more about cost of allocation than memory pressure.

Regarding that issue, I think you should reconsider the GFP flags used
in shmem_alloc_hugepage().  GFP flags, and compaction latency avoidance,
have been moving targets over the last year, and I've not rechecked;
but I got the impression that your GFP flags are still asking for the
compaction stalls that are now deprecated on the anon THP fault path?
I repeat, I've not rechecked that before writing, maybe it's a libel!

> 
> -----8<-----
> 
> From 287ab05c09bfd49c7356ca74b6fea36d8131edaf Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCH] shmem: avoid huge pages for small files
> 
> Huge pages are detrimental for small files: they cause noticeable
> overhead in both allocation performance and memory footprint.
> 
> This patch aims to address this issue by avoiding huge pages until the
> file has grown to the size of a huge page, if the filesystem is mounted
> with the huge=within_size option.
> 
> This would cover most of the cases where huge pages cause performance
> regressions.

It's not a regression if "huge=always" is worse than "huge=never" in
some cases: just cases where it's better not to mount "huge=always".

> 
> The limit doesn't affect khugepaged behaviour: it still can collapse
> pages based on its settings.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/transhuge.txt | 7 ++++++-
>  mm/shmem.c                     | 6 ++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 2ec6adb5a4ce..14c911c56f4a 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
>    - "always":
>      Attempt to allocate huge pages every time we need a new page;
>  

Nit: please change the semi-colon to full-stop, and delete the blank line.

> +    This option can lead to significant overhead if the filesystem is used
> +    to store small files.
> +
>    - "never":
>      Do not allocate huge pages;
>  
>    - "within_size":
> -    Only allocate huge page if it will be fully within i_size.
> +    Only allocate a huge page if the size of the file is at least the size
> +    of a huge page. This helps to avoid overhead for small files.
> +
>      Also respect fadvise()/madvise() hints;
>  
>    - "advise:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..3589d36c7c63 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1681,10 +1681,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  		case SHMEM_HUGE_NEVER:
>  			goto alloc_nohuge;
>  		case SHMEM_HUGE_WITHIN_SIZE:
> -			off = round_up(index, HPAGE_PMD_NR);
> -			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -			if (i_size >= HPAGE_PMD_SIZE &&
> -					i_size >> PAGE_SHIFT >= off)
> +			i_size = i_size_read(inode);
> +			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
>  				goto alloc_huge;

I said fine in principle above, but when I look at this, I'm puzzled.

Certainly the new condition is easier to understand than the old condition:
which is a plus, even though it's hackish (I do dislike hobbling the first
extent, when it's an incomplete last extent which deserves to be hobbled -
easier said than implemented of course).

But isn't the new condition (with its ||) always weaker than the old
condition (with its &&)?  Whereas I thought you were trying to change
it to be less keen to allocate hugepages, not more.

What the condition ought to say, I don't know: I got too confused,
and depressed by my confusion, so I'm just handing it back to you.

And then there's the SHMEM_HUGE_WITHIN_SIZE case in shmem_huge_enabled()
(for khugepaged), which you have explicitly not changed in this patch:
looks strange to me, is it doing the right thing?

>  			/* fallthrough */
>  		case SHMEM_HUGE_ADVISE:
> -- 
>  Kirill A. Shutemov

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-11-11 21:41       ` [PATCHv4] " Hugh Dickins
@ 2016-11-14 14:09         ` Kirill A. Shutemov
  2016-11-29  3:56           ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-11-14 14:09 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	Dave Chinner, Michal Hocko, linux-mm, linux-kernel

On Fri, Nov 11, 2016 at 01:41:11PM -0800, Hugh Dickins wrote:
> On Thu, 10 Nov 2016, Kirill A. Shutemov wrote:
> > On Mon, Nov 07, 2016 at 03:17:11PM -0800, Hugh Dickins wrote:
> > 
> > > Treating the first extent differently is a hack, and does not respect
> > > that this is a filesystem, on which size is likely to increase.
> > > 
> > > By all means refine the condition for huge=within_size, and by all means
> > > warn in transhuge.txt that huge=always may tend to waste valuable huge
> > > pages if the filesystem is used for small files without good reason
> > 
> > Would it be okay if I just replaced the huge=within_size logic with
> > what I proposed here for huge=always?
> 
> In principle yes, that would be fine with me: I just don't care very
> much about this option, since we do not force "huge=always" on anyone,
> so everyone is free to use it where it's useful, and not where it's not.
> 
> But perhaps your aim is to have "huge=within_size" set by default on /tmp,
> and so not behave badly there: I'd never aimed for that, and I'm a bit
> sceptical about it, but if you can get good enough behaviour out of it
> for that, I won't stand in your way.

Yeah, I would like to one day add a compile-time option to choose the
default huge= allocation policy.

> > That's not what I intended initially for this option, but...
> > 
> > > (but maybe the implementation needs to reclaim those more effectively).
> > 
> > It's more about cost of allocation than memory pressure.
> 
> Regarding that issue, I think you should reconsider the GFP flags used
> in shmem_alloc_hugepage().  GFP flags, and compaction latency avoidance,
> have been moving targets over the last year, and I've not rechecked;
> but I got the impression that your GFP flags are still asking for the
> compaction stalls that are now deprecated on the anon THP fault path?
> I repeat, I've not rechecked that before writing, maybe it's a libel!

Looks like you're right: we should clear __GFP_KSWAPD_RECLAIM from the
gfp flags when allocating from the fault path.

Anon-THP also takes VM_HUGEPAGE into account when choosing gfp flags.
It's not easy to get this info into shmem_alloc_hugepage()...
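
Something like this, I guess (an untested sketch; it assumes a
fault_path flag we could check inside shmem_alloc_hugepage(), which
isn't plumbed through today):

	/* sketch only: skip the kswapd wakeup for a speculative huge page
	 * allocated from the fault path, so a failed attempt stays cheap */
	if (fault_path)
		huge_gfp = gfp & ~__GFP_KSWAPD_RECLAIM;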

> > -----8<-----
> > 
> > From 287ab05c09bfd49c7356ca74b6fea36d8131edaf Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Mon, 17 Oct 2016 14:44:47 +0300
> > Subject: [PATCH] shmem: avoid huge pages for small files
> > 
> > Huge pages are detrimental for small files: they cause noticeable
> > overhead in both allocation performance and memory footprint.
> > 
> > This patch aims to address this issue by avoiding huge pages until the
> > file has grown to the size of a huge page, if the filesystem is mounted
> > with the huge=within_size option.
> > 
> > This would cover most of the cases where huge pages cause performance
> > regressions.
> 
> It's not a regression if "huge=always" is worse than "huge=never" in
> some cases: just cases where it's better not to mount "huge=always".

I'm not sure what wording would be better. I mean slower compared to
small pages.

> > The limit doesn't affect khugepaged behaviour: it still can collapse
> > pages based on its settings.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  Documentation/vm/transhuge.txt | 7 ++++++-
> >  mm/shmem.c                     | 6 ++----
> >  2 files changed, 8 insertions(+), 5 deletions(-)
> > 
> > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> > index 2ec6adb5a4ce..14c911c56f4a 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -208,11 +208,16 @@ You can control hugepage allocation policy in tmpfs with mount option
> >    - "always":
> >      Attempt to allocate huge pages every time we need a new page;
> >  
> 
> Nit: please change the semi-colon to full-stop, and delete the blank line.

Okay.

> 
> > +    This option can lead to significant overhead if the filesystem is used
> > +    to store small files.
> > +
> >    - "never":
> >      Do not allocate huge pages;
> >  
> >    - "within_size":
> > -    Only allocate huge page if it will be fully within i_size.
> > +    Only allocate a huge page if the size of the file is at least the size
> > +    of a huge page. This helps to avoid overhead for small files.
> > +
> >      Also respect fadvise()/madvise() hints;
> >  
> >    - "advise:
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index ad7813d73ea7..3589d36c7c63 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1681,10 +1681,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  		case SHMEM_HUGE_NEVER:
> >  			goto alloc_nohuge;
> >  		case SHMEM_HUGE_WITHIN_SIZE:
> > -			off = round_up(index, HPAGE_PMD_NR);
> > -			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> > -			if (i_size >= HPAGE_PMD_SIZE &&
> > -					i_size >> PAGE_SHIFT >= off)
> > +			i_size = i_size_read(inode);
> > +			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
> >  				goto alloc_huge;
> 
> I said fine in principle above, but when I look at this, I'm puzzled.
> 
> Certainly the new condition is easier to understand than the old condition:
> which is a plus, even though it's hackish (I do dislike hobbling the first
> extent, when it's an incomplete last extent which deserves to be hobbled -
> easier said than implemented of course).

Well, it's just a heuristic that I found useful. I don't see a reason to
make it more complex if it works.

> But isn't the new condition (with its ||) always weaker than the old
> condition (with its &&)?  Whereas I thought you were trying to change
> it to be less keen to allocate hugepages, not more.

I tried to make it less keen to allocate hugepages compared to
huge=always.

The current huge=within_size is fairly restrictive: we don't allocate
huge pages to grow the file. For shmem, it means we would only allocate
huge pages if the user did truncate(2) to set the file size before
touching the data in it (shared memory APIs do this). This policy would
be more useful for a filesystem with backing storage.
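
For example (illustrative userspace fragment; needs <sys/mman.h>,
<fcntl.h>, <unistd.h>, error handling omitted):

	/* shared memory APIs typically set the size up front, so the old
	 * within_size test sees a full-size file from the first fault */
	int fd = shm_open("/example", O_CREAT | O_RDWR, 0600);
	ftruncate(fd, 4UL << 20);	/* i_size = 4MB before any fault */
	void *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);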

The patch relaxes the condition: it only requires file size >= HPAGE_PMD_SIZE.
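
Spelled out as a helper, the relaxed test is just (helper name made up,
restating the patch for clarity):

	/* allocate huge if the fault is beyond the first huge page, or if
	 * the file size has already reached huge page size */
	static inline bool within_size_wants_huge(pgoff_t index, loff_t i_size)
	{
		return index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE;
	}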

> What the condition ought to say, I don't know: I got too confused,
> and depressed by my confusion, so I'm just handing it back to you.
> 
> And then there's the SHMEM_HUGE_WITHIN_SIZE case in shmem_huge_enabled()
> (for khugepaged), which you have explicitly not changed in this patch:
> looks strange to me, is it doing the right thing?

I missed that.

-----8<-----
From b2158fdd8523e3e35a548857a1cb02fe6bcd1ea4 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead in both allocation performance and memory footprint.

This patch aims to address this issue by avoiding huge pages until the
file has grown to the size of a huge page, if the filesystem is mounted
with the huge=within_size option.

This would cover most of the cases where huge pages cause slowdowns
compared to small pages.

Later we can consider huge=within_size as the default for tmpfs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |  8 ++++++--
 mm/shmem.c                     | 12 +++---------
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..7703e9c241ca 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -206,13 +206,17 @@ You can control hugepage allocation policy in tmpfs with mount option
 "huge=". It can have following values:
 
   - "always":
-    Attempt to allocate huge pages every time we need a new page;
+    Attempt to allocate huge pages every time we need a new page.
+    This option can lead to significant overhead if the filesystem is used
+    to store small files.
 
   - "never":
     Do not allocate huge pages;
 
   - "within_size":
-    Only allocate huge page if it will be fully within i_size.
+    Only allocate a huge page if the size of the file is at least the size
+    of a huge page. This helps to avoid overhead for small files.
+
     Also respect fadvise()/madvise() hints;
 
   - "advise:
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..ef8fdadd0626 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			goto alloc_huge;
 		switch (sbinfo->huge) {
 			loff_t i_size;
-			pgoff_t off;
 		case SHMEM_HUGE_NEVER:
 			goto alloc_nohuge;
 		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(index, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
+			i_size = i_size_read(inode);
+			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
 				goto alloc_huge;
 			/* fallthrough */
 		case SHMEM_HUGE_ADVISE:
@@ -3856,7 +3853,6 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
 	struct inode *inode = file_inode(vma->vm_file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	loff_t i_size;
-	pgoff_t off;
 
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
@@ -3868,10 +3864,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
 		case SHMEM_HUGE_ALWAYS:
 			return true;
 		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
 			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
+			if (i_size >= HPAGE_PMD_SIZE)
 				return true;
 		case SHMEM_HUGE_ADVISE:
 			/* TODO: implement fadvise() hints */
-- 
 Kirill A. Shutemov

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-11-14 14:09         ` Kirill A. Shutemov
@ 2016-11-29  3:56           ` Hugh Dickins
  2016-11-29 11:11             ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2016-11-29  3:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, Dave Chinner, Michal Hocko, linux-mm,
	linux-kernel

On Mon, 14 Nov 2016, Kirill A. Shutemov wrote:
> On Fri, Nov 11, 2016 at 01:41:11PM -0800, Hugh Dickins wrote:
> > 
> > Certainly the new condition is easier to understand than the old condition:
> > which is a plus, even though it's hackish (I do dislike hobbling the first
> > extent, when it's an incomplete last extent which deserves to be hobbled -
> > easier said than implemented of course).
> 
> > Well, it's just a heuristic that I found useful. I don't see a reason
> > to make it more complex if it works.

You like it because it allocates huge pages to some extents,
but not to all extents.  I dislike it because it allocates
huge pages to the wrong extents.

You did much the same three or four years ago, in your THP-on-ramfs
series: I admired your resourcefulness, in getting the little files
to fit in memory; but it was not a solution I wanted to see again.

Consider copying a 2097153-byte file into such a filesystem: the first
2MB would be allocated with 4kB pages, the final byte with a 2MB page;
but it looks like I already pointed that out, and we just disagree.

This patch does not convince me at all: I expect you will come up with
some better strategy in a month or two, and I'd rather wait for that
than keep messing around with what we have.  But if you can persuade
the filesystem guys that this heuristic would be a sensible mount
option for them, then in the end I shall not want tmpfs to diverge.

> 
> > But isn't the new condition (with its ||) always weaker than the old
> > condition (with its &&)?  Whereas I thought you were trying to change
> > it to be less keen to allocate hugepages, not more.
> 
> > I tried to make it less keen to allocate hugepages compared to
> > huge=always.
> 
> > The current huge=within_size is fairly restrictive: we don't allocate
> > huge pages to grow the file. For shmem, it means we would only allocate
> > huge pages if the user did truncate(2) to set the file size before
> > touching the data in it (shared memory APIs do this). This policy would
> > be more useful for a filesystem with backing storage.
> 
> > The patch relaxes the condition: it only requires file size >= HPAGE_PMD_SIZE.
> 
> > What the condition ought to say, I don't know: I got too confused,
> > and depressed by my confusion, so I'm just handing it back to you.
> > 
> > And then there's the SHMEM_HUGE_WITHIN_SIZE case in shmem_huge_enabled()
> > (for khugepaged), which you have explicitly not changed in this patch:
> > looks strange to me, is it doing the right thing?
> 
> I missed that.
> 
> -----8<-----
> From b2158fdd8523e3e35a548857a1cb02fe6bcd1ea4 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCH] shmem: avoid huge pages for small files
> 
> Huge pages are detrimental for small files: they cause noticeable
> overhead in both allocation performance and memory footprint.
> 
> This patch aims to address this issue by avoiding huge pages until the
> file has grown to the size of a huge page, if the filesystem is mounted
> with the huge=within_size option.
> 
> This would cover most of the cases where huge pages cause slowdowns
> compared to small pages.
> 
> Later we can consider huge=within_size as the default for tmpfs.

I'm sceptical of that, and I do not think this implementation will
make a sensible default.

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/transhuge.txt |  8 ++++++--
>  mm/shmem.c                     | 12 +++---------
>  2 files changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 2ec6adb5a4ce..7703e9c241ca 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -206,13 +206,17 @@ You can control hugepage allocation policy in tmpfs with mount option
>  "huge=". It can have following values:
>  
>    - "always":
> -    Attempt to allocate huge pages every time we need a new page;
> +    Attempt to allocate huge pages every time we need a new page.
> +    This option can lead to significant overhead if the filesystem is used
> +    to store small files.

Good, yes, that part I fully agree with.

>  
>    - "never":
>      Do not allocate huge pages;
>  
>    - "within_size":
> -    Only allocate huge page if it will be fully within i_size.
> +    Only allocate a huge page if the size of the file is at least the size
> +    of a huge page. This helps to avoid overhead for small files.
> +
>      Also respect fadvise()/madvise() hints;
>  
>    - "advise:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..ef8fdadd0626 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  			goto alloc_huge;
>  		switch (sbinfo->huge) {
>  			loff_t i_size;
> -			pgoff_t off;
>  		case SHMEM_HUGE_NEVER:
>  			goto alloc_nohuge;
>  		case SHMEM_HUGE_WITHIN_SIZE:
> -			off = round_up(index, HPAGE_PMD_NR);
> -			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -			if (i_size >= HPAGE_PMD_SIZE &&
> -					i_size >> PAGE_SHIFT >= off)

I certainly agree that the old test is obscure: I give up and cry each
time I try to work out exactly what it does.  I wanted so much to offer
a constructive alternative before responding: how about

			if (index < round_down(i_size_read(inode),
					HPAGE_PMD_SIZE) >> PAGE_SHIFT)

Of course that does not give you any huge pages while a file is being
copied in (without a preparatory ftruncate), but it seems a more
comprehensible within_size implementation to me.
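
Worked through for illustration, with 4kB pages and HPAGE_PMD_SIZE = 2MB
(so HPAGE_PMD_NR = 512):

	/*
	 * i_size = 4MB, index = 0:
	 *	round_down(4MB, 2MB) >> PAGE_SHIFT = 1024; 0 < 1024 -> huge
	 * i_size = 3MB, index = 512:
	 *	round_down(3MB, 2MB) >> PAGE_SHIFT = 512; 512 < 512 fails -> nohuge
	 * i.e. only extents lying entirely below i_size get huge pages.
	 */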

> +			i_size = i_size_read(inode);
> +			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
>  				goto alloc_huge;
>  			/* fallthrough */
>  		case SHMEM_HUGE_ADVISE:
> @@ -3856,7 +3853,6 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  	struct inode *inode = file_inode(vma->vm_file);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  	loff_t i_size;
> -	pgoff_t off;
>  
>  	if (shmem_huge == SHMEM_HUGE_FORCE)
>  		return true;
> @@ -3868,10 +3864,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  		case SHMEM_HUGE_ALWAYS:
>  			return true;
>  		case SHMEM_HUGE_WITHIN_SIZE:
> -			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
>  			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -			if (i_size >= HPAGE_PMD_SIZE &&
> -					i_size >> PAGE_SHIFT >= off)
> +			if (i_size >= HPAGE_PMD_SIZE)
>  				return true;

That's reasonable, given what you propose for shmem_getpage_gfp().
And given other conditions at the calling khugepaged end, it might
even be okay with my suggestion - I've not given it enough thought.
Or simply return true there, and let khugepaged work it out?
I am pretty sure the original condition was wrong.

>  		case SHMEM_HUGE_ADVISE:
>  			/* TODO: implement fadvise() hints */
> -- 
>  Kirill A. Shutemov

* Re: [PATCHv4] shmem: avoid huge pages for small files
  2016-11-29  3:56           ` Hugh Dickins
@ 2016-11-29 11:11             ` Kirill A. Shutemov
  0 siblings, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-11-29 11:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	Dave Chinner, Michal Hocko, linux-mm, linux-kernel

On Mon, Nov 28, 2016 at 07:56:48PM -0800, Hugh Dickins wrote:
> On Mon, 14 Nov 2016, Kirill A. Shutemov wrote:
> > On Fri, Nov 11, 2016 at 01:41:11PM -0800, Hugh Dickins wrote:
> > > 
> > > Certainly the new condition is easier to understand than the old condition:
> > > which is a plus, even though it's hackish (I do dislike hobbling the first
> > > extent, when it's an incomplete last extent which deserves to be hobbled -
> > > easier said than implemented of course).
> > 
> > Well, it's just heuristic that I found useful. I don't see a reason to
> > make more complex if it works.
> 
> You like it because it allocates huge pages to some extents,
> but not to all extents.  I dislike it because it allocates
> huge pages to the wrong extents.
> 
> You did much the same three or four years ago, in your THP-on-ramfs
> series: I admired your resourcefulness, in getting the little files
> to fit in memory; but it was not a solution I wanted to see again.
> 
> Consider copying a 2097153-byte file into such a filesystem: the first
> 2MB would be allocated with 4kB pages, the final byte with a 2MB page;
> but it looks like I already pointed that out, and we just disagree.

I agree with you that's not elegant. But it works.

Small files tend to be well within the size of an extent. And they
contribute the most to the overhead just because they are small and you
can fit a lot of them onto a filesystem of a given size.

And what you've described is the worst case. There are not that many files
on the border of one extent.

Usually, files that benefit the most from huge pages are at least several
extents in size -- media, databases, etc. And the ratio of huge pages
allocated to the wrong extents diminishes as the file grows.

Let's agree to disagree. (Which means I lose, as you're the maintainer.) :)

> 
> This patch does not convince me at all: I expect you will come up with
> some better strategy in a month or two, and I'd rather wait for that
> than keep messing around with what we have.  But if you can persuade
> the filesystem guys that this heuristic would be a sensible mount
> option for them, then in the end I shall not want tmpfs to diverge.

For a filesystem with backing storage, I think the old heuristic for
huge=within_size is more appropriate, as we don't start with an empty
filesystem every time.

> > > But isn't the new condition (with its ||) always weaker than the old
> > > condition (with its &&)?  Whereas I thought you were trying to change
> > > it to be less keen to allocate hugepages, not more.
> > 
> > I tried to make it less keen to allocate hugepages comparing to
> > huge=always.
> > 
> > Current huge=within_size is fairly restrictive: we don't allocate huge
> > pages to grow the file. For shmem, it means we would allocate huge pages
> > if user did truncate(2) to set file size, before touching data in it
> > (shared memory APIs do this). This policy would be more useful for
> > filesystem with backing storage.
> > 
> > The patch relaxes condition: only require file size >= HPAGE_PMD_SIZE.
> > 
> > > What the condition ought to say, I don't know: I got too confused,
> > > and depressed by my confusion, so I'm just handing it back to you.
> > > 
> > > And then there's the SHMEM_HUGE_WITHIN_SIZE case in shmem_huge_enabled()
> > > (for khugepaged), which you have explicitly not changed in this patch:
> > > looks strange to me, is it doing the right thing?
> > 
> > I missed that.
> > 
> > -----8<-----
> > From b2158fdd8523e3e35a548857a1cb02fe6bcd1ea4 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Mon, 17 Oct 2016 14:44:47 +0300
> > Subject: [PATCH] shmem: avoid huge pages for small files
> > 
> > Huge pages are detrimental for small files: they cause noticeable
> > overhead in both allocation performance and memory footprint.
> > 
> > This patch aims to address this issue by avoiding huge pages until the
> > file has grown to the size of a huge page, if the filesystem is mounted
> > with the huge=within_size option.
> > 
> > This would cover most of the cases where huge pages cause slowdowns
> > compared to small pages.
> > 
> > Later we can consider huge=within_size as the default for tmpfs.
> 
> I'm sceptical of that, and I do not think this implementation will
> make a sensible default.
> 
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  Documentation/vm/transhuge.txt |  8 ++++++--
> >  mm/shmem.c                     | 12 +++---------
> >  2 files changed, 9 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> > index 2ec6adb5a4ce..7703e9c241ca 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -206,13 +206,17 @@ You can control hugepage allocation policy in tmpfs with mount option
> >  "huge=". It can have following values:
> >  
> >    - "always":
> > -    Attempt to allocate huge pages every time we need a new page;
> > +    Attempt to allocate huge pages every time we need a new page.
> > +    This option can lead to significant overhead if the filesystem is used
> > +    to store small files.
> 
> Good, yes, that part I fully agree with.
> 
> >  
> >    - "never":
> >      Do not allocate huge pages;
> >  
> >    - "within_size":
> > -    Only allocate huge page if it will be fully within i_size.
> > +    Only allocate a huge page if the size of the file is at least the size
> > +    of a huge page. This helps to avoid overhead for small files.
> > +
> >      Also respect fadvise()/madvise() hints;
> >  
> >    - "advise:
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index ad7813d73ea7..ef8fdadd0626 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1677,14 +1677,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >  			goto alloc_huge;
> >  		switch (sbinfo->huge) {
> >  			loff_t i_size;
> > -			pgoff_t off;
> >  		case SHMEM_HUGE_NEVER:
> >  			goto alloc_nohuge;
> >  		case SHMEM_HUGE_WITHIN_SIZE:
> > -			off = round_up(index, HPAGE_PMD_NR);
> > -			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> > -			if (i_size >= HPAGE_PMD_SIZE &&
> > -					i_size >> PAGE_SHIFT >= off)
> 
> I certainly agree that the old test is obscure: I give up and cry each
> time I try to work out exactly what it does.  I wanted so much to offer
> a constructive alternative before responding: how about
> 
> 			if (index < round_down(i_size_read(inode),
> 					HPAGE_PMD_SIZE) >> PAGE_SHIFT)

I tried to be clever here and allocate huge pages when the size is greater
than or equal to HPAGE_PMD_SIZE - PAGE_SIZE + 1, so we would catch a few
more cases where huge page allocation makes sense.

I did a bad job of writing it clearly. Maybe something like this (untested):

			i_size = round_up(i_size_read(inode), PAGE_SIZE);
			if (index < round_down(i_size, HPAGE_PMD_SIZE) >>
					PAGE_SHIFT)

> Of course that does not give you any huge pages while a file is being
> copied in (without a preparatory ftruncate), but it seems a more
> comprehensible within_size implementation to me.
> 
> > +			i_size = i_size_read(inode);
> > +			if (index >= HPAGE_PMD_NR || i_size >= HPAGE_PMD_SIZE)
> >  				goto alloc_huge;
> >  			/* fallthrough */
> >  		case SHMEM_HUGE_ADVISE:
> > @@ -3856,7 +3853,6 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
> >  	struct inode *inode = file_inode(vma->vm_file);
> >  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> >  	loff_t i_size;
> > -	pgoff_t off;
> >  
> >  	if (shmem_huge == SHMEM_HUGE_FORCE)
> >  		return true;
> > @@ -3868,10 +3864,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
> >  		case SHMEM_HUGE_ALWAYS:
> >  			return true;
> >  		case SHMEM_HUGE_WITHIN_SIZE:
> > -			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
> >  			i_size = round_up(i_size_read(inode), PAGE_SIZE);
> > -			if (i_size >= HPAGE_PMD_SIZE &&
> > -					i_size >> PAGE_SHIFT >= off)
> > +			if (i_size >= HPAGE_PMD_SIZE)
> >  				return true;
> 
> That's reasonable, given what you propose for shmem_getpage_gfp().
> And given other conditions at the calling khugepaged end, it might
> even be okay with my suggestion - I've not given it enough thought.
> Or simply return true there, and let khugepaged work it out?

Hm. Returning true seems to do the job.
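
i.e. (a sketch, untested):

		case SHMEM_HUGE_WITHIN_SIZE:
			/* defer the size check to khugepaged */
			return true;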

> I am pretty sure the original condition was wrong.
> 
> >  		case SHMEM_HUGE_ADVISE:
> >  			/* TODO: implement fadvise() hints */
> > -- 
> >  Kirill A. Shutemov
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-24 20:34                           ` Dave Hansen
@ 2016-10-25  5:28                             ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2016-10-25  5:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Andi Kleen, Hugh Dickins, Michal Hocko,
	Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, linux-mm,
	linux-kernel

On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote:
> On 10/21/2016 03:50 PM, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> >> To me, most of the things you're talking about are highly dependent on
> >> the access pattern generated by userspace:
> >>
> >>   - we may want to allocate huge pages from byte 1 if we know that the
> >>     file will grow;
> > 
> > delayed allocation takes care of that. We use a growing speculative
> > delalloc size that kicks in at specific sizes and can be used
> > directly to determine if a large page should be allocated. This code
> > is aware of sparse files, sparse writes, etc.
> 
> OK, so somebody does a write() of 1 byte.  We can delay the underlying
> block allocation for a long time, but we can *not* delay the memory
> allocation.  We've got to decide before the write() returns.

> How does delayed allocation help with that decision?

You (and Kirill) have likely misunderstood what I'm saying, because you
are assuming I'm talking about delayed allocation of page cache pages.

I'm not. The current code does this for a sequential write:

write( off, len)
  for each PAGE_SIZE chunk
     grab page cache page
       alloc + insert if not found
     map block to page
       get_block(off, PAGE_SIZE);
         filesystem does allocation		<<<<<<<<<<<<
       update bufferhead attached to page
     write data into page


Essentially, delayed block allocation occurs inside the get_block()
call, completely hidden from the page cache and IO layers.

In XFS, we do special stuff based on the offset being written to, the
size of the existing extent we are adding to, etc. to speculatively
/reserve/ more blocks than the write actually needs and keep them in
a delalloc extent that extends beyond EOF.

The next page is grabbed, get_block is called again, and we find
we've already got a delalloc reservation for that file offset, so
we return immediately. And we repeat that until the delalloc extent
runs out.

When it runs out, we allocate a bigger delalloc extent beyond EOF
so that as the file grows we do fewer but larger delayed allocation
reservations. These grow out to /gigabytes/ if there is that much
data to write.

i.e. the filesystem clearly knows when using large pages would be
appropriate, but because all this happens after the page cache
allocation, it can't influence it at all.
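
Not the actual XFS algorithm, but a hypothetical sketch of the shape
of that heuristic - each time the file outgrows its reservation,
reserve a larger extent beyond EOF, capped at some maximum:

	/* illustrative only: growing speculative preallocation */
	static loff_t speculative_prealloc(loff_t prev)
	{
		loff_t want = prev ? prev * 2 : SZ_64K;

		/* grow out to gigabytes for sustained streaming writes */
		return min_t(loff_t, want, SZ_1G);
	}

The real code also has to account for sparse files, sparse writes and
free space pressure, as noted above.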

Here's what the new fs/iomap.c code does for the same sequential
write:

write(off, len)
  iomap_begin(off, len)
    filesystem does delayed allocation of at least len bytes <<<<<<
  returns an iomap with single mapping
  iomap_apply()
    for each PAGE_SIZE chunk
     grab page cache page
       alloc + insert if not found
     map iomap to page
     write data into page
  iomap_end()

Hence if the write was for 1 byte into an empty file, we'd get a
single block extent back, which would match the single PAGE_SIZE
page cache allocation required. If the app is doing sequential 1
byte IO, and we're at offset 2MB and the filesystem returns a 2MB
delalloc extent (i.e. one that extends 2MB beyond EOF), we know we're
getting sequential write IO and we could use a 2MB page in the page
cache for this.

Similarly, if the app is doing large IO - say 16MB at a time, we'll
get at least a 16MB delalloc extent returned from the filesystem,
and we know we could map that quickly and easily to 8 x 2MB huge
pages in the page cache.

But if we get random 4k writes into a sparse file, the filesystem
will be allocating single blocks, so the iomaps being returned would
be for a single block, and we know that PAGE_SIZE pages would be
best to allocate.

Or we could have a 2MB extent size hint set, so every iomap
returned from the filesystem is going to be 2MB aligned and sized,
in which case we could always map the returned iomap to a huge
page rather than worry about IO sizes and incoming IO patterns.
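
A sketch of the decision this enables (hypothetical helper, not an
existing kernel function; the struct iomap fields are the ones in
fs/iomap.c today):

	/* pick the page cache allocation size from the mapped extent */
	static unsigned long pagecache_alloc_size(struct iomap *iomap,
						  loff_t pos)
	{
		if (iomap->length >= HPAGE_PMD_SIZE &&
		    IS_ALIGNED(iomap->offset, HPAGE_PMD_SIZE) &&
		    IS_ALIGNED(pos, HPAGE_PMD_SIZE))
			return HPAGE_PMD_SIZE;	/* extent covers a huge page */
		return PAGE_SIZE;		/* sparse/random: stay small */
	}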

Now do you see the difference? The filesystem now has the ability to
optimise allocation based on user application IO patterns rather
than trying to guess from single page size block mapping requests.
And because this happens before we look at the page cache, we can
use that information to influence what we do with the page cache.

> >> I'm not convinced that the filesystem is in a better position to see
> >> access patterns than the mm is for the page cache. It's not all about
> >> on-disk layout.
> > 
> > Spoken like a true mm developer. IO performance is all about IO
> > patterns, and the primary contributor to bad IO patterns is bad
> > filesystem allocation patterns.... :P
> 
> For writes, I think you have a good point.  Managing a horribly
> fragmented file with larger pages and eating the associated write
> amplification that comes along with it seems like a recipe for disaster.
> 
> But isn't some level of disconnection between the page cache and the
> underlying IO patterns a *good* thing?

Up to a point. Buffered IO only hides underlying IO patterns whilst
readahead and writebehind are effective. They rely on the filesystem
laying out files sequentially for good read rates, and on writes
hitting the disk sequentially for good write rates. The moment either of
these two things break down, the latency induced by the underlying
IO patterns will immediately dominate buffered IO performance
characteristics.

The filesystem allocation and mapping routines will do a far better
job if we can feed them extents rather than 4k pages at a time.
Kirill is struggling to work out how to get huge pages into the page
cache readahead model, and this is something that we'll solve by
adding a readahead implementation to the iomap code. i.e. iomap the
extent, decide on how much to read ahead and the page size to use,
populate the page cache, build the bios, off we go....
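
In the same pseudocode style as the write path above, that readahead
model might look like (a sketch of the plan, not existing code):

readahead(off)
  iomap_begin(off, ra_window)
    filesystem maps an extent covering the window
  decide readahead size and page size from the extent	<<<<<<
  for each (huge or small) page in the window
     allocate + insert page cache page
  build bios for the whole extent, submit
  iomap_end()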

The problem with the existing code is that the page cache breaks IO
up into discrete PAGE_SIZE mapping requests. For all the fs knows it
could be 1 byte IOs or it could be 1GB IOs being done. We can do
(and already have done) an awful lot of optimisation if we know
exactly what IO is being thrown at the fs. 

Effectively we are moving the IO path mapping methods from a
single-page-and-block iteration model to a single-extent-multiple-page
iteration model. It's a different mechanism, but it's a lot more
efficient for moving data around than what we have now...

> Once we've gone to the trouble
> of bringing some (potentially very fragmented) data into the page cache,
> why _not_ manage it in a lower-overhead way if we can?  For read-only
> data it seems like a no-brainer that we'd want things in as large of a
> management unit as we can get.
> 
> IOW, why let the underlying block allocation layout hamstring how the
> memory is managed?

We're not changing how memory is managed at all. All we are doing
is changing how we populate the page cache....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21 22:50                         ` Dave Chinner
  2016-10-21 23:32                           ` Kirill A. Shutemov
@ 2016-10-24 20:34                           ` Dave Hansen
  2016-10-25  5:28                             ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2016-10-24 20:34 UTC (permalink / raw)
  To: Dave Chinner, Kirill A. Shutemov
  Cc: Andi Kleen, Hugh Dickins, Michal Hocko, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, linux-mm, linux-kernel

On 10/21/2016 03:50 PM, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
>> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
>> To me, most of the things you're talking about are highly dependent on
>> the access pattern generated by userspace:
>>
>>   - we may want to allocate huge pages from byte 1 if we know that the
>>     file will grow;
> 
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

OK, so somebody does a write() of 1 byte.  We can delay the underlying
block allocation for a long time, but we can *not* delay the memory
allocation.  We've got to decide before the write() returns.

How does delayed allocation help with that decision?

I guess we could (always?) allocate small pages up front, and then only
bother promoting them once the FS delayed-allocation code kicks in and
is *also* giving us underlying large allocations.  That punts the logic
to the filesystem, which is a bit counterintuitive, but it seems
relatively sane.
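
Roughly, the model being suggested (a hypothetical sketch, hand-waving
the actual promotion mechanics):

write(off, len)
  allocate 4k pages as today
  get_block()/iomap_begin() reports the delalloc extent size
  if extent >= HPAGE_PMD_SIZE and the range is aligned
     mark the range as a collapse candidate
     (khugepaged-style promotion to a 2MB page later)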

>>> As such, there is no way we should be considering different
>>> interfaces and methods for configuring the /same functionality/ just
>>> because DAX is enabled or not. It's the /same decision/ that needs
>>> to be made, and the filesystem knows an awful lot more about whether
>>> huge pages can be used efficiently at the time of access than just
>>> about any other actor you can name....
>>
>> I'm not convinced that the filesystem is in a better position to see
>> access patterns than the mm is for the page cache. It's not all about
>> on-disk layout.
> 
> Spoken like a true mm developer. IO performance is all about IO
> patterns, and the primary contributor to bad IO patterns is bad
> filesystem allocation patterns.... :P

For writes, I think you have a good point.  Managing a horribly
fragmented file with larger pages and eating the associated write
amplification that comes along with it seems like a recipe for disaster.

But isn't some level of disconnection between the page cache and the
underlying IO patterns a *good* thing?  Once we've gone to the trouble
of bringing some (potentially very fragmented) data into the page cache,
why _not_ manage it in a lower-overhead way if we can?  For read-only
data it seems like a no-brainer that we'd want things in as large of a
management unit as we can get.

IOW, why let the underlying block allocation layout hamstring how the
memory is managed?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21 22:50                         ` Dave Chinner
@ 2016-10-21 23:32                           ` Kirill A. Shutemov
  2016-10-24 20:34                           ` Dave Hansen
  1 sibling, 0 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-21 23:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Hugh Dickins, Michal Hocko, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, linux-mm, linux-kernel

On Sat, Oct 22, 2016 at 09:50:13AM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > > Ugh, no, please don't use mount options for file specific behaviours
> > > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > > behaviour that should either just work automatically (i.e. be
> > > > > completely controlled by the filesystem) or only be applied to files
> > > > 
> > > > Can you explain what you mean? How would the file system control it?
> > > 
> > > There's no point in asking for huge pages when populating the page
> > > cache if the file is:
> > > 
> > > 	- significantly smaller than the huge page size
> > > 	- largely sparse
> > > 	- being randomly accessed in small chunks
> > > 	- badly fragmented and so takes hundreds of IO to read/write
> > > 	  a huge page
> > > 	- able to optimise delayed allocation to match huge page
> > > 	  sizes and alignments
> > > 
> > > These are all constraints the filesystem knows about, but the
> > > application and user don't.
> > 
> > Really?
> > 
> > To me, most of the things you're talking about are highly dependent on
> > the access pattern generated by userspace:
> > 
> >   - we may want to allocate huge pages from byte 1 if we know that the
> >     file will grow;
> 
> delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

I'm confused here. How can we delay allocation of the page cache?

Delalloc is helpful for getting a reasonable on-disk layout, but my
understanding is that it uses the page cache as a buffer to postpone
block allocation. Later, on writeback, we see the access pattern using
pages from the page cache.

I'm likely missing something important here. Hm?

> >   - it will be beneficial to allocate a huge page even for fragmented
> >     files, if it's read-mostly;
> 
> No, no it won't. The IO latency impact here can be massive.
> read-ahead of single 4k pages hides most of this latency from the
> application, but with a 2MB page, we can't use readahead to hide this
> IO latency because the first access could stall for hundreds of
> small random read IOs to be completed instead of just 1.

I agree that it will lead to an initial latency spike. But don't we have
workloads which would tolerate it to get faster hot-cache behaviour?

> > > Further, we are moving the IO path to a model where we use extents
> > > for mapping, not blocks.  We're optimising for the fact that modern
> > > filesystems use extents and so massively reduce the number of block
> > > mapping lookup calls we need to do for a given IO.
> > > 
> > > i.e. instead of doing "get page, map block to page" over and over
> > again until we've walked over the entire IO range, we're doing
> > > "map extent for entire IO range" once, then iterating "get page"
> > > until we've mapped the entire range.
> > 
> > That's great, but it's not how the IO path works *now*. And it will
> > take a long time (if ever) to flip it over to what you've described.
> 
> Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
> right now, GFS2 will use parts of it in the next release, DAX
> already uses it and PMD support in DAX is being built on top of it.

That's interesting. I've managed to miss the whole fs/iomap.c thing...

> > > As such, there is no way we should be considering different
> > > interfaces and methods for configuring the /same functionality/ just
> > > because DAX is enabled or not. It's the /same decision/ that needs
> > > to be made, and the filesystem knows an awful lot more about whether
> > > huge pages can be used efficiently at the time of access than just
> > > about any other actor you can name....
> > 
> > I'm not convinced that the filesystem is in a better position to see
> > access patterns than the mm is for the page cache. It's not all about
> > on-disk layout.
> 
> Spoken like a true mm developer.

Guilty.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21 15:00                       ` Kirill A. Shutemov
  2016-10-21 15:12                         ` Michal Hocko
@ 2016-10-21 22:50                         ` Dave Chinner
  2016-10-21 23:32                           ` Kirill A. Shutemov
  2016-10-24 20:34                           ` Dave Hansen
  1 sibling, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2016-10-21 22:50 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andi Kleen, Hugh Dickins, Michal Hocko, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > Ugh, no, please don't use mount options for file specific behaviours
> > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > behaviour that should either just work automatically (i.e. be
> > > > completely controlled by the filesystem) or only be applied to files
> > > 
> > > Can you explain what you mean? How would the file system control it?
> > 
> > There's no point in asking for huge pages when populating the page
> > cache if the file is:
> > 
> > 	- significantly smaller than the huge page size
> > 	- largely sparse
> > 	- being randomly accessed in small chunks
> > 	- badly fragmented and so takes hundreds of IO to read/write
> > 	  a huge page
> > 	- able to optimise delayed allocation to match huge page
> > 	  sizes and alignments
> > 
> > These are all constraints the filesystem knows about, but the
> > application and user don't.
> 
> Really?
> 
> To me, most of the things you're talking about are highly dependent on
> the access pattern generated by userspace:
> 
>   - we may want to allocate huge pages from byte 1 if we know that the
>     file will grow;

delayed allocation takes care of that. We use a growing speculative
delalloc size that kicks in at specific sizes and can be used
directly to determine if a large page should be allocated. This code
is aware of sparse files, sparse writes, etc.

>   - the same for a sparse file that will be filled;

See above.

>   - it will be beneficial to allocate a huge page even for fragmented
>     files, if it's read-mostly;

No, no it won't. The IO latency impact here can be massive.
read-ahead of single 4k pages hides most of this latency from the
application, but with a 2MB page, we can't use readahead to hide this
IO latency because the first access could stall for hundreds of
small random read IOs to be completed instead of just 1.


> > Further, we are moving the IO path to a model where we use extents
> > for mapping, not blocks.  We're optimising for the fact that modern
> > filesystems use extents and so massively reduce the number of block
> > mapping lookup calls we need to do for a given IO.
> > 
> > i.e. instead of doing "get page, map block to page" over and over
> again until we've walked over the entire IO range, we're doing
> > "map extent for entire IO range" once, then iterating "get page"
> > until we've mapped the entire range.
> 
> That's great, but it's not how the IO path works *now*. And it will
> take a long time (if ever) to flip it over to what you've described.

Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
right now, GFS2 will use parts of it in the next release, DAX
already uses it and PMD support in DAX is being built on top of it.

> > As such, there is no way we should be considering different
> > interfaces and methods for configuring the /same functionality/ just
> > because DAX is enabled or not. It's the /same decision/ that needs
> > to be made, and the filesystem knows an awful lot more about whether
> > huge pages can be used efficiently at the time of access than just
> > about any other actor you can name....
> 
> I'm not convinced that the filesystem is in a better position to see
> access patterns than the mm is for the page cache. It's not all about
> on-disk layout.

Spoken like a true mm developer. IO performance is all about IO
patterns, and the primary contributor to bad IO patterns is bad
filesystem allocation patterns.... :P

We're rapidly moving away from the world where a page cache is
needed to give applications decent performance. DAX doesn't have a
page cache, applications wanting to use high IOPS (hundreds of
thousands to millions) storage are using direct IO, because the page
cache just introduces latency, memory usage issues and
non-deterministic IO behaviour.

If we try to make the page cache the "one true IO optimisation source"
then we're screwing ourselves, because the incoming IO technologies
simply don't require it anymore. We need to be ahead of that curve,
not playing catchup, and that's why this sort of "what should the
page cache do" decision really needs to come from the IO path, where
we see /all/ the IO, not just buffered IO....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21 15:00                       ` Kirill A. Shutemov
@ 2016-10-21 15:12                         ` Michal Hocko
  2016-10-21 22:50                         ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2016-10-21 15:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Chinner, Andi Kleen, Hugh Dickins, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, linux-mm, linux-kernel

On Fri 21-10-16 18:00:07, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
[...]
> > None of these aspects can be optimised sanely by a single threshold,
> > especially when considering the combination of access patterns vs file
> > layout.
> 
> I agree.
> 
> Here I tried to address the particular performance regression I see with
> huge pages enabled on tmpfs. It isn't meant to fix all possible issues.

So can we start simple and use huge pages on shmem mappings only when
they are larger than the huge page? Without any tunable which might turn
out to be misleading/wrong later on. If I understand Dave's comments it
is really not all that clear that a mount option makes sense. I cannot
comment on those but they clearly show that there are multiple points of
view here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21  5:01                     ` Dave Chinner
@ 2016-10-21 15:00                       ` Kirill A. Shutemov
  2016-10-21 15:12                         ` Michal Hocko
  2016-10-21 22:50                         ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-21 15:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andi Kleen, Hugh Dickins, Michal Hocko, Kirill A. Shutemov,
	Andrea Arcangeli, Andrew Morton, linux-mm, linux-kernel

On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > Ugh, no, please don't use mount options for file specific behaviours
> > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > behaviour that should either just work automatically (i.e. be
> > > completely controlled by the filesystem) or only be applied to files
> > 
> > Can you explain what you mean? How would the file system control it?
> 
> There's no point in asking for huge pages when populating the page
> cache if the file is:
> 
> 	- significantly smaller than the huge page size
> 	- largely sparse
> 	- being randomly accessed in small chunks
> 	- badly fragmented and so takes hundreds of IO to read/write
> 	  a huge page
> 	- able to optimise delayed allocation to match huge page
> 	  sizes and alignments
> 
> These are all constraints the filesystem knows about, but the
> application and user don't.

Really?

To me, most of the things you're talking about are highly dependent on
the access pattern generated by userspace:

  - we may want to allocate huge pages from byte 1 if we know that the
    file will grow;
  - the same for a sparse file that will be filled;
  - it will be beneficial to allocate a huge page even for fragmented
    files, if it's read-mostly;

> None of these aspects can be optimised sanely by a single threshold,
> especially when considering the combination of access patterns vs file
> layout.

I agree.

Here I tried to address the particular performance regression I see with
huge pages enabled on tmpfs. It isn't meant to fix all possible issues.

> Further, we are moving the IO path to a model where we use extents
> for mapping, not blocks.  We're optimising for the fact that modern
> filesystems use extents and so massively reduce the number of block
> mapping lookup calls we need to do for a given IO.
> 
> i.e. instead of doing "get page, map block to page" over and over
> again until we've walked over the entire IO range, we're doing
> "map extent for entire IO range" once, then iterating "get page"
> until we've mapped the entire range.

That's great, but it's not how the IO path works *now*. And it will take
a long time (if ever) to flip it over to what you've described.

> Hence if we have a 2MB IO come in from userspace, and the iomap
> returned covers that entire range, it's a no-brainer to ask the
> page cache for a huge page instead of iterating 512 times to map all
> the 4k pages needed.

Yeah, it's a no-brainer.

But do we want to limit huge page allocation only to such best-possible
cases? I've hardly ever seen 2MB IOs in the real world...

And this approach puts too much decision power in the first access to
the file range. It may or may not represent the future access pattern.

> > > specifically configured with persistent hints to reliably allocate
> > > extents in a way that can be easily mapped to huge pages.
> > 
> > > e.g. on XFS you will need to apply extent size hints to get large
> > > page sized/aligned extent allocation to occur, and so this
> > 
> > It sounds like you're confusing alignment in memory with alignment
> > on disk here? I don't see why on disk alignment would be needed
> > at all, unless we're talking about DAX here (which is out of 
> > scope currently). Kirill's changes are all about making the memory
> > access for cached data more efficient, it's not about disk layout
> > optimizations.
> 
> No, I'm not confusing this with DAX. However, this automatic use
> model for huge pages fits straight into DAX as well.  Same
> mechanisms, same behaviours, slightly stricter alignment
> characteristics. All stuff the filesystem already knows about.
> 
> Mount options are, quite frankly, a terrible mechanism for
> specifying filesystem policy. Setting up DAX this way was a mistake,
> and it's a mount option I plan to remove from XFS once we get nearer
> to having DAX feature complete and stabilised. We've already got
> on-disk "use DAX for this file" flags in XFS, so we can easily and
> cleanly support different methods of accessing PMEM from the same
> filesystem.
> 
> As such, there is no way we should be considering different
> interfaces and methods for configuring the /same functionality/ just
> because DAX is enabled or not. It's the /same decision/ that needs
> to be made, and the filesystem knows an awful lot more about whether
> huge pages can be used efficiently at the time of access than just
> about any other actor you can name....

I'm not convinced that the filesystem is in a better position to see
access patterns than the mm is for the page cache. It's not all about
on-disk layout.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-21  2:01                   ` Andi Kleen
@ 2016-10-21  5:01                     ` Dave Chinner
  2016-10-21 15:00                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2016-10-21  5:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Kirill A. Shutemov, Hugh Dickins, Michal Hocko,
	Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, linux-mm,
	linux-kernel

On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > Ugh, no, please don't use mount options for file specific behaviours
> > in filesystems like ext4 and XFS. This is exactly the sort of
> > behaviour that should either just work automatically (i.e. be
> > completely controlled by the filesystem) or only be applied to files
> 
> Can you explain what you mean? How would the file system control it?

There's no point in asking for huge pages when populating the page
cache if the file is:

	- significantly smaller than the huge page size
	- largely sparse
	- being randomly accessed in small chunks
	- badly fragmented and so takes hundreds of IO to read/write
	  a huge page
	- able to optimise delayed allocation to match huge page
	  sizes and alignments

These are all constraints the filesystem knows about, but the
application and user don't. None of these aspects can be optimised
sanely by a single threshold, especially when considering the
combination of access patterns vs file layout.

Further, we are moving the IO path to a model where we use extents
for mapping, not blocks.  We're optimising for the fact that modern
filesystems use extents and so massively reduce the number of block
mapping lookup calls we need to do for a given IO.

i.e. instead of doing "get page, map block to page" over and over
again until we've walked over the entire IO range, we're doing
"map extent for entire IO range" once, then iterating "get page"
until we've mapped the entire range.

Hence if we have a 2MB IO come in from userspace, and the iomap
returned covers that entire range, it's a no-brainer to ask the
page cache for a huge page instead of iterating 512 times to map all
the 4k pages needed.
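
Side by side, in rough pseudocode (illustrative only):

	block-at-a-time (current):
	  for each of the 512 x 4k pages in a 2MB write
	     get_block(off, PAGE_SIZE)		512 mapping calls
	     copy 4k into the page

	extent-at-a-time (iomap):
	  iomap_begin(off, 2MB)			1 mapping call
	  if the extent covers the whole 2MB range
	     allocate one 2MB page		instead of 512 x 4k pages
	  copy the data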

> > specifically configured with persistent hints to reliably allocate
> > extents in a way that can be easily mapped to huge pages.
> 
> > e.g. on XFS you will need to apply extent size hints to get large
> > page sized/aligned extent allocation to occur, and so this
> 
> It sounds like you're confusing alignment in memory with alignment
> on disk here? I don't see why on disk alignment would be needed
> at all, unless we're talking about DAX here (which is out of 
> scope currently). Kirill's changes are all about making the memory
> access for cached data more efficient, it's not about disk layout
> optimizations.

No, I'm not confusing this with DAX. However, this automatic use
model for huge pages fits straight into DAX as well.  Same
mechanisms, same behaviours, slightly stricter alignment
characteristics. All stuff the filesystem already knows about.

Mount options are, quite frankly, a terrible mechanism for
specifying filesystem policy. Setting up DAX this way was a mistake,
and it's a mount option I plan to remove from XFS once we get nearer
to having DAX feature complete and stabilised. We've already got
on-disk "use DAX for this file" flags in XFS, so we can easily and
cleanly support different methods of accessing PMEM from the same
filesystem.

As such, there is no way we should be considering different
interfaces and methods for configuring the /same functionality/ just
because DAX is enabled or not. It's the /same decision/ that needs
to be made, and the filesystem knows an awful lot more about whether
huge pages can be used efficiently at the time of access than just
about any other actor you can name....

> > persistent extent size hint should trigger the filesystem to use
> > large pages if supported, the hint is correctly sized and aligned,
> > and there are large pages available for allocation.
> 
> That would be ioctls and similar?

You can, but existing filesystem admin tools can already set up
allocation policies without the apps being aware that they even
exist. If you want to use huge page mappings with DAX you'll already
need to do this because of the physical alignment requirements of
DAX.

Further, such techniques are already used by many admins for things
like limiting fragmentation of sparse vm image files. So while you
may not know it, extent size hints and per-file inheritable
attributes are quite widely used already to manage filesystem
behaviour without users or applications even being aware that the
filesystem policies have been modified by the admin...
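
For instance, with standard xfs_io usage (the hint itself is real;
whether the kernel then uses huge pages is the hypothetical part under
discussion):

	# 2MB allocation hint on a directory; new files inherit it
	xfs_io -c "extsize 2m" /mnt/xfs/vmimages

	# or on an existing file
	xfs_io -c "extsize 2m" /mnt/xfs/vmimages/disk0.img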

> That would imply that every application wanting to use large pages
> would need to be specially enabled. That would seem awfully limiting
> to me and needlessly deny benefits to most existing code.

No change to applications will be necessary (see above), though
there's no reason why they couldn't directly use the VFS interfaces to
explicitly ask for such behaviour themselves....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-20 22:46                 ` Dave Chinner
@ 2016-10-21  2:01                   ` Andi Kleen
  2016-10-21  5:01                     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2016-10-21  2:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kirill A. Shutemov, Hugh Dickins, Michal Hocko,
	Kirill A. Shutemov, Andrea Arcangeli, Andrew Morton, linux-mm,
	linux-kernel

> Ugh, no, please don't use mount options for file specific behaviours
> in filesystems like ext4 and XFS. This is exactly the sort of
> behaviour that should either just work automatically (i.e. be
> completely controlled by the filesystem) or only be applied to files

Can you explain what you mean? How would the file system control it?

> specifically configured with persistent hints to reliably allocate
> extents in a way that can be easily mapped to huge pages.

> e.g. on XFS you will need to apply extent size hints to get large
> page sized/aligned extent allocation to occur, and so this

It sounds like you're confusing alignment in memory with alignment
on disk here? I don't see why on disk alignment would be needed
at all, unless we're talking about DAX here (which is out of 
scope currently). Kirill's changes are all about making the memory
access for cached data more efficient, it's not about disk layout
optimizations.

> persistent extent size hint should trigger the filesystem to use
> large pages if supported, the hint is correctly sized and aligned,
> and there are large pages available for allocation.

That would be ioctls and similar?

That would imply that every application wanting to use large pages
would need to be specially enabled. That would seem awfully limiting
to me and needlessly deny benefits to most existing code.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-20 10:39               ` Kirill A. Shutemov
@ 2016-10-20 22:46                 ` Dave Chinner
  2016-10-21  2:01                   ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2016-10-20 22:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Michal Hocko, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Thu, Oct 20, 2016 at 01:39:46PM +0300, Kirill A. Shutemov wrote:
> On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote:
> > On Tue, 18 Oct 2016, Michal Hocko wrote:
> > > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > > > [...]
> > > > > > > > We add two handles to specify the minimal file size for huge pages:
> > > > > > > > 
> > > > > > > >   - mount option 'huge_min_size';
> > > > > > > > 
> > > > > > > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > > > >     in-kernel tmpfs mountpoint;
> > > > > > > 
> > > > > > > Could you explain who might like to change the minimum value (other than
> > > > > > > to disable the feature for the mount point) and for what reason?
> > > > > > 
> > > > > > Depending on how well the CPU microarchitecture deals with huge pages, you
> > > > > > might need to set it higher in order to balance out the overhead against
> > > > > > the benefit of huge pages.
> > > > > 
> > > > > I am not sure this is a good argument. How does a user know, and what will
> > > > > help to make that decision? Why can we not autotune that? In other words,
> > > > > adding new knobs just in case has turned out to be a bad idea in the past.
> > > > 
> > > > Well, I don't see a reasonable way to autotune it. We can just let
> > > > arch-specific code redefine it, but the argument below still stands.
> > > > 
> > > > > In another case, if it's known in advance that a specific mount would be
> > > > > populated with large files, you might want to set it to zero to get huge
> > > > > pages allocated from the beginning.
> > > 
> > > Do you think this is a sufficient reason to provide a tunable with such
> > > precision? In other words, why can we not simply start by using an
> > > internal-only limit at the huge page size for the initial transition
> > > (with a way to disable THP altogether for a mount point) and only add
> > > more fine-grained tuning if there ever is a real need for it, with a
> > > use case description? In other words, can we be less optimistic about
> > > tunables than we used to be in the past, when we often found out that
> > > those were mistakes much later?
> > 
> > I'm not sure whether I'm arguing in the same or the opposite direction
> > as you, Michal, but what makes me unhappy is not so much the tunable,
> > as the proliferation of mount options.
> > 
> > Kirill, this issue is (not exactly but close enough) what the mount
> > option "huge=within_size" was supposed to be about: not wasting huge
> > pages on small files.  I'd be much happier if you made huge_min_size
> > into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
> > and used it to govern "huge=within_size" mounts only.
> 
> > Well, you're right that I originally tried to address the issue with
> > huge=within_size, but this option makes much more sense for filesystems
> > with persistent storage. For ext4, it would be a pretty usable option.

Ugh, no, please don't use mount options for file specific behaviours
in filesystems like ext4 and XFS. This is exactly the sort of
behaviour that should either just work automatically (i.e. be
completely controlled by the filesystem) or only be applied to files
specifically configured with persistent hints to reliably allocate
extents in a way that can be easily mapped to huge pages.

e.g. on XFS you will need to apply extent size hints to get large
page sized/aligned extent allocation to occur, and so this
persistent extent size hint should trigger the filesystem to use
large pages if supported, the hint is correctly sized and aligned,
and there are large pages available for allocation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-19 18:13             ` Hugh Dickins
@ 2016-10-20 10:39               ` Kirill A. Shutemov
  2016-10-20 22:46                 ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-20 10:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Kirill A. Shutemov, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Wed, Oct 19, 2016 at 11:13:54AM -0700, Hugh Dickins wrote:
> On Tue, 18 Oct 2016, Michal Hocko wrote:
> > On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > > [...]
> > > > > > > We add two handles to specify the minimal file size for huge pages:
> > > > > > > 
> > > > > > >   - mount option 'huge_min_size';
> > > > > > > 
> > > > > > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > > >     in-kernel tmpfs mountpoint;
> > > > > > 
> > > > > > Could you explain who might like to change the minimum value (other than
> > > > > > to disable the feature for the mount point) and for what reason?
> > > > > 
> > > > > Depending on how well the CPU microarchitecture deals with huge pages, you
> > > > > might need to set it higher in order to balance out the overhead against
> > > > > the benefit of huge pages.
> > > > 
> > > > I am not sure this is a good argument. How does a user know, and what will
> > > > help to make that decision? Why can we not autotune that? In other words,
> > > > adding new knobs just in case has turned out to be a bad idea in the past.
> > > 
> > > Well, I don't see a reasonable way to autotune it. We can just let
> > > arch-specific code redefine it, but the argument below still stands.
> > > 
> > > > In another case, if it's known in advance that a specific mount would be
> > > > populated with large files, you might want to set it to zero to get huge
> > > > pages allocated from the beginning.
> > 
> > Do you think this is a sufficient reason to provide a tunable with such
> > precision? In other words, why can we not simply start by using an
> > internal-only limit at the huge page size for the initial transition
> > (with a way to disable THP altogether for a mount point) and only add
> > more fine-grained tuning if there ever is a real need for it, with a
> > use case description? In other words, can we be less optimistic about
> > tunables than we used to be in the past, when we often found out that
> > those were mistakes much later?
> 
> I'm not sure whether I'm arguing in the same or the opposite direction
> as you, Michal, but what makes me unhappy is not so much the tunable,
> as the proliferation of mount options.
> 
> Kirill, this issue is (not exactly but close enough) what the mount
> option "huge=within_size" was supposed to be about: not wasting huge
> pages on small files.  I'd be much happier if you made huge_min_size
> into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
> and used it to govern "huge=within_size" mounts only.

Well, you're right that I originally tried to address the issue with
huge=within_size, but this option makes much more sense for filesystems
with persistent storage. For ext4, it would be a pretty usable option.

What you propose would change the semantics of the option, and it would
diverge from how it works on ext4.

I guess it may make sense, taking into account that shmem/tmpfs is
special, in the sense that we always start with an empty filesystem.

If everybody agrees, I'll respin the patch with a single tunable that
manages all huge=within_size mounts.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-18 18:30           ` Michal Hocko
@ 2016-10-19 18:13             ` Hugh Dickins
  2016-10-20 10:39               ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2016-10-19 18:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Kirill A. Shutemov, Hugh Dickins,
	Andrea Arcangeli, Andrew Morton, Andi Kleen, linux-mm,
	linux-kernel

On Tue, 18 Oct 2016, Michal Hocko wrote:
> On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> > On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > > [...]
> > > > > > We add two handles to specify the minimal file size for huge pages:
> > > > > > 
> > > > > >   - mount option 'huge_min_size';
> > > > > > 
> > > > > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > > >     in-kernel tmpfs mountpoint;
> > > > > 
> > > > > Could you explain who might like to change the minimum value (other than
> > > > > to disable the feature for the mount point) and for what reason?
> > > > 
> > > > Depending on how well the CPU microarchitecture deals with huge pages, you
> > > > might need to set it higher in order to balance out the overhead against
> > > > the benefit of huge pages.
> > > 
> > > I am not sure this is a good argument. How does a user know, and what will
> > > help to make that decision? Why can we not autotune that? In other words,
> > > adding new knobs just in case has turned out to be a bad idea in the past.
> > 
> > Well, I don't see a reasonable way to autotune it. We can just let
> > arch-specific code redefine it, but the argument below still stands.
> > 
> > > In another case, if it's known in advance that a specific mount would be
> > > populated with large files, you might want to set it to zero to get huge
> > > pages allocated from the beginning.
> 
> Do you think this is a sufficient reason to provide a tunable with such
> precision? In other words, why can we not simply start by using an
> internal-only limit at the huge page size for the initial transition
> (with a way to disable THP altogether for a mount point) and only add
> more fine-grained tuning if there ever is a real need for it, with a
> use case description? In other words, can we be less optimistic about
> tunables than we used to be in the past, when we often found out that
> those were mistakes much later?

I'm not sure whether I'm arguing in the same or the opposite direction
as you, Michal, but what makes me unhappy is not so much the tunable,
as the proliferation of mount options.

Kirill, this issue is (not exactly but close enough) what the mount
option "huge=within_size" was supposed to be about: not wasting huge
pages on small files.  I'd be much happier if you made huge_min_size
into a /sys/kernel/mm/transparent_hugepage/shmem_within_size tunable,
and used it to govern "huge=within_size" mounts only.

Hugh

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-18 14:32         ` Kirill A. Shutemov
@ 2016-10-18 18:30           ` Michal Hocko
  2016-10-19 18:13             ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-10-18 18:30 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Tue 18-10-16 17:32:07, Kirill A. Shutemov wrote:
> On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> > On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> > [...]
> > > > > We add two handles to specify the minimal file size for huge pages:
> > > > > 
> > > > >   - mount option 'huge_min_size';
> > > > > 
> > > > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > > >     in-kernel tmpfs mountpoint;
> > > > 
> > > > Could you explain who might like to change the minimum value (other than
> > > > to disable the feature for the mount point) and for what reason?
> > > 
> > > Depending on how well the CPU microarchitecture deals with huge pages, you
> > > might need to set it higher in order to balance out the overhead against
> > > the benefit of huge pages.
> > 
> > I am not sure this is a good argument. How does a user know, and what will
> > help to make that decision? Why can we not autotune that? In other words,
> > adding new knobs just in case has turned out to be a bad idea in the past.
> 
> Well, I don't see a reasonable way to autotune it. We can just let
> arch-specific code redefine it, but the argument below still stands.
> 
> > In another case, if it's known in advance that a specific mount would be
> > populated with large files, you might want to set it to zero to get huge
> > pages allocated from the beginning.

Do you think this is a sufficient reason to provide a tunable with such
precision? In other words, why can we not simply start by using an
internal-only limit at the huge page size for the initial transition
(with a way to disable THP altogether for a mount point) and only add
more fine-grained tuning if there ever is a real need for it, with a
use case description? In other words, can we be less optimistic about
tunables than we used to be in the past, when we often found out that
those were mistakes much later?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-18 14:20       ` Michal Hocko
@ 2016-10-18 14:32         ` Kirill A. Shutemov
  2016-10-18 18:30           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-18 14:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Tue, Oct 18, 2016 at 04:20:07PM +0200, Michal Hocko wrote:
> On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> > On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> [...]
> > > > We add two handles to specify the minimal file size for huge pages:
> > > > 
> > > >   - mount option 'huge_min_size';
> > > > 
> > > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > > >     in-kernel tmpfs mountpoint;
> > > 
> > > Could you explain who might like to change the minimum value (other than
> > > to disable the feature for the mount point) and for what reason?
> > 
> > Depending on how well the CPU microarchitecture deals with huge pages, you
> > might need to set it higher in order to balance out the overhead against
> > the benefit of huge pages.
> 
> I am not sure this is a good argument. How does a user know, and what will
> help to make that decision? Why can we not autotune that? In other words,
> adding new knobs just in case has turned out to be a bad idea in the past.

Well, I don't see a reasonable way to autotune it. We can just let
arch-specific code redefine it, but the argument below still stands.

> > In another case, if it's known in advance that a specific mount would be
> > populated with large files, you might want to set it to zero to get huge
> > pages allocated from the beginning.
> 
> Can we not use [mf]advise for that purpose?

There's no fadvise for this at the moment. We can use madvise, except that
the patch makes it lower priority than the limit :P. I'll fix that.

But in general, it would require a change to the program, which is not
always desirable or even possible.
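
For reference, the existing per-mapping hint looks like this (standard
madvise(2) usage; fd and len assumed set up elsewhere):

	#include <sys/mman.h>

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);	/* fd on tmpfs */
	/* ask for huge pages on this range, regardless of file size */
	madvise(addr, len, MADV_HUGEPAGE);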

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-17 14:55     ` Kirill A. Shutemov
@ 2016-10-18 14:20       ` Michal Hocko
  2016-10-18 14:32         ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-10-18 14:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Mon 17-10-16 17:55:40, Kirill A. Shutemov wrote:
> On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> > On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
[...]
> > > We add two handles to specify the minimal file size for huge pages:
> > > 
> > >   - mount option 'huge_min_size';
> > > 
> > >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> > >     in-kernel tmpfs mountpoint;
> > 
> > Could you explain who might like to change the minimum value (other than
> > to disable the feature for the mount point) and for what reason?
> 
> Depending on how well the CPU microarchitecture deals with huge pages, you
> might need to set it higher in order to balance out the overhead against
> the benefit of huge pages.

I am not sure this is a good argument. How does a user know, and what will
help to make that decision? Why can we not autotune that? In other words,
adding new knobs just in case has turned out to be a bad idea in the past.

> In another case, if it's known in advance that a specific mount would be
> populated with large files, you might want to set it to zero to get huge
> pages allocated from the beginning.

Can we not use [mf]advise for that purpose?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-17 14:12   ` Michal Hocko
@ 2016-10-17 14:55     ` Kirill A. Shutemov
  2016-10-18 14:20       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-17 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Andi Kleen, linux-mm, linux-kernel

On Mon, Oct 17, 2016 at 04:12:46PM +0200, Michal Hocko wrote:
> On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
> [...]
> > >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Mon, 17 Oct 2016 14:44:47 +0300
> > Subject: [PATCH] shmem: avoid huge pages for small files
> > 
> > Huge pages are detrimental for small files: they cause noticeable
> > overhead on both allocation performance and memory footprint.
> > 
> > This patch aims to address the issue by avoiding huge pages until the
> > file has grown to a specified size. This would cover most of the cases
> > where huge pages cause regressions in performance.
> > 
> > By default, the minimal file size to allocate huge pages is equal to
> > the size of a huge page.
> 
> ok
> 
> > We add two handles to specify the minimal file size for huge pages:
> > 
> >   - mount option 'huge_min_size';
> > 
> >   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
> >     in-kernel tmpfs mountpoint;
> 
> Could you explain who might like to change the minimum value (other than
> to disable the feature for the mount point) and for what reason?

Depending on how well the CPU microarchitecture deals with huge pages, you
might need to set it higher in order to balance out the overhead against
the benefit of huge pages.

In another case, if it's known in advance that a specific mount would be
populated with large files, you might want to set it to zero to get huge
pages allocated from the beginning.
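
With the knobs as proposed in this patch (illustrative values):

	# per-mount: huge pages from byte 1 for a mount known to hold
	# large files
	mount -t tmpfs -o huge=always,huge_min_size=0 tmpfs /mnt/big

	# in-kernel tmpfs mount: raise the threshold to 4MB
	echo 4194304 > /sys/kernel/mm/transparent_hugepage/shmem_min_size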

> > @@ -238,6 +238,12 @@ values:
> >    - "force":
> >      Force the huge option on for all - very useful for testing;
> >  
> > +Tehre's limit on minimal file size before kenrel starts allocate huge
> > +pages for it. By default it's size of huge page.
> 
> Smoe tyopse

Wlil fxi!

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-17 12:30 ` Kirill A. Shutemov
@ 2016-10-17 14:12   ` Michal Hocko
  2016-10-17 14:55     ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2016-10-17 14:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Andi Kleen,
	linux-mm, linux-kernel

On Mon 17-10-16 15:30:21, Kirill A. Shutemov wrote:
[...]
> >From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Mon, 17 Oct 2016 14:44:47 +0300
> Subject: [PATCH] shmem: avoid huge pages for small files
> 
> Huge pages are detrimental for small files: they cause noticeable
> overhead on both allocation performance and memory footprint.
> 
> This patch aims to address the issue by avoiding huge pages until the
> file has grown to a specified size. This would cover most of the cases
> where huge pages cause regressions in performance.
> 
> By default the minimal file size to allocate huge pages is equal to size
> of huge page.

ok

> We add two handles to specify the minimal file size for huge pages:
> 
>   - mount option 'huge_min_size';
> 
>   - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
>     the in-kernel tmpfs mountpoint;

Could you explain who might like to change the minimum value (other than
disabling the feature for the mount point) and for what reason?

[...]

> @@ -238,6 +238,12 @@ values:
>    - "force":
>      Force the huge option on for all - very useful for testing;
>  
> +Tehre's limit on minimal file size before kenrel starts allocate huge
> +pages for it. By default it's size of huge page.

Smoe tyopse
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] shmem: avoid huge pages for small files
  2016-10-17 12:18 [PATCH] " Kirill A. Shutemov
@ 2016-10-17 12:30 ` Kirill A. Shutemov
  2016-10-17 14:12   ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-17 12:30 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Andi Kleen, linux-mm, linux-kernel

On Mon, Oct 17, 2016 at 03:18:09PM +0300, Kirill A. Shutemov wrote:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..c69047386e2f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>  /* ifdef here to avoid bloating shmem.o when not necessary */
>  
>  int shmem_huge __read_mostly;
> +unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly;

Arghh.. Last second changes...

This should be 

unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE;
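
(Why the original line doesn't build: __read_mostly expands to roughly a
section attribute, and GCC only accepts the attribute on the declarator,
before the initializer. A minimal illustration, not from the thread:)

#define __read_mostly __attribute__((__section__(".data..read_mostly")))

/* unsigned long long broken = 8 __read_mostly;	<- parse error */
unsigned long long fixed __read_mostly = 8;	/* attribute before '=': OK */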


From fd0b01b9797ddf2bef308c506c42d3dd50f11793 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 17 Oct 2016 14:44:47 +0300
Subject: [PATCH] shmem: avoid huge pages for small files

Huge pages are detrimental for small files: they cause noticeable
overhead on both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to a specified size. This covers most of the cases where
huge pages cause performance regressions.

By default, the minimal file size for allocating huge pages is equal to
the size of a huge page.

We add two handles to specify the minimal file size for huge pages:

  - mount option 'huge_min_size';

  - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
    the in-kernel tmpfs mountpoint;

A few notes:

  - if shmem_enabled is set to 'force', the limit is ignored. We still
    want to generate as many huge pages as possible for functional testing.

  - the limit doesn't affect khugepaged behaviour: it can still collapse
    pages based on its settings;

  - a remount of the filesystem doesn't affect previously allocated pages,
    but the limit applies to new allocations;

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |  6 +++++
 include/linux/huge_mm.h        |  1 +
 include/linux/shmem_fs.h       |  1 +
 mm/huge_memory.c               |  1 +
 mm/shmem.c                     | 56 ++++++++++++++++++++++++++++++++++++++----
 5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..40006d193687 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,12 @@ values:
   - "force":
     Force the huge option on for all - very useful for testing;
 
+Tehre's limit on minimal file size before kenrel starts allocate huge
+pages for it. By default it's size of huge page.
+
+You can adjust the limit using "huge_min_size=" mount option or
+/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount.
+
 == Need of application restart ==
 
 The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d99873..515b96a5a592 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute shmem_min_size_attr;
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7043b6..e7c3bddc6335 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,6 +31,7 @@ struct shmem_sb_info {
 	spinlock_t stat_lock;	    /* Serialize shmem_sb_info changes */
 	umode_t mode;		    /* Mount mode for root directory */
 	unsigned char huge;	    /* Whether to try for hugepages */
+	loff_t huge_min_size;       /* No hugepages if i_size less than this */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..fa133eb5bf62 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -309,6 +309,7 @@ static struct attribute *hugepage_attr[] = {
 	&use_zero_page_attr.attr,
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
 	&shmem_enabled_attr.attr,
+	&shmem_min_size_attr.attr,
 #endif
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..c15eee0eb885 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
 int shmem_huge __read_mostly;
+unsigned long long shmem_huge_min_size __read_mostly = HPAGE_PMD_SIZE;
 
 static int shmem_parse_huge(const char *str)
 {
@@ -1668,6 +1669,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		swap_free(swap);
 
 	} else {
+		loff_t i_size;
+
 		/* shmem_symlink() */
 		if (mapping->a_ops != &shmem_aops)
 			goto alloc_nohuge;
@@ -1675,14 +1678,17 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			goto alloc_nohuge;
 		if (shmem_huge == SHMEM_HUGE_FORCE)
 			goto alloc_huge;
+		i_size = i_size_read(inode);
+		if (i_size < sbinfo->huge_min_size &&
+				index < (sbinfo->huge_min_size >> PAGE_SHIFT))
+			goto alloc_nohuge;
 		switch (sbinfo->huge) {
-			loff_t i_size;
 			pgoff_t off;
 		case SHMEM_HUGE_NEVER:
 			goto alloc_nohuge;
 		case SHMEM_HUGE_WITHIN_SIZE:
 			off = round_up(index, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
+			i_size = round_up(i_size, PAGE_SIZE);
 			if (i_size >= HPAGE_PMD_SIZE &&
 					i_size >> PAGE_SHIFT >= off)
 				goto alloc_huge;
@@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
 					huge != SHMEM_HUGE_NEVER)
 				goto bad_val;
 			sbinfo->huge = huge;
+		} else if (!strcmp(this_char, "huge_min_size")) {
+			sbinfo->huge_min_size = memparse(value, &rest);
+			if (*rest)
+				goto bad_val;
 #endif
 #ifdef CONFIG_NUMA
 		} else if (!strcmp(this_char,"mpol")) {
@@ -3382,6 +3392,8 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 	int error = -EINVAL;
 
 	config.mpol = NULL;
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+		config.huge_min_size = HPAGE_PMD_SIZE;
 	if (shmem_parse_options(data, &config, true))
 		return error;
 
@@ -3403,6 +3415,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 
 	error = 0;
 	sbinfo->huge = config.huge;
+	sbinfo->huge_min_size = config.huge_min_size;
 	sbinfo->max_blocks  = config.max_blocks;
 	sbinfo->max_inodes  = config.max_inodes;
 	sbinfo->free_inodes = config.max_inodes - inodes;
@@ -3438,8 +3451,10 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
-	if (sbinfo->huge)
+	if (sbinfo->huge) {
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+		seq_printf(seq, ",huge_min_size=%llu", sbinfo->huge_min_size);
+	}
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
 	return 0;
@@ -3542,6 +3557,8 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
 	sbinfo->mode = S_IRWXUGO | S_ISVTX;
 	sbinfo->uid = current_fsuid();
 	sbinfo->gid = current_fsgid();
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+		sbinfo->huge_min_size = HPAGE_PMD_SIZE;
 	sb->s_fs_info = sbinfo;
 
 #ifdef CONFIG_TMPFS
@@ -3780,9 +3797,10 @@ int __init shmem_init(void)
 	}
 
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
-	if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+	if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY) {
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
-	else
+		SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = shmem_huge_min_size;
+	} else
 		shmem_huge = 0; /* just in case it was patched */
 #endif
 	return 0;
@@ -3848,6 +3866,34 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 
 struct kobj_attribute shmem_enabled_attr =
 	__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+static ssize_t shmem_min_size_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%llu\n", shmem_huge_min_size);
+}
+
+
+static ssize_t shmem_min_size_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	unsigned long long size;
+	char *end;
+
+	size = memparse(buf, &end);
+	if (end == buf)
+		return -EINVAL;
+	if (*end == '\n')
+		end++;
+	if (*end != '\0')
+		return -EINVAL;
+	shmem_huge_min_size = size;
+	SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = size;
+	return end - buf;
+}
+
+struct kobj_attribute shmem_min_size_attr =
+	__ATTR(shmem_min_size, 0644, shmem_min_size_show, shmem_min_size_store);
 #endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */
 
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
-- 
2.9.3

-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH] shmem: avoid huge pages for small files
@ 2016-10-17 12:18 Kirill A. Shutemov
  2016-10-17 12:30 ` Kirill A. Shutemov
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2016-10-17 12:18 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Andi Kleen, linux-mm, linux-kernel, Kirill A. Shutemov

Huge pages are detrimental for small files: they cause noticeable
overhead on both allocation performance and memory footprint.

This patch aims to address the issue by avoiding huge pages until the
file has grown to a specified size. This covers most of the cases where
huge pages cause performance regressions.

By default, the minimal file size for allocating huge pages is equal to
the size of a huge page.

We add two handles to specify the minimal file size for huge pages:

  - mount option 'huge_min_size';

  - sysfs file /sys/kernel/mm/transparent_hugepage/shmem_min_size for
    the in-kernel tmpfs mountpoint;

A few notes:

  - if shmem_enabled is set to 'force', the limit is ignored. We still
    want to generate as many huge pages as possible for functional testing.

  - the limit doesn't affect khugepaged behaviour: it can still collapse
    pages based on its settings;

  - a remount of the filesystem doesn't affect previously allocated pages,
    but the limit applies to new allocations;

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt |  6 +++++
 include/linux/huge_mm.h        |  1 +
 include/linux/shmem_fs.h       |  1 +
 mm/huge_memory.c               |  1 +
 mm/shmem.c                     | 56 ++++++++++++++++++++++++++++++++++++++----
 5 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 2ec6adb5a4ce..40006d193687 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -238,6 +238,12 @@ values:
   - "force":
     Force the huge option on for all - very useful for testing;
 
+Tehre's limit on minimal file size before kenrel starts allocate huge
+pages for it. By default it's size of huge page.
+
+You can adjust the limit using "huge_min_size=" mount option or
+/sys/kernel/mm/transparent_hugepage/shmem_min_size for in-kernel mount.
+
 == Need of application restart ==
 
 The transparent_hugepage/enabled values and tmpfs mount option only affect
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d99873..515b96a5a592 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -52,6 +52,7 @@ extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute shmem_min_size_attr;
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index ff078e7043b6..e7c3bddc6335 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,6 +31,7 @@ struct shmem_sb_info {
 	spinlock_t stat_lock;	    /* Serialize shmem_sb_info changes */
 	umode_t mode;		    /* Mount mode for root directory */
 	unsigned char huge;	    /* Whether to try for hugepages */
+	loff_t huge_min_size;       /* No hugepages if i_size less than this */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25cb30fe..fa133eb5bf62 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -309,6 +309,7 @@ static struct attribute *hugepage_attr[] = {
 	&use_zero_page_attr.attr,
 #if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
 	&shmem_enabled_attr.attr,
+	&shmem_min_size_attr.attr,
 #endif
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..c69047386e2f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -369,6 +369,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
 int shmem_huge __read_mostly;
+unsigned long long shmem_huge_min_size = HPAGE_PMD_SIZE __read_mostly;
 
 static int shmem_parse_huge(const char *str)
 {
@@ -1668,6 +1669,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		swap_free(swap);
 
 	} else {
+		loff_t i_size;
+
 		/* shmem_symlink() */
 		if (mapping->a_ops != &shmem_aops)
 			goto alloc_nohuge;
@@ -1675,14 +1678,17 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 			goto alloc_nohuge;
 		if (shmem_huge == SHMEM_HUGE_FORCE)
 			goto alloc_huge;
+		i_size = i_size_read(inode);
+		if (i_size < sbinfo->huge_min_size &&
+				index < (sbinfo->huge_min_size >> PAGE_SHIFT))
+			goto alloc_nohuge;
 		switch (sbinfo->huge) {
-			loff_t i_size;
 			pgoff_t off;
 		case SHMEM_HUGE_NEVER:
 			goto alloc_nohuge;
 		case SHMEM_HUGE_WITHIN_SIZE:
 			off = round_up(index, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
+			i_size = round_up(i_size, PAGE_SIZE);
 			if (i_size >= HPAGE_PMD_SIZE &&
 					i_size >> PAGE_SHIFT >= off)
 				goto alloc_huge;
@@ -3349,6 +3355,10 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
 					huge != SHMEM_HUGE_NEVER)
 				goto bad_val;
 			sbinfo->huge = huge;
+		} else if (!strcmp(this_char, "huge_min_size")) {
+			sbinfo->huge_min_size = memparse(value, &rest);
+			if (*rest)
+				goto bad_val;
 #endif
 #ifdef CONFIG_NUMA
 		} else if (!strcmp(this_char,"mpol")) {
@@ -3382,6 +3392,8 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 	int error = -EINVAL;
 
 	config.mpol = NULL;
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+		config.huge_min_size = HPAGE_PMD_SIZE;
 	if (shmem_parse_options(data, &config, true))
 		return error;
 
@@ -3403,6 +3415,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 
 	error = 0;
 	sbinfo->huge = config.huge;
+	sbinfo->huge_min_size = config.huge_min_size;
 	sbinfo->max_blocks  = config.max_blocks;
 	sbinfo->max_inodes  = config.max_inodes;
 	sbinfo->free_inodes = config.max_inodes - inodes;
@@ -3438,8 +3451,10 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
-	if (sbinfo->huge)
+	if (sbinfo->huge) {
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
+		seq_printf(seq, ",huge_min_size=%llu", sbinfo->huge_min_size);
+	}
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
 	return 0;
@@ -3542,6 +3557,8 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
 	sbinfo->mode = S_IRWXUGO | S_ISVTX;
 	sbinfo->uid = current_fsuid();
 	sbinfo->gid = current_fsgid();
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+		sbinfo->huge_min_size = HPAGE_PMD_SIZE;
 	sb->s_fs_info = sbinfo;
 
 #ifdef CONFIG_TMPFS
@@ -3780,9 +3797,10 @@ int __init shmem_init(void)
 	}
 
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
-	if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY)
+	if (has_transparent_hugepage() && shmem_huge < SHMEM_HUGE_DENY) {
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
-	else
+		SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = shmem_huge_min_size;
+	} else
 		shmem_huge = 0; /* just in case it was patched */
 #endif
 	return 0;
@@ -3848,6 +3866,34 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 
 struct kobj_attribute shmem_enabled_attr =
 	__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
+
+static ssize_t shmem_min_size_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%llu\n", shmem_huge_min_size);
+}
+
+
+static ssize_t shmem_min_size_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	unsigned long long size;
+	char *end;
+
+	size = memparse(buf, &end);
+	if (end == buf)
+	return -EINVAL;
+	if (*end == '\n')
+		end++;
+	if (*end != '\0')
+		return -EINVAL;
+	shmem_huge_min_size = size;
+	SHMEM_SB(shm_mnt->mnt_sb)->huge_min_size = size;
+	return end - buf;
+}
+
+struct kobj_attribute shmem_min_size_attr =
+	__ATTR(shmem_min_size, 0644, shmem_min_size_show, shmem_min_size_store);
 #endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */
 
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-11-29 11:11 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-21 18:51 [PATCHv4] shmem: avoid huge pages for small files Kirill A. Shutemov
2016-10-21 22:46 ` Kirill A. Shutemov
2016-10-24 12:43   ` Michal Hocko
2016-11-07 23:17   ` Hugh Dickins
2016-11-10 16:25     ` Kirill A. Shutemov
2016-11-10 17:42       ` [PATCH] " kbuild test robot
2016-11-10 17:51         ` Kirill A. Shutemov
2016-11-11 21:41       ` [PATCHv4] " Hugh Dickins
2016-11-14 14:09         ` Kirill A. Shutemov
2016-11-29  3:56           ` Hugh Dickins
2016-11-29 11:11             ` Kirill A. Shutemov
  -- strict thread matches above, loose matches on Subject: below --
2016-10-17 12:18 [PATCH] " Kirill A. Shutemov
2016-10-17 12:30 ` Kirill A. Shutemov
2016-10-17 14:12   ` Michal Hocko
2016-10-17 14:55     ` Kirill A. Shutemov
2016-10-18 14:20       ` Michal Hocko
2016-10-18 14:32         ` Kirill A. Shutemov
2016-10-18 18:30           ` Michal Hocko
2016-10-19 18:13             ` Hugh Dickins
2016-10-20 10:39               ` Kirill A. Shutemov
2016-10-20 22:46                 ` Dave Chinner
2016-10-21  2:01                   ` Andi Kleen
2016-10-21  5:01                     ` Dave Chinner
2016-10-21 15:00                       ` Kirill A. Shutemov
2016-10-21 15:12                         ` Michal Hocko
2016-10-21 22:50                         ` Dave Chinner
2016-10-21 23:32                           ` Kirill A. Shutemov
2016-10-24 20:34                           ` Dave Hansen
2016-10-25  5:28                             ` Dave Chinner
