linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Flushing transparent hugepages
@ 2020-08-18 15:07 Matthew Wilcox
  2020-08-18 16:08 ` Will Deacon
  2020-09-09 10:26 ` Aneesh Kumar K.V
  0 siblings, 2 replies; 4+ messages in thread
From: Matthew Wilcox @ 2020-08-18 15:07 UTC (permalink / raw)
  To: linux-arch
  Cc: Vineet Gupta, linux-snps-arc, Russell King, linux-arm-kernel,
	Catalin Marinas, Will Deacon, Thomas Bogendoerfer, linux-mips,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, David S. Miller, sparclinux, linux-mm

If your arch does not support HAVE_ARCH_TRANSPARENT_HUGEPAGE, you can
stop reading now.  Although maybe you're curious about adding support.

$ git grep -w HAVE_ARCH_TRANSPARENT_HUGEPAGE arch
arch/Kconfig:config HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/arc/Kconfig:config HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/arm/Kconfig:config HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/arm64/Kconfig:     select HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/mips/Kconfig:      select HAVE_ARCH_TRANSPARENT_HUGEPAGE if CPU_SUPPORTS_HUGEPAGES
arch/powerpc/platforms/Kconfig.cputype: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/s390/Kconfig:      select HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/sparc/Kconfig:     select HAVE_ARCH_TRANSPARENT_HUGEPAGE
arch/x86/Kconfig:       select HAVE_ARCH_TRANSPARENT_HUGEPAGE

If your arch does not implement flush_dcache_page(), you can also
stop reading.

$ for i in arc arm arm64 mips powerpc s390 sparc x86; do git grep -l flush_dcache_page arch/$i/include; done
arch/arc/include/asm/cacheflush.h
arch/arm/include/asm/cacheflush.h
arch/arm64/include/asm/cacheflush.h
arch/mips/include/asm/cacheflush.h
arch/powerpc/include/asm/cacheflush.h
arch/sparc/include/asm/cacheflush_32.h
arch/sparc/include/asm/cacheflush_64.h
arch/sparc/include/asm/pgtable_64.h

OK, so we're down to arc, arm, arm64, mips, powerpc & sparc.  Hi!  ;-)

I'm working on adding THP support for filesystems with storage backing
and part of that is expanding the definition of THP to be any order
(ie any power of two of PAGE_SIZE).  Now, shmem already has some calls
to flush_dcache_page() for THPs, for example:

        if (sgp != SGP_WRITE && !PageUptodate(page)) {
                struct page *head = compound_head(page);
                int i;

                for (i = 0; i < compound_nr(head); i++) {
                        clear_highpage(head + i);
                        flush_dcache_page(head + i);
                }
                SetPageUptodate(head);
        }

where you'll be called once for each subpage.  But ... these are error
paths, and I'm sure you all diligently test cache coherency scenarios
of error paths in shmem ... right?

For example, arm64 seems confused in this scenario:

void flush_dcache_page(struct page *page)
{
        if (test_bit(PG_dcache_clean, &page->flags))
                clear_bit(PG_dcache_clean, &page->flags);
}

...

void __sync_icache_dcache(pte_t pte)
{
        struct page *page = pte_page(pte);

        if (!test_and_set_bit(PG_dcache_clean, &page->flags))
                sync_icache_aliases(page_address(page), page_size(page));
}

So arm64 keeps track on a per-page basis which ones have been flushed.
page_size() will return PAGE_SIZE if called on a tail page or regular
page, but will return PAGE_SIZE << compound_order if called on a head
page.  So this will either over-flush, or it's missing the opportunity
to clear the bits on all the subpages which have now been flushed.

PowerPC has special handling of hugetlbfs pages.  Well, that's what
the config option says, but actually it handles THP as well.  If
the config option is enabled.

#ifdef CONFIG_HUGETLB_PAGE
        if (PageCompound(page)) {
                flush_dcache_icache_hugepage(page);
                return;
        }
#endif

By the way, THPs can be mapped askew -- that is, at an offset which
means you can't use a PMD to map a PMD sized page.

Anyway, we don't really have consensus between the various architectures
on how to handle either THPs or hugetlb pages.  It's not contemplated
in Documentation/core-api/cachetlb.rst so there's no real surprise
we've diverged.

What would you _like_ to see?  Would you rather flush_dcache_page()
were called once for each subpage, or would you rather maintain
the page-needs-flushing state once per compound page?  We could also
introduce flush_dcache_thp() if some architectures would prefer it one
way and one the other, although that brings into question what to do
for hugetlbfs pages.

It might not be a bad idea to centralise the handling of all this stuff
somewhere.  Sounds like the kind of thing Arnd would like to do ;-) I'll
settle for getting enough clear feedback about what the various arch
maintainers want that I can write a documentation update for cachetlb.rst.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Flushing transparent hugepages
  2020-08-18 15:07 Flushing transparent hugepages Matthew Wilcox
@ 2020-08-18 16:08 ` Will Deacon
  2020-08-28 17:06   ` Catalin Marinas
  2020-09-09 10:26 ` Aneesh Kumar K.V
  1 sibling, 1 reply; 4+ messages in thread
From: Will Deacon @ 2020-08-18 16:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-arch, Vineet Gupta, linux-snps-arc, Russell King,
	linux-arm-kernel, Catalin Marinas, Thomas Bogendoerfer,
	linux-mips, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, linuxppc-dev, David S. Miller, sparclinux,
	linux-mm

On Tue, Aug 18, 2020 at 04:07:36PM +0100, Matthew Wilcox wrote:
> For example, arm64 seems confused in this scenario:
> 
> void flush_dcache_page(struct page *page)
> {
>         if (test_bit(PG_dcache_clean, &page->flags))
>                 clear_bit(PG_dcache_clean, &page->flags);
> }
> 
> ...
> 
> void __sync_icache_dcache(pte_t pte)
> {
>         struct page *page = pte_page(pte);
> 
>         if (!test_and_set_bit(PG_dcache_clean, &page->flags))
>                 sync_icache_aliases(page_address(page), page_size(page));
> }
> 
> So arm64 keeps track on a per-page basis which ones have been flushed.
> page_size() will return PAGE_SIZE if called on a tail page or regular
> page, but will return PAGE_SIZE << compound_order if called on a head
> page.  So this will either over-flush, or it's missing the opportunity
> to clear the bits on all the subpages which have now been flushed.

Hmm, that seems to go all the way back to 2014 as the result of a bug fix
in 923b8f5044da ("arm64: mm: Make icache synchronisation logic huge page
aware") which has a Reported-by Mark and a CC stable, suggesting something
_was_ going wrong at the time :/ Was there a point where the tail pages
could end up with PG_arch_1 uncleared on allocation?

> What would you _like_ to see?  Would you rather flush_dcache_page()
> were called once for each subpage, or would you rather maintain
> the page-needs-flushing state once per compound page?  We could also
> introduce flush_dcache_thp() if some architectures would prefer it one
> way and one the other, although that brings into question what to do
> for hugetlbfs pages.

For arm64, we'd like to see PG_arch_1 preserved during huge page splitting
[1], but there was a worry that it might break x86 and s390. It's also not
clear to me that we can change __sync_icache_dcache() as it's called when
we're installing the entry in the page-table, so why would it be called
again for the tail pages?

Will

[1] https://lore.kernel.org/linux-arch/20200703153718.16973-8-catalin.marinas@arm.com/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Flushing transparent hugepages
  2020-08-18 16:08 ` Will Deacon
@ 2020-08-28 17:06   ` Catalin Marinas
  0 siblings, 0 replies; 4+ messages in thread
From: Catalin Marinas @ 2020-08-28 17:06 UTC (permalink / raw)
  To: Will Deacon
  Cc: Matthew Wilcox, linux-arch, Vineet Gupta, linux-snps-arc,
	Russell King, linux-arm-kernel, Thomas Bogendoerfer, linux-mips,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	linuxppc-dev, David S. Miller, sparclinux, linux-mm

On Tue, Aug 18, 2020 at 05:08:16PM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 04:07:36PM +0100, Matthew Wilcox wrote:
> > For example, arm64 seems confused in this scenario:
> > 
> > void flush_dcache_page(struct page *page)
> > {
> >         if (test_bit(PG_dcache_clean, &page->flags))
> >                 clear_bit(PG_dcache_clean, &page->flags);
> > }
> > 
> > ...
> > 
> > void __sync_icache_dcache(pte_t pte)
> > {
> >         struct page *page = pte_page(pte);
> > 
> >         if (!test_and_set_bit(PG_dcache_clean, &page->flags))
> >                 sync_icache_aliases(page_address(page), page_size(page));
> > }
> > 
> > So arm64 keeps track on a per-page basis which ones have been flushed.
> > page_size() will return PAGE_SIZE if called on a tail page or regular
> > page, but will return PAGE_SIZE << compound_order if called on a head
> > page.  So this will either over-flush, or it's missing the opportunity
> > to clear the bits on all the subpages which have now been flushed.
> 
> Hmm, that seems to go all the way back to 2014 as the result of a bug fix
> in 923b8f5044da ("arm64: mm: Make icache synchronisation logic huge page
> aware") which has a Reported-by Mark and a CC stable, suggesting something
> _was_ going wrong at the time :/ Was there a point where the tail pages
> could end up with PG_arch_1 uncleared on allocation?

In my experience, it's the other way around: you can end up with
PG_arch_1 cleared in a tail page when the head one was set (splitting
THP).

> > What would you _like_ to see?  Would you rather flush_dcache_page()
> > were called once for each subpage, or would you rather maintain
> > the page-needs-flushing state once per compound page?  We could also
> > introduce flush_dcache_thp() if some architectures would prefer it one
> > way and one the other, although that brings into question what to do
> > for hugetlbfs pages.
> 
> For arm64, we'd like to see PG_arch_1 preserved during huge page splitting
> [1], but there was a worry that it might break x86 and s390. It's also not
> clear to me that we can change __sync_icache_dcache() as it's called when
> we're installing the entry in the page-table, so why would it be called
> again for the tail pages?

Indeed, __sync_icache_dcache() is called from set_pte_at() on the head
page, though it could always iterate and flush the tail pages
individually (I think we could have done this in commit 923b8f5044da).
Currently I suspect it does some over-flushing if you use THP on
executable pages (it's a no-op on non-exec pages).

With MTE (arm64 memory tagging) I'm introducing a PG_arch_2 flag and
losing this is more problematic as it can lead to clearing valid tags.
In the subsequent patch [2], mte_sync_tags() (also called from
set_pte_at()) checks the PG_arch_2 in each page of a compound one.

My preference would be to treat both PG_arch_1 and _2 similarly.

> [1] https://lore.kernel.org/linux-arch/20200703153718.16973-8-catalin.marinas@arm.com/

[2] https://lore.kernel.org/linux-arch/20200703153718.16973-9-catalin.marinas@arm.com/

-- 
Catalin

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Flushing transparent hugepages
  2020-08-18 15:07 Flushing transparent hugepages Matthew Wilcox
  2020-08-18 16:08 ` Will Deacon
@ 2020-09-09 10:26 ` Aneesh Kumar K.V
  1 sibling, 0 replies; 4+ messages in thread
From: Aneesh Kumar K.V @ 2020-09-09 10:26 UTC (permalink / raw)
  To: Matthew Wilcox, linux-arch
  Cc: Thomas Bogendoerfer, Vineet Gupta, linuxppc-dev, Russell King,
	linux-mips, linux-mm, Paul Mackerras, Catalin Marinas,
	sparclinux, linux-snps-arc, Will Deacon, David S. Miller,
	linux-arm-kernel

Matthew Wilcox <willy@infradead.org> writes:

> PowerPC has special handling of hugetlbfs pages.  Well, that's what
> the config option says, but actually it handles THP as well.  If
> the config option is enabled.
>
> #ifdef CONFIG_HUGETLB_PAGE
>         if (PageCompound(page)) {
>                 flush_dcache_icache_hugepage(page);
>                 return;
>         }
> #endif

I do have a change posted sometime back to avoid that confusion.
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200320103256.229365-1-aneesh.kumar@linux.ibm.com/

But IIUC we use the head page flags (PG_arch_1) to track whether we need
the flush or not.

>
> By the way, THPs can be mapped askew -- that is, at an offset which
> means you can't use a PMD to map a PMD sized page.
>
> Anyway, we don't really have consensus between the various architectures
> on how to handle either THPs or hugetlb pages.  It's not contemplated
> in Documentation/core-api/cachetlb.rst so there's no real surprise
> we've diverged.
>
> What would you _like_ to see?  Would you rather flush_dcache_page()
> were called once for each subpage, or would you rather maintain
> the page-needs-flushing state once per compound page?  We could also
> introduce flush_dcache_thp() if some architectures would prefer it one
> way and one the other, although that brings into question what to do
> for hugetlbfs pages.
>
> It might not be a bad idea to centralise the handling of all this stuff
> somewhere.  Sounds like the kind of thing Arnd would like to do ;-) I'll
> settle for getting enough clear feedback about what the various arch
> maintainers want that I can write a documentation update for cachetlb.rst.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-09-09 10:28 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-18 15:07 Flushing transparent hugepages Matthew Wilcox
2020-08-18 16:08 ` Will Deacon
2020-08-28 17:06   ` Catalin Marinas
2020-09-09 10:26 ` Aneesh Kumar K.V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).