* [PATCH v2] mm: allow deferred page init for vmemmap only
From: Pavel Tatashin @ 2018-05-10 11:53 UTC
  To: steven.sistare, daniel.m.jordan, akpm, linux-kernel, tglx,
	mhocko, linux-mm, mgorman, mingo, peterz, rostedt, fengguang.wu,
	dennisszhou

It is unsafe to do virtual-to-physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because, when deferred struct
pages are used, struct pages for all of the allocated memory are initialized
only in mm_init().

My recent fix exposed this problem because it greatly reduced the number of
pages that are initialized before mm_init(), but, as Fengguang Wu found, the
problem existed even before my fix.

Below is a more detailed explanation of the problem.

We initialize struct pages in four places:

1. Early in boot, a small set of struct pages is initialized to fill
the first section and lower zones.
2. During mm_init(), we initialize struct pages for all the memory
that is allocated, i.e., reserved in memblock.
3. On demand, when pages are allocated after the mm_init() call (once
memblock is finished).
4. After smp_init(), when the remaining free deferred pages are initialized.
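To make the window between steps 1 and 2 concrete, here is a toy C model of the four initialization points. It is a sketch under stated assumptions, not kernel code: one bool per pfn stands in for "this struct page is initialized", and struct boot_state, NR_PFNS, boot_state_reset() and the step*_ helpers are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy model: one flag per pfn means "struct page initialised". */
#define NR_PFNS 64UL

struct boot_state {
	bool initialised[NR_PFNS];
	bool reserved[NR_PFNS];    /* reserved in memblock, i.e. allocated early */
};

void boot_state_reset(struct boot_state *s)
{
	memset(s, 0, sizeof(*s));
}

static void init_range(struct boot_state *s, unsigned long start,
		       unsigned long end)
{
	assert(start <= end && end <= NR_PFNS);
	for (unsigned long pfn = start; pfn < end; pfn++)
		s->initialised[pfn] = true;
}

/* Step 1: early boot initialises only a small prefix (first section, lower zones). */
void step1_early(struct boot_state *s, unsigned long early_end)
{
	init_range(s, 0, early_end);
}

/* Step 2: mm_init() initialises every page that memblock has reserved. */
void step2_mm_init(struct boot_state *s)
{
	for (unsigned long pfn = 0; pfn < NR_PFNS; pfn++)
		if (s->reserved[pfn])
			s->initialised[pfn] = true;
}

/* Step 3: after mm_init(), a page is initialised on demand when allocated. */
void step3_alloc(struct boot_state *s, unsigned long pfn)
{
	assert(pfn < NR_PFNS);
	s->initialised[pfn] = true;
}

/* Step 4: after smp_init(), the remaining free deferred pages are initialised. */
void step4_deferred(struct boot_state *s)
{
	for (unsigned long pfn = 0; pfn < NR_PFNS; pfn++)
		if (!s->initialised[pfn])
			s->initialised[pfn] = true;
}
```

With early_end = 16 and pfn 40 reserved, pfn 40 stays uninitialized after step1_early() and only becomes initialized in step2_mm_init(); in this model, anything that reads its struct page in between sees uninitialized state.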

The problem occurs if we try to do a virtual-to-physical translation of
memory between steps 1 and 2. Because we have not yet initialized struct
pages for all the reserved pages, it is inherently unsafe to do this
translation if the translation itself requires access to "struct page", as
is the case with this combination: CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP.

Here is a sample path, where translation is required, that occurs before
mm_init():

start_kernel()
 trap_init()
  setup_cpu_entry_areas()
   setup_cpu_entry_area(cpu)
    get_cpu_gdt_paddr(cpu)
     per_cpu_ptr_to_phys(addr)
      pcpu_addr_to_page(addr)
       virt_to_page(addr)
        pfn_to_page(__pa(addr) >> PAGE_SHIFT)
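The following hypothetical C model illustrates why that configuration matters. It is a simplified sketch, not the kernel implementation: the section size, the flags layout (section number in the top byte), the 0xff poison pattern and all model_* names are assumptions. Under classic SPARSEMEM, going from a struct page back to a pfn reads page->flags to find the section, while under SPARSEMEM_VMEMMAP it is plain pointer arithmetic against a virtually contiguous array. For simplicity, the model keeps all sections' struct pages in one array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified model; sizes and layout are assumptions. */
#define MODEL_PAGE_SHIFT        12
#define MODEL_PAGE_OFFSET       0x100000ULL  /* toy direct-map base */
#define MODEL_PAGES_PER_SECTION 4ULL
#define MODEL_NR_SECTIONS       4ULL
#define MODEL_NR_PAGES          (MODEL_PAGES_PER_SECTION * MODEL_NR_SECTIONS)
#define MODEL_SECTION_SHIFT     56           /* section number in top byte */
#define MODEL_BAD_PFN           UINT64_MAX

struct page { uint64_t flags; };

/* All struct pages in one array (real SPARSEMEM has one mem_map per section). */
struct page model_pages[MODEL_NR_PAGES];

/* Toy __pa(): direct-map virtual address minus the base. */
uint64_t model_pa(uint64_t vaddr)
{
	return vaddr - MODEL_PAGE_OFFSET;
}

/* pfn -> struct page needs only the pfn in both memory models. */
struct page *model_pfn_to_page(uint64_t pfn)
{
	assert(pfn < MODEL_NR_PAGES);
	return &model_pages[pfn];
}

/* Mirrors virt_to_page(addr) == pfn_to_page(__pa(addr) >> PAGE_SHIFT). */
struct page *model_virt_to_page(uint64_t vaddr)
{
	return model_pfn_to_page(model_pa(vaddr) >> MODEL_PAGE_SHIFT);
}

/* Classic SPARSEMEM: the section number comes from page->flags. */
uint64_t sparse_page_to_pfn(const struct page *page)
{
	uint64_t nr = page->flags >> MODEL_SECTION_SHIFT;

	if (nr >= MODEL_NR_SECTIONS)
		return MODEL_BAD_PFN;   /* flags never initialised (or corrupt) */
	const struct page *base = &model_pages[nr * MODEL_PAGES_PER_SECTION];
	return nr * MODEL_PAGES_PER_SECTION + (uint64_t)(page - base);
}

/* SPARSEMEM_VMEMMAP: pure pointer arithmetic, the struct page is not read. */
uint64_t vmemmap_page_to_pfn(const struct page *page)
{
	return (uint64_t)(page - model_pages);
}

/* "Initialise" a struct page: record its section number in flags. */
void model_init_page(uint64_t pfn)
{
	model_pfn_to_page(pfn)->flags =
		(pfn / MODEL_PAGES_PER_SECTION) << MODEL_SECTION_SHIFT;
}

/* Deferred init: everything starts poisoned with 0xff bytes. */
void model_poison_all(void)
{
	memset(model_pages, 0xff, sizeof(model_pages));
}
```

In the model, a page whose struct page is still poisoned translates correctly with vmemmap_page_to_pfn() but yields MODEL_BAD_PFN from sparse_page_to_pfn() until model_init_page() runs for it. As we understand the path above, the caller goes on to turn the struct page back into a physical address, which is where the classic SPARSEMEM flags read comes in.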

The problems are discussed in these threads:
http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com

Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index d5004d82a1d6..1cd32d67ca30 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,7 +635,7 @@ config DEFERRED_STRUCT_PAGE_INIT
 	bool "Defer initialisation of struct pages to kthreads"
 	default n
 	depends on NO_BOOTMEM
-	depends on !FLATMEM
+	depends on SPARSEMEM_VMEMMAP
 	help
 	  Ordinarily all struct pages are initialised during early boot in a
 	  single thread. On very large machines this can take a considerable
-- 
2.17.0

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Michal Hocko @ 2018-05-10 12:30 UTC
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, akpm, linux-kernel, tglx,
	linux-mm, mgorman, mingo, peterz, rostedt, fengguang.wu,
	dennisszhou

On Thu 10-05-18 07:53:56, Pavel Tatashin wrote:
[...]
> Here is a sample path, where translation is required, that occurs before
> mm_init():
> 
> start_kernel()
>  trap_init()
>   setup_cpu_entry_areas()
>    setup_cpu_entry_area(cpu)
>     get_cpu_gdt_paddr(cpu)
>      per_cpu_ptr_to_phys(addr)
>       pcpu_addr_to_page(addr)
>        virt_to_page(addr)
>         pfn_to_page(__pa(addr) >> PAGE_SHIFT)

Thanks, that helped me see the problem. On the other hand, isn't this a
bit of an overkill? AFAICS this affects only NEED_PER_CPU_KM, which is !SMP,
and DEFERRED_STRUCT_PAGE_INIT makes only very limited sense on UP,
right?

Or do we have more such places?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Pavel Tatashin @ 2018-05-11 14:17 UTC
  To: mhocko
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

> Thanks, that helped me see the problem. On the other hand, isn't this a
> bit of an overkill? AFAICS this affects only NEED_PER_CPU_KM, which is !SMP,
> and DEFERRED_STRUCT_PAGE_INIT makes only very limited sense on UP,
> right?

> Or do we have more such places?

I do not know of other places, but my worry is that trap_init() is
arch-specific, and we cannot guarantee that arches won't do virt to phys
elsewhere in trap_init(). Therefore, I think the proper fix is simply to
allow DEFERRED_STRUCT_PAGE_INIT only when it is safe to do virt to phys
without accessing struct pages, which is the case with SPARSEMEM_VMEMMAP.

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Michal Hocko @ 2018-05-15 9:10 UTC
  To: Pavel Tatashin
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

On Fri 11-05-18 10:17:55, Pavel Tatashin wrote:
> > Thanks, that helped me see the problem. On the other hand, isn't this a
> > bit of an overkill? AFAICS this affects only NEED_PER_CPU_KM, which is !SMP,
> > and DEFERRED_STRUCT_PAGE_INIT makes only very limited sense on UP,
> > right?
> 
> > Or do we have more such places?
> 
> I do not know of other places, but my worry is that trap_init() is
> arch-specific, and we cannot guarantee that arches won't do virt to phys
> elsewhere in trap_init(). Therefore, I think the proper fix is simply to
> allow DEFERRED_STRUCT_PAGE_INIT only when it is safe to do virt to phys
> without accessing struct pages, which is the case with SPARSEMEM_VMEMMAP.

You are now disabling a potentially useful feature to SPARSEMEM users
without having any evidence that they do suffer from the issue which is
kinda sad. Especially when the only known offender is a UP pcp allocator
implementation.

I will not insist of course but it seems like your fix doesn't really
prevent virt_to_page or other direct page access either.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Pavel Tatashin @ 2018-05-15 12:17 UTC
  To: mhocko
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

Hi Michal,

Thank you for your reply, my comments below:

> You are now disabling a potentially useful feature to SPARSEMEM users
> without having any evidence that they do suffer from the issue which is
> kinda sad. Especially when the only known offender is a UP pcp allocator
> implementation.

True, but what is the use case for having SPARSEMEM without virtual mapping
together with deferred struct page init? Is it common to have multiple
gigabytes of memory and a NUMA config that benefits from deferred page init,
and yet not have memory for the virtual mapping of struct pages? Or am I
missing some common case here?

> I will not insist of course but it seems like your fix doesn't really
> prevent virt_to_page or other direct page access either.

I am not sure what you mean. I do not prevent virt_to_page, but that is
OK for the SPARSEMEM_VMEMMAP case, because we do not need to access "struct
page" for this operation, as the translation is in the page table. Yes, we do
not prohibit other struct page accesses before mm_init(), but we now have a
feature that checks for uninitialized struct page access, and if such accesses
happen, we will learn about them.
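The uninitialized struct page check mentioned above relies on poisoning struct pages before they are initialized. A minimal C sketch of that idea follows, assuming (for illustration only) an all-ones poison pattern; struct page is reduced to a flags word, and model_poison_pages(), model_page_poisoned() and model_read_flags() are hypothetical helpers, not the kernel's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: struct page reduced to its flags word. */
struct page { uint64_t flags; };

/* Assumed poison: every byte 0xff, so flags reads back as all ones. */
#define MODEL_POISON_FLAGS UINT64_MAX

/* Poison struct pages that will only be initialised later. */
void model_poison_pages(struct page *pages, size_t n)
{
	memset(pages, 0xff, n * sizeof(*pages));
}

bool model_page_poisoned(const struct page *page)
{
	return page->flags == MODEL_POISON_FLAGS;
}

/* Checked accessor: refuse to hand out flags of an uninitialised page. */
bool model_read_flags(const struct page *page, uint64_t *out)
{
	if (model_page_poisoned(page))
		return false;   /* the uninitialised access is detected here */
	*out = page->flags;
	return true;
}
```

A real debug check would typically warn or BUG at the point where model_read_flags() returns false; the sketch only reports it so the caller can decide what to do.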

Thank you,
Pavel

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Michal Hocko @ 2018-05-15 12:55 UTC
  To: Pavel Tatashin
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

On Tue 15-05-18 08:17:27, Pavel Tatashin wrote:
> Hi Michal,
> 
> Thank you for your reply, my comments below:
> 
> > You are now disabling a potentially useful feature to SPARSEMEM users
> > without having any evidence that they do suffer from the issue which is
> > kinda sad. Especially when the only known offender is a UP pcp allocator
> > implementation.
> 
> True, but what is the use case for having SPARSEMEM without virtual mapping
> together with deferred struct page init? Is it common to have multiple
> gigabytes of memory and a NUMA config that benefits from deferred page init,
> and yet not have memory for the virtual mapping of struct pages? Or am I
> missing some common case here?

Well, I strongly suspect that this is more momentum than a real
reason to stick with SPARSEMEM_MANUAL. I would really love to reduce the
number of memory models we have. Getting rid of SPARSEMEM would be a
good start, as VMEMMAP should be much better.
 
> > I will not insist of course but it seems like your fix doesn't really
> > prevent virt_to_page or other direct page access either.
> 
> I am not sure what you mean. I do not prevent virt_to_page, but that is
> OK for the SPARSEMEM_VMEMMAP case, because we do not need to access "struct
> page" for this operation, as the translation is in the page table. Yes, we do
> not prohibit other struct page accesses before mm_init(), but we now have a
> feature that checks for uninitialized struct page access, and if such accesses
> happen, we will learn about them.

This will always be a maze, as early boot tends to be. Sad but true.
That is why I am not really convinced we should use a large hammer and
disallow deferred page initialization just because the UP implementation of
pcp does something too early. We should instead rule out that one odd case.
Your patch simply doesn't rule out a large class of potential issues. It
just rules out a potentially useful feature for an odd case. See my
point?
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Pavel Tatashin @ 2018-05-15 15:59 UTC
  To: mhocko
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

> This will always be a maze, as early boot tends to be. Sad but true.
> That is why I am not really convinced we should use a large hammer and
> disallow deferred page initialization just because the UP implementation of
> pcp does something too early. We should instead rule out that one odd case.
> Your patch simply doesn't rule out a large class of potential issues. It
> just rules out a potentially useful feature for an odd case. See my
> point?

Hi Michal,

OK, I will send an updated patch that disables deferred pages only when
NEED_PER_CPU_KM is set. Hopefully, we won't see similar issues in other places.
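The two Kconfig rules discussed in this thread can be compared as boolean predicates. This is a hypothetical C encoding, not generated from Kconfig: struct kcfg and both function names are invented for illustration, and the follow-up rule assumes the existing NO_BOOTMEM and !FLATMEM dependencies from the diff are kept while !NEED_PER_CPU_KM is added.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the relevant Kconfig symbols as booleans. */
struct kcfg {
	bool no_bootmem;
	bool flatmem;
	bool sparsemem_vmemmap;
	bool need_per_cpu_km;   /* per the thread, only relevant on !SMP (UP) */
};

/* Dependencies of DEFERRED_STRUCT_PAGE_INIT after this v2 patch. */
bool deferred_allowed_v2(const struct kcfg *c)
{
	return c->no_bootmem && c->sparsemem_vmemmap;
}

/* Follow-up as proposed: keep !FLATMEM, additionally exclude NEED_PER_CPU_KM. */
bool deferred_allowed_followup(const struct kcfg *c)
{
	return c->no_bootmem && !c->flatmem && !c->need_per_cpu_km;
}
```

For an SMP SPARSEMEM configuration without VMEMMAP, the v2 rule disables deferred init while the follow-up rule keeps it, which is the case Michal was concerned about; a UP configuration with NEED_PER_CPU_KM is rejected by both.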

Thank you,
Pavel

* Re: [PATCH v2] mm: allow deferred page init for vmemmap only
From: Michal Hocko @ 2018-05-15 20:38 UTC
  To: Pavel Tatashin
  Cc: Steven Sistare, Daniel Jordan, Andrew Morton, LKML, tglx,
	Linux Memory Management List, mgorman, mingo, peterz,
	Steven Rostedt, Fengguang Wu, Dennis Zhou

On Tue 15-05-18 11:59:25, Pavel Tatashin wrote:
> > This will always be a maze, as early boot tends to be. Sad but true.
> > That is why I am not really convinced we should use a large hammer and
> > disallow deferred page initialization just because the UP implementation of
> > pcp does something too early. We should instead rule out that one odd case.
> > Your patch simply doesn't rule out a large class of potential issues. It
> > just rules out a potentially useful feature for an odd case. See my
> > point?
> 
> Hi Michal,
> 
> OK, I will send an updated patch that disables deferred pages only when
> NEED_PER_CPU_KM is set. Hopefully, we won't see similar issues in other places.

If we do we will probably need to think more about a more systematic
solution.
-- 
Michal Hocko
SUSE Labs
