linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
@ 2019-09-24 14:36 David Hildenbrand
  2019-09-24 14:48 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-09-24 14:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Oscar Salvador,
	Michal Hocko, Pavel Tatashin, Dan Williams, Thomas Gleixner

Since commit 3f906ba23689 ("mm/memory-hotplug: switch locking to a percpu
rwsem") we do a cpus_read_lock() in mem_hotplug_begin(). This was
introduced to fix a potential deadlock between get_online_mems() and
get_online_cpus() - the memory and cpu hotplug lock. The root issue was
that build_all_zonelists() -> stop_machine() required the cpu hotplug lock:
    The reason is that memory hotplug takes the memory hotplug lock and
    then calls stop_machine() which calls get_online_cpus().  That's the
    reverse lock order to get_online_cpus(); get_online_mems(); in
    mm/slub_common.c

So memory hotplug never really required any cpu lock itself, only
stop_machine() and lru_add_drain_all() required it. Back then,
stop_machine_cpuslocked() and lru_add_drain_all_cpuslocked() were used
as the cpu hotplug lock was now obtained in the caller.

Since commit 11cd8638c37f ("mm, page_alloc: remove stop_machine from
build_all_zonelists"), the stop_machine_cpuslocked() call is gone.
build_all_zonelists() no longer requires the cpu hotplug lock and no
longer makes use of stop_machine().

Since commit 9852a7212324 ("mm: drop hotplug lock from
lru_add_drain_all()"), lru_add_drain_all() "Doesn't need any cpu hotplug
locking because we do rely on per-cpu kworkers being shut down before our
page_alloc_cpu_dead callback is executed on the offlined cpu". The
lru_add_drain_all_cpuslocked() variant was removed.

So there is nothing left that requires the cpu hotplug lock. The memory
hotplug lock and the device hotplug lock are sufficient.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

RFC -> v1:
- Reword and add more details why the cpu hotplug lock was needed here
  in the first place, and why we no longer require it.

---
 mm/memory_hotplug.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c3e9aed6023f..5fa30f3010e1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -88,14 +88,12 @@ __setup("memhp_default_state=", setup_memhp_default_state);
 
 void mem_hotplug_begin(void)
 {
-	cpus_read_lock();
 	percpu_down_write(&mem_hotplug_lock);
 }
 
 void mem_hotplug_done(void)
 {
 	percpu_up_write(&mem_hotplug_lock);
-	cpus_read_unlock();
 }
 
 u64 max_mem_size = U64_MAX;
-- 
2.21.0



* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 14:36 [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock David Hildenbrand
@ 2019-09-24 14:48 ` Michal Hocko
  2019-09-24 15:01   ` David Hildenbrand
  2019-09-24 15:03 ` Qian Cai
  2019-10-02 21:37 ` Qian Cai
  2 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-09-24 14:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue 24-09-19 16:36:15, David Hildenbrand wrote:
> Since commit 3f906ba23689 ("mm/memory-hotplug: switch locking to a percpu
> rwsem") we do a cpus_read_lock() in mem_hotplug_begin(). This was
> introduced to fix a potential deadlock between get_online_mems() and
> get_online_cpus() - the memory and cpu hotplug lock. The root issue was
> that build_all_zonelists() -> stop_machine() required the cpu hotplug lock:
>     The reason is that memory hotplug takes the memory hotplug lock and
>     then calls stop_machine() which calls get_online_cpus().  That's the
>     reverse lock order to get_online_cpus(); get_online_mems(); in
>     mm/slub_common.c
> 
> So memory hotplug never really required any cpu lock itself, only
> stop_machine() and lru_add_drain_all() required it. Back then,
> stop_machine_cpuslocked() and lru_add_drain_all_cpuslocked() were used
> as the cpu hotplug lock was now obtained in the caller.
> 
> Since commit 11cd8638c37f ("mm, page_alloc: remove stop_machine from build
> all_zonelists"), the stop_machine_cpuslocked() call is gone.
> build_all_zonelists() does no longer require the cpu lock and does no
> longer make use of stop_machine().
> 
> Since commit 9852a7212324 ("mm: drop hotplug lock from
> lru_add_drain_all()"), lru_add_drain_all() "Doesn't need any cpu hotplug
> locking because we do rely on per-cpu kworkers being shut down before our
> page_alloc_cpu_dead callback is executed on the offlined cpu.". The
> lru_add_drain_all_cpuslocked() variant was removed.
> 
> So there is nothing left that requires the cpu hotplug lock. The memory
> hotplug lock and the device hotplug lock are sufficient.

I would love to see this happen. The biggest offenders should be gone.
I really hated how those two locks have been conflated, which likely
resulted in some undocumented/unintended dependencies. So for now, I
cannot really tell you whether the patch is correct. It would really
require a lot of testing. I am not sure this is reasonably reviewable.

So please add some testing results (ideally cpu hotplug racing a lot
with the memory hotplug). Then I would be willing to give this a try
and see. First by keeping it in linux-next for a release or two and then
eyes closed, fingers crossed and go to the wild. Do we have a tag for
that? Dared-by, maybe?

> [...]

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 14:48 ` Michal Hocko
@ 2019-09-24 15:01   ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-09-24 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 24.09.19 16:48, Michal Hocko wrote:
> On Tue 24-09-19 16:36:15, David Hildenbrand wrote:
>> [...]
> 
> I would love to see this happen. The biggest offenders should be gone. 
> I really hated how those two locks have been conflated which likely
> resulted in some undocumented/unintended dependencies. So for now, I
> cannot really tell you whether the patch is correct. It would really
> require a lot of testing. I am not sure this is reasonably reviewable.

At least judging from the obvious history, this should be fine. But the
risk is undocumented side effects that have sneaked in in the meantime.
I agree that reviewing this is hard - too much hidden and undocumented
complexity.

> 
> So please add some testing results (ideally cpu hotplug racing a lot
> with the memory hotplug). Then I would be willing to give this a try
> and see. First by keeping it in linux-next for a release or two and then
> eyes closed, fingers crossed and go to the wild. Do we have a tag for
> that Dared-by maybe?

I'll do some more testing to try to find the obvious issues. But I
suspect if there are still some leftovers, they will be hard to trigger.
So I agree that keeping this in linux-next for a longer period sounds
like a good idea.

Tentatively-Acked-by ? :D

>> [...]


-- 

Thanks,

David / dhildenb


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 14:36 [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock David Hildenbrand
  2019-09-24 14:48 ` Michal Hocko
@ 2019-09-24 15:03 ` Qian Cai
  2019-09-24 15:11   ` Michal Hocko
  2019-09-24 15:23   ` David Hildenbrand
  2019-10-02 21:37 ` Qian Cai
  2 siblings, 2 replies; 23+ messages in thread
From: Qian Cai @ 2019-09-24 15:03 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue, 2019-09-24 at 16:36 +0200, David Hildenbrand wrote:
> [...]
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c3e9aed6023f..5fa30f3010e1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -88,14 +88,12 @@ __setup("memhp_default_state=", setup_memhp_default_state);
>  
>  void mem_hotplug_begin(void)
>  {
> -	cpus_read_lock();
>  	percpu_down_write(&mem_hotplug_lock);
>  }
>  
>  void mem_hotplug_done(void)
>  {
>  	percpu_up_write(&mem_hotplug_lock);
> -	cpus_read_unlock();
>  }
>  
>  u64 max_mem_size = U64_MAX;

While at it, it might be a good time to rethink the whole locking over
there, as right now reading files under /sys/kernel/slab/ could trigger
a possible deadlock anyway.

[  442.258806][ T5224] WARNING: possible circular locking dependency detected
[  442.265678][ T5224] 5.3.0-rc7-mm1+ #6 Tainted: G             L   
[  442.271766][ T5224] ------------------------------------------------------
[  442.278635][ T5224] cat/5224 is trying to acquire lock:
[  442.283857][ T5224] ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
show_slab_objects+0x94/0x3a8
[  442.293189][ T5224] 
[  442.293189][ T5224] but task is already holding lock:
[  442.300404][ T5224] b8ff009693eee398 (kn->count#45){++++}, at:
kernfs_seq_start+0x44/0xf0
[  442.308587][ T5224] 
[  442.308587][ T5224] which lock already depends on the new lock.
[  442.308587][ T5224] 
[  442.318841][ T5224] 
[  442.318841][ T5224] the existing dependency chain (in reverse order) is:
[  442.327705][ T5224] 
[  442.327705][ T5224] -> #2 (kn->count#45){++++}:
[  442.334413][ T5224]        lock_acquire+0x31c/0x360
[  442.339286][ T5224]        __kernfs_remove+0x290/0x490
[  442.344428][ T5224]        kernfs_remove+0x30/0x44
[  442.349224][ T5224]        sysfs_remove_dir+0x70/0x88
[  442.354276][ T5224]        kobject_del+0x50/0xb0
[  442.358890][ T5224]        sysfs_slab_unlink+0x2c/0x38
[  442.364025][ T5224]        shutdown_cache+0xa0/0xf0
[  442.368898][ T5224]        kmemcg_cache_shutdown_fn+0x1c/0x34
[  442.374640][ T5224]        kmemcg_workfn+0x44/0x64
[  442.379428][ T5224]        process_one_work+0x4f4/0x950
[  442.384649][ T5224]        worker_thread+0x390/0x4bc
[  442.389610][ T5224]        kthread+0x1cc/0x1e8
[  442.394052][ T5224]        ret_from_fork+0x10/0x18
[  442.398835][ T5224] 
[  442.398835][ T5224] -> #1 (slab_mutex){+.+.}:
[  442.405365][ T5224]        lock_acquire+0x31c/0x360
[  442.410240][ T5224]        __mutex_lock_common+0x16c/0xf78
[  442.415722][ T5224]        mutex_lock_nested+0x40/0x50
[  442.420855][ T5224]        memcg_create_kmem_cache+0x38/0x16c
[  442.426598][ T5224]        memcg_kmem_cache_create_func+0x3c/0x70
[  442.432687][ T5224]        process_one_work+0x4f4/0x950
[  442.437908][ T5224]        worker_thread+0x390/0x4bc
[  442.442868][ T5224]        kthread+0x1cc/0x1e8
[  442.447307][ T5224]        ret_from_fork+0x10/0x18
[  442.452090][ T5224] 
[  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
[  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
[  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
[  442.469930][ T5224]        lock_acquire+0x31c/0x360
[  442.474803][ T5224]        get_online_mems+0x54/0x150
[  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
[  442.485072][ T5224]        total_objects_show+0x28/0x34
[  442.490292][ T5224]        slab_attr_show+0x38/0x54
[  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
[  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
[  442.505433][ T5224]        seq_read+0x30c/0x8a8
[  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
[  442.515007][ T5224]        __vfs_read+0x88/0x20c
[  442.519620][ T5224]        vfs_read+0xd8/0x10c
[  442.524060][ T5224]        ksys_read+0xb0/0x120
[  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
[  442.533634][ T5224]        el0_svc_handler+0x170/0x240
[  442.538768][ T5224]        el0_svc+0x8/0xc
[  442.542858][ T5224] 
[  442.542858][ T5224] other info that might help us debug this:
[  442.542858][ T5224] 
[  442.552936][ T5224] Chain exists of:
[  442.552936][ T5224]   mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
[  442.552936][ T5224] 
[  442.565803][ T5224]  Possible unsafe locking scenario:
[  442.565803][ T5224] 
[  442.573105][ T5224]        CPU0                    CPU1
[  442.578322][ T5224]        ----                    ----
[  442.583539][ T5224]   lock(kn->count#45);
[  442.587545][ T5224]                                lock(slab_mutex);
[  442.593898][ T5224]                                lock(kn->count#45);
[  442.600433][ T5224]   lock(mem_hotplug_lock.rw_sem);
[  442.605393][ T5224] 
[  442.605393][ T5224]  *** DEADLOCK ***
[  442.605393][ T5224] 
[  442.613390][ T5224] 3 locks held by cat/5224:
[  442.617740][ T5224]  #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at:
seq_read+0x4c/0x8a8
[  442.625399][ T5224]  #1: 0eff008997041480 (&of->mutex){+.+.}, at:
kernfs_seq_start+0x34/0xf0
[  442.633842][ T5224]  #2: b8ff009693eee398 (kn->count#45){++++}, at:
kernfs_seq_start+0x44/0xf0
[  442.642477][ T5224] 
[  442.642477][ T5224] stack backtrace:
[  442.648221][ T5224] CPU: 117 PID: 5224 Comm: cat Tainted:
G             L    5.3.0-rc7-mm1+ #6
[  442.656826][ T5224] Hardware name: HPE Apollo
70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
[  442.667253][ T5224] Call trace:
[  442.670391][ T5224]  dump_backtrace+0x0/0x248
[  442.674743][ T5224]  show_stack+0x20/0x2c
[  442.678750][ T5224]  dump_stack+0xd0/0x140
[  442.682841][ T5224]  print_circular_bug+0x368/0x380
[  442.687715][ T5224]  check_noncircular+0x248/0x250
[  442.692501][ T5224]  validate_chain+0xd10/0x2bcc
[  442.697115][ T5224]  __lock_acquire+0x7f4/0xb8c
[  442.701641][ T5224]  lock_acquire+0x31c/0x360
[  442.705993][ T5224]  get_online_mems+0x54/0x150
[  442.710519][ T5224]  show_slab_objects+0x94/0x3a8
[  442.715219][ T5224]  total_objects_show+0x28/0x34
[  442.719918][ T5224]  slab_attr_show+0x38/0x54
[  442.724271][ T5224]  sysfs_kf_seq_show+0x198/0x2d4
[  442.729056][ T5224]  kernfs_seq_show+0xa4/0xcc
[  442.733494][ T5224]  seq_read+0x30c/0x8a8
[  442.737498][ T5224]  kernfs_fop_read+0xa8/0x314
[  442.742025][ T5224]  __vfs_read+0x88/0x20c
[  442.746118][ T5224]  vfs_read+0xd8/0x10c
[  442.750036][ T5224]  ksys_read+0xb0/0x120
[  442.754042][ T5224]  __arm64_sys_read+0x54/0x88
[  442.758569][ T5224]  el0_svc_handler+0x170/0x240
[  442.763180][ T5224]  el0_svc+0x8/0xc


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 15:03 ` Qian Cai
@ 2019-09-24 15:11   ` Michal Hocko
  2019-09-24 18:54     ` Qian Cai
  2019-09-24 15:23   ` David Hildenbrand
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-09-24 15:11 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue 24-09-19 11:03:21, Qian Cai wrote:
[...]
> While at it, it might be a good time to rethink the whole locking over there, as
> it right now read files under /sys/kernel/slab/ could trigger a possible
> deadlock anyway.
> 
[...]
> [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> [  442.474803][ T5224]        get_online_mems+0x54/0x150
> [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> [  442.485072][ T5224]        total_objects_show+0x28/0x34
> [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> [  442.524060][ T5224]        ksys_read+0xb0/0x120
> [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> [  442.538768][ T5224]        el0_svc+0x8/0xc

I believe the lock is not really needed here. We do not deallocate the
pgdat of a hotremoved node nor destroy the slab state, because existing
slabs would prevent hotremove from continuing in the first place.

There are likely details to be checked of course but the lock just seems
bogus.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 15:03 ` Qian Cai
  2019-09-24 15:11   ` Michal Hocko
@ 2019-09-24 15:23   ` David Hildenbrand
  1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-09-24 15:23 UTC (permalink / raw)
  To: Qian Cai, linux-kernel
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 24.09.19 17:03, Qian Cai wrote:
> On Tue, 2019-09-24 at 16:36 +0200, David Hildenbrand wrote:
>> [...]
> 
> While at it, it might be a good time to rethink the whole locking over there, as
> it right now read files under /sys/kernel/slab/ could trigger a possible
> deadlock anyway.

We still have another (I think incorrect) splat when onlining/offlining
memory and then removing it e.g., via ACPI. It also involves a
(problematic?) "kn->count" lock in these cases.

E.g., when offlining memory (via sysfs) we first take the kn->count and
then the memory hotplug lock. When removing memory devices we first take
the memory hotplug lock and then the kn->count (to remove the sysfs
files). But at least there we have the device hotplug lock involved to
effectively prohibit any deadlocks. But it's somewhat similar to what
you pasted below (but there, it could as well be that we can just drop
the lock as Michal said).

I wonder if one general solution could be to replace the
device_hotplug_lock by a per-subsystem (memory, cpu) lock. Especially
in drivers/base/core.c:online_store and when creating/deleting devices
we would take that subsystem lock directly, and not far down in the
call paths.

There are only a handful of places (e.g., node onlining/offlining) where
cpu and memory hotplug can race. But we can handle these cases differently.

Eventually we could get rid of the device_hotplug_lock and use locks
per subsystem instead. That would at least be better compared to what we
have right now (we always need the device hotplug lock and the memory
hotplug lock in memory hotplug paths).

> 
> [  442.258806][ T5224] WARNING: possible circular locking dependency detected
> [  442.265678][ T5224] 5.3.0-rc7-mm1+ #6 Tainted: G             L   
> [  442.271766][ T5224] ------------------------------------------------------
> [  442.278635][ T5224] cat/5224 is trying to acquire lock:
> [  442.283857][ T5224] ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
> show_slab_objects+0x94/0x3a8
> [  442.293189][ T5224] 
> [  442.293189][ T5224] but task is already holding lock:
> [  442.300404][ T5224] b8ff009693eee398 (kn->count#45){++++}, at:
> kernfs_seq_start+0x44/0xf0
> [  442.308587][ T5224] 
> [  442.308587][ T5224] which lock already depends on the new lock.
> [  442.308587][ T5224] 
> [  442.318841][ T5224] 
> [  442.318841][ T5224] the existing dependency chain (in reverse order) is:
> [  442.327705][ T5224] 
> [  442.327705][ T5224] -> #2 (kn->count#45){++++}:
> [  442.334413][ T5224]        lock_acquire+0x31c/0x360
> [  442.339286][ T5224]        __kernfs_remove+0x290/0x490
> [  442.344428][ T5224]        kernfs_remove+0x30/0x44
> [  442.349224][ T5224]        sysfs_remove_dir+0x70/0x88
> [  442.354276][ T5224]        kobject_del+0x50/0xb0
> [  442.358890][ T5224]        sysfs_slab_unlink+0x2c/0x38
> [  442.364025][ T5224]        shutdown_cache+0xa0/0xf0
> [  442.368898][ T5224]        kmemcg_cache_shutdown_fn+0x1c/0x34
> [  442.374640][ T5224]        kmemcg_workfn+0x44/0x64
> [  442.379428][ T5224]        process_one_work+0x4f4/0x950
> [  442.384649][ T5224]        worker_thread+0x390/0x4bc
> [  442.389610][ T5224]        kthread+0x1cc/0x1e8
> [  442.394052][ T5224]        ret_from_fork+0x10/0x18
> [  442.398835][ T5224] 
> [  442.398835][ T5224] -> #1 (slab_mutex){+.+.}:
> [  442.405365][ T5224]        lock_acquire+0x31c/0x360
> [  442.410240][ T5224]        __mutex_lock_common+0x16c/0xf78
> [  442.415722][ T5224]        mutex_lock_nested+0x40/0x50
> [  442.420855][ T5224]        memcg_create_kmem_cache+0x38/0x16c
> [  442.426598][ T5224]        memcg_kmem_cache_create_func+0x3c/0x70
> [  442.432687][ T5224]        process_one_work+0x4f4/0x950
> [  442.437908][ T5224]        worker_thread+0x390/0x4bc
> [  442.442868][ T5224]        kthread+0x1cc/0x1e8
> [  442.447307][ T5224]        ret_from_fork+0x10/0x18
> [  442.452090][ T5224] 
> [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> [  442.474803][ T5224]        get_online_mems+0x54/0x150
> [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> [  442.485072][ T5224]        total_objects_show+0x28/0x34
> [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> [  442.524060][ T5224]        ksys_read+0xb0/0x120
> [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> [  442.538768][ T5224]        el0_svc+0x8/0xc
> [  442.542858][ T5224] 
> [  442.542858][ T5224] other info that might help us debug this:
> [  442.542858][ T5224] 
> [  442.552936][ T5224] Chain exists of:
> [  442.552936][ T5224]   mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
> [  442.552936][ T5224] 
> [  442.565803][ T5224]  Possible unsafe locking scenario:
> [  442.565803][ T5224] 
> [  442.573105][ T5224]        CPU0                    CPU1
> [  442.578322][ T5224]        ----                    ----
> [  442.583539][ T5224]   lock(kn->count#45);
> [  442.587545][ T5224]                                lock(slab_mutex);
> [  442.593898][ T5224]                                lock(kn->count#45);
> [  442.600433][ T5224]   lock(mem_hotplug_lock.rw_sem);
> [  442.605393][ T5224] 
> [  442.605393][ T5224]  *** DEADLOCK ***
> [  442.605393][ T5224] 
> [  442.613390][ T5224] 3 locks held by cat/5224:
> [  442.617740][ T5224]  #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at:
> seq_read+0x4c/0x8a8
> [  442.625399][ T5224]  #1: 0eff008997041480 (&of->mutex){+.+.}, at:
> kernfs_seq_start+0x34/0xf0
> [  442.633842][ T5224]  #2: b8ff009693eee398 (kn->count#45){++++}, at:
> kernfs_seq_start+0x44/0xf0
> [  442.642477][ T5224] 
> [  442.642477][ T5224] stack backtrace:
> [  442.648221][ T5224] CPU: 117 PID: 5224 Comm: cat Tainted:
> G             L    5.3.0-rc7-mm1+ #6
> [  442.656826][ T5224] Hardware name: HPE Apollo
> 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
> [  442.667253][ T5224] Call trace:
> [  442.670391][ T5224]  dump_backtrace+0x0/0x248
> [  442.674743][ T5224]  show_stack+0x20/0x2c
> [  442.678750][ T5224]  dump_stack+0xd0/0x140
> [  442.682841][ T5224]  print_circular_bug+0x368/0x380
> [  442.687715][ T5224]  check_noncircular+0x248/0x250
> [  442.692501][ T5224]  validate_chain+0xd10/0x2bcc
> [  442.697115][ T5224]  __lock_acquire+0x7f4/0xb8c
> [  442.701641][ T5224]  lock_acquire+0x31c/0x360
> [  442.705993][ T5224]  get_online_mems+0x54/0x150
> [  442.710519][ T5224]  show_slab_objects+0x94/0x3a8
> [  442.715219][ T5224]  total_objects_show+0x28/0x34
> [  442.719918][ T5224]  slab_attr_show+0x38/0x54
> [  442.724271][ T5224]  sysfs_kf_seq_show+0x198/0x2d4
> [  442.729056][ T5224]  kernfs_seq_show+0xa4/0xcc
> [  442.733494][ T5224]  seq_read+0x30c/0x8a8
> [  442.737498][ T5224]  kernfs_fop_read+0xa8/0x314
> [  442.742025][ T5224]  __vfs_read+0x88/0x20c
> [  442.746118][ T5224]  vfs_read+0xd8/0x10c
> [  442.750036][ T5224]  ksys_read+0xb0/0x120
> [  442.754042][ T5224]  __arm64_sys_read+0x54/0x88
> [  442.758569][ T5224]  el0_svc_handler+0x170/0x240
> [  442.763180][ T5224]  el0_svc+0x8/0xc
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 15:11   ` Michal Hocko
@ 2019-09-24 18:54     ` Qian Cai
  2019-09-25  7:02       ` David Hildenbrand
  2019-09-25 10:03       ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Qian Cai @ 2019-09-24 18:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> On Tue 24-09-19 11:03:21, Qian Cai wrote:
> [...]
> > While at it, it might be a good time to rethink the whole locking over
> > there, as right now reading files under /sys/kernel/slab/ could trigger
> > a possible deadlock anyway.
> > 
> 
> [...]
> > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > [  442.538768][ T5224]        el0_svc+0x8/0xc
> 
> I believe the lock is not really needed here. We do not deallocate the
> pgdat of a hot-removed node nor destroy the slab state, because existing
> slabs would prevent hotremove from continuing in the first place.
> 
> There are likely details to be checked of course, but the lock just seems
> bogus.

Check 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}"). It actually talks about races during
memory as well as cpu hotplug, so it might even be that the
cpu_hotplug_lock removal is problematic?


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 18:54     ` Qian Cai
@ 2019-09-25  7:02       ` David Hildenbrand
  2019-09-25 16:01         ` Qian Cai
  2019-09-25 10:03       ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2019-09-25  7:02 UTC (permalink / raw)
  To: Qian Cai, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 24.09.19 20:54, Qian Cai wrote:
> On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
>> On Tue 24-09-19 11:03:21, Qian Cai wrote:
>> [...]
>>> While at it, it might be a good time to rethink the whole locking over there, as
>>> it right now read files under /sys/kernel/slab/ could trigger a possible
>>> deadlock anyway.
>>>
>>
>> [...]
>>> [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
>>> [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
>>> [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
>>> [  442.469930][ T5224]        lock_acquire+0x31c/0x360
>>> [  442.474803][ T5224]        get_online_mems+0x54/0x150
>>> [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
>>> [  442.485072][ T5224]        total_objects_show+0x28/0x34
>>> [  442.490292][ T5224]        slab_attr_show+0x38/0x54
>>> [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
>>> [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
>>> [  442.505433][ T5224]        seq_read+0x30c/0x8a8
>>> [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
>>> [  442.515007][ T5224]        __vfs_read+0x88/0x20c
>>> [  442.519620][ T5224]        vfs_read+0xd8/0x10c
>>> [  442.524060][ T5224]        ksys_read+0xb0/0x120
>>> [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
>>> [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
>>> [  442.538768][ T5224]        el0_svc+0x8/0xc
>>
>> I believe the lock is not really needed here. We do not deallocated
>> pgdat of a hotremoved node nor destroy the slab state because an
>> existing slabs would prevent hotremove to continue in the first place.
>>
>> There are likely details to be checked of course but the lock just seems
>> bogus.
> 
> Check 03afc0e25f7f ("slab: get_online_mems for
> kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> problematic?
> 

Which removal are you referring to? get_online_mems() does not mess with
the cpu hotplug lock (and is therefore not affected by this patch).

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 18:54     ` Qian Cai
  2019-09-25  7:02       ` David Hildenbrand
@ 2019-09-25 10:03       ` Michal Hocko
  1 sibling, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2019-09-25 10:03 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue 24-09-19 14:54:04, Qian Cai wrote:
> On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > [...]
> > > While at it, it might be a good time to rethink the whole locking over there, as
> > > it right now read files under /sys/kernel/slab/ could trigger a possible
> > > deadlock anyway.
> > > 
> > 
> > [...]
> > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > 
> > I believe the lock is not really needed here. We do not deallocated
> > pgdat of a hotremoved node nor destroy the slab state because an
> > existing slabs would prevent hotremove to continue in the first place.
> > 
> > There are likely details to be checked of course but the lock just seems
> > bogus.
> 
> Check 03afc0e25f7f ("slab: get_online_mems for
> kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> problematic?

I have to refresh my memory there, but the changelog claims:
"To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths", and
show_slab_objects() doesn't fall into any of those categories.

Anyway this seems unrelated to the original thread so I would recommend
discussing in its own thread for clarity.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25  7:02       ` David Hildenbrand
@ 2019-09-25 16:01         ` Qian Cai
  2019-09-25 17:48           ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2019-09-25 16:01 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
> On 24.09.19 20:54, Qian Cai wrote:
> > On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > > [...]
> > > > While at it, it might be a good time to rethink the whole locking over there, as
> > > > it right now read files under /sys/kernel/slab/ could trigger a possible
> > > > deadlock anyway.
> > > > 
> > > 
> > > [...]
> > > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > > 
> > > I believe the lock is not really needed here. We do not deallocated
> > > pgdat of a hotremoved node nor destroy the slab state because an
> > > existing slabs would prevent hotremove to continue in the first place.
> > > 
> > > There are likely details to be checked of course but the lock just seems
> > > bogus.
> > 
> > Check 03afc0e25f7f ("slab: get_online_mems for
> > kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> > memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> > problematic?
> > 
> 
> Which removal are you referring to? get_online_mems() does not mess with
> the cpu hotplug lock (and therefore this patch).

The one in your patch. I suspect there might be races among the whole NUMA node
hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
hotplug: implement get/put_online_mems")

"kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
which can be allocated or destroyed on cpu/mem hotplug."

Both online_pages() and show_slab_objects() need to get a stable value of
cpu/node online mask.


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 16:01         ` Qian Cai
@ 2019-09-25 17:48           ` Michal Hocko
  2019-09-25 18:20             ` Qian Cai
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-09-25 17:48 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Wed 25-09-19 12:01:02, Qian Cai wrote:
> On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
> > On 24.09.19 20:54, Qian Cai wrote:
> > > On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > > > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > > > [...]
> > > > > While at it, it might be a good time to rethink the whole locking over there, as
> > > > > it right now read files under /sys/kernel/slab/ could trigger a possible
> > > > > deadlock anyway.
> > > > > 
> > > > 
> > > > [...]
> > > > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > > > 
> > > > I believe the lock is not really needed here. We do not deallocated
> > > > pgdat of a hotremoved node nor destroy the slab state because an
> > > > existing slabs would prevent hotremove to continue in the first place.
> > > > 
> > > > There are likely details to be checked of course but the lock just seems
> > > > bogus.
> > > 
> > > Check 03afc0e25f7f ("slab: get_online_mems for
> > > kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> > > memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> > > problematic?
> > > 
> > 
> > Which removal are you referring to? get_online_mems() does not mess with
> > the cpu hotplug lock (and therefore this patch).
> 
> The one in your patch. I suspect there might be races among the whole NUMA node
> hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
> hotplug: implement get/put_online_mems")
> 
> "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
> online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
> which can be allocated or destroyed on cpu/mem hotplug."

I still have to grasp that code, but if the slub allocator really needs
a stable cpu mask then it should be using the explicit cpu hotplug
locking rather than relying on a side effect of the memory hotplug
locking.

> Both online_pages() and show_slab_objects() need to get a stable value of
> cpu/node online mask.

Could you be more specific about why online_pages() needs a stable cpu
online mask? I do not think that show_slab_objects() is a real problem
because a potential race shouldn't be critical.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 17:48           ` Michal Hocko
@ 2019-09-25 18:20             ` Qian Cai
  2019-09-25 19:48               ` David Hildenbrand
  2019-09-26  7:26               ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Qian Cai @ 2019-09-25 18:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Wed, 2019-09-25 at 19:48 +0200, Michal Hocko wrote:
> On Wed 25-09-19 12:01:02, Qian Cai wrote:
> > On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
> > > On 24.09.19 20:54, Qian Cai wrote:
> > > > On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > > > > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > > > > [...]
> > > > > > While at it, it might be a good time to rethink the whole locking over there, as
> > > > > > it right now read files under /sys/kernel/slab/ could trigger a possible
> > > > > > deadlock anyway.
> > > > > > 
> > > > > 
> > > > > [...]
> > > > > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > > > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > > > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > > > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > > > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > > > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > > > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > > > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > > > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > > > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > > > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > > > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > > > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > > > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > > > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > > > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > > > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > > > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > > > > 
> > > > > I believe the lock is not really needed here. We do not deallocated
> > > > > pgdat of a hotremoved node nor destroy the slab state because an
> > > > > existing slabs would prevent hotremove to continue in the first place.
> > > > > 
> > > > > There are likely details to be checked of course but the lock just seems
> > > > > bogus.
> > > > 
> > > > Check 03afc0e25f7f ("slab: get_online_mems for
> > > > kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> > > > memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> > > > problematic?
> > > > 
> > > 
> > > Which removal are you referring to? get_online_mems() does not mess with
> > > the cpu hotplug lock (and therefore this patch).
> > 
> > The one in your patch. I suspect there might be races among the whole NUMA node
> > hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
> > hotplug: implement get/put_online_mems")
> > 
> > "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
> > online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
> > which can be allocated or destroyed on cpu/mem hotplug."
> 
> I still have to grasp that code but if the slub allocator really needs
> a stable cpu mask then it should be using the explicit cpu hotplug
> locking rather than rely on side effect of memory hotplug locking.
> 
> > Both online_pages() and show_slab_objects() need to get a stable value of
> > cpu/node online mask.
> 
> Could tou be more specific why online_pages need a stable cpu online
> mask? I do not think that show_slab_objects is a real problem because a
> potential race shouldn't be critical.

build_all_zonelists()
  __build_all_zonelists()
    for_each_online_cpu(cpu)


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 18:20             ` Qian Cai
@ 2019-09-25 19:48               ` David Hildenbrand
  2019-09-25 20:32                 ` Qian Cai
  2019-09-26  7:26               ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2019-09-25 19:48 UTC (permalink / raw)
  To: Qian Cai, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 25.09.19 20:20, Qian Cai wrote:
> On Wed, 2019-09-25 at 19:48 +0200, Michal Hocko wrote:
>> On Wed 25-09-19 12:01:02, Qian Cai wrote:
>>> On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
>>>> On 24.09.19 20:54, Qian Cai wrote:
>>>>> On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
>>>>>> On Tue 24-09-19 11:03:21, Qian Cai wrote:
>>>>>> [...]
>>>>>>> While at it, it might be a good time to rethink the whole locking over there, as
>>>>>>> it right now read files under /sys/kernel/slab/ could trigger a possible
>>>>>>> deadlock anyway.
>>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>> [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
>>>>>>> [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
>>>>>>> [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
>>>>>>> [  442.469930][ T5224]        lock_acquire+0x31c/0x360
>>>>>>> [  442.474803][ T5224]        get_online_mems+0x54/0x150
>>>>>>> [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
>>>>>>> [  442.485072][ T5224]        total_objects_show+0x28/0x34
>>>>>>> [  442.490292][ T5224]        slab_attr_show+0x38/0x54
>>>>>>> [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
>>>>>>> [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
>>>>>>> [  442.505433][ T5224]        seq_read+0x30c/0x8a8
>>>>>>> [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
>>>>>>> [  442.515007][ T5224]        __vfs_read+0x88/0x20c
>>>>>>> [  442.519620][ T5224]        vfs_read+0xd8/0x10c
>>>>>>> [  442.524060][ T5224]        ksys_read+0xb0/0x120
>>>>>>> [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
>>>>>>> [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
>>>>>>> [  442.538768][ T5224]        el0_svc+0x8/0xc
>>>>>>
>>>>>> I believe the lock is not really needed here. We do not deallocated
>>>>>> pgdat of a hotremoved node nor destroy the slab state because an
>>>>>> existing slabs would prevent hotremove to continue in the first place.
>>>>>>
>>>>>> There are likely details to be checked of course but the lock just seems
>>>>>> bogus.
>>>>>
>>>>> Check 03afc0e25f7f ("slab: get_online_mems for
>>>>> kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
>>>>> memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
>>>>> problematic?
>>>>>
>>>>
>>>> Which removal are you referring to? get_online_mems() does not mess with
>>>> the cpu hotplug lock (and therefore this patch).
>>>
>>> The one in your patch. I suspect there might be races among the whole NUMA node
>>> hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
>>> hotplug: implement get/put_online_mems")
>>>
>>> "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
>>> online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
>>> which can be allocated or destroyed on cpu/mem hotplug."
>>
>> I still have to grasp that code but if the slub allocator really needs
>> a stable cpu mask then it should be using the explicit cpu hotplug
>> locking rather than rely on side effect of memory hotplug locking.
>>
>>> Both online_pages() and show_slab_objects() need to get a stable value of
>>> cpu/node online mask.
>>
>> Could tou be more specific why online_pages need a stable cpu online
>> mask? I do not think that show_slab_objects is a real problem because a
>> potential race shouldn't be critical.
> 
> build_all_zonelists()
>   __build_all_zonelists()
>     for_each_online_cpu(cpu)
> 

Two things:

a) We currently always hold the device hotplug lock when onlining memory
and when onlining cpus (for CPUs at least via user space - we would have
to double-check other call paths). So theoretically, that should already
guard us against something like that.

b)

commit 11cd8638c37f6c400cc472cc52b6eccb505aba6e
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Sep 6 16:20:34 2017 -0700

    mm, page_alloc: remove stop_machine from build_all_zonelists

Tells me:

"Updates of the zonelists happen very seldom, basically only when a zone
 becomes populated during memory online or when it loses all the memory
 during offline.  A racing iteration over zonelists could either miss a
 zone or try to work on one zone twice.  Both of these are something we
 can live with occasionally because there will always be at least one
 zone visible so we are not likely to fail allocation too easily for
 example."

Sounds like even if there were a race, we could live with it, if I am not
getting that totally wrong.

-- 

Thanks,

David / dhildenb


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 19:48               ` David Hildenbrand
@ 2019-09-25 20:32                 ` Qian Cai
  2019-09-26  7:26                   ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2019-09-25 20:32 UTC (permalink / raw)
  To: David Hildenbrand, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Wed, 2019-09-25 at 21:48 +0200, David Hildenbrand wrote:
> On 25.09.19 20:20, Qian Cai wrote:
> > On Wed, 2019-09-25 at 19:48 +0200, Michal Hocko wrote:
> > > On Wed 25-09-19 12:01:02, Qian Cai wrote:
> > > > On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
> > > > > On 24.09.19 20:54, Qian Cai wrote:
> > > > > > On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > > > > > > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > > > > > > [...]
> > > > > > > > While at it, it might be a good time to rethink the whole locking over there, as
> > > > > > > > it right now read files under /sys/kernel/slab/ could trigger a possible
> > > > > > > > deadlock anyway.
> > > > > > > > 
> > > > > > > 
> > > > > > > [...]
> > > > > > > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > > > > > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > > > > > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > > > > > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > > > > > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > > > > > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > > > > > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > > > > > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > > > > > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > > > > > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > > > > > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > > > > > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > > > > > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > > > > > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > > > > > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > > > > > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > > > > > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > > > > > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > > > > > > 
> > > > > > > I believe the lock is not really needed here. We do not deallocated
> > > > > > > pgdat of a hotremoved node nor destroy the slab state because an
> > > > > > > existing slabs would prevent hotremove to continue in the first place.
> > > > > > > 
> > > > > > > There are likely details to be checked of course but the lock just seems
> > > > > > > bogus.
> > > > > > 
> > > > > > Check 03afc0e25f7f ("slab: get_online_mems for
> > > > > > kmem_cache_{create,destroy,shrink}"). It actually talk about the races during
> > > > > > memory as well cpu hotplug, so it might even that cpu_hotplug_lock removal is
> > > > > > problematic?
> > > > > > 
> > > > > 
> > > > > Which removal are you referring to? get_online_mems() does not mess with
> > > > > the cpu hotplug lock (and therefore this patch).
> > > > 
> > > > The one in your patch. I suspect there might be races among the whole NUMA node
> > > > hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
> > > > hotplug: implement get/put_online_mems")
> > > > 
> > > > "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
> > > > online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
> > > > which can be allocated or destroyed on cpu/mem hotplug."
> > > 
> > > I still have to grasp that code but if the slub allocator really needs
> > > a stable cpu mask then it should be using the explicit cpu hotplug
> > > locking rather than rely on side effect of memory hotplug locking.
> > > 
> > > > Both online_pages() and show_slab_objects() need to get a stable value of
> > > > cpu/node online mask.
> > > 
> > > Could tou be more specific why online_pages need a stable cpu online
> > > mask? I do not think that show_slab_objects is a real problem because a
> > > potential race shouldn't be critical.
> > 
> > build_all_zonelists()
> >   __build_all_zonelists()
> >     for_each_online_cpu(cpu)
> > 
> 
> Two things:
> 
> a) We currently always hold the device hotplug lock when onlining memory
> and when onlining cpus (for CPUs at least via user space - we would have
> to double check other call paths). So theoretically, that should guard
> us from something like that already.
> 
> b)
> 
> commit 11cd8638c37f6c400cc472cc52b6eccb505aba6e
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Wed Sep 6 16:20:34 2017 -0700
> 
>     mm, page_alloc: remove stop_machine from build_all_zonelists
> 
> Tells me:
> 
> "Updates of the zonelists happen very seldom, basically only when a zone
>  becomes populated during memory online or when it loses all the memory
>  during offline.  A racing iteration over zonelists could either miss a
>  zone or try to work on one zone twice.  Both of these are something we
>  can live with occasionally because there will always be at least one
>  zone visible so we are not likely to fail allocation too easily for
>  example."
> 
> Sounds like if there would be a race, we could live with it if I am not
> getting that totally wrong.
> 

What's the problem you are trying to solve? Why is it more important to
live with races than to keep the code correct?


* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 20:32                 ` Qian Cai
@ 2019-09-26  7:26                   ` David Hildenbrand
  2019-09-26  7:38                     ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2019-09-26  7:26 UTC (permalink / raw)
  To: Qian Cai, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 25.09.19 22:32, Qian Cai wrote:
> On Wed, 2019-09-25 at 21:48 +0200, David Hildenbrand wrote:
>> On 25.09.19 20:20, Qian Cai wrote:
>>> On Wed, 2019-09-25 at 19:48 +0200, Michal Hocko wrote:
>>>> On Wed 25-09-19 12:01:02, Qian Cai wrote:
>>>>> On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
>>>>>> On 24.09.19 20:54, Qian Cai wrote:
>>>>>>> On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
>>>>>>>> On Tue 24-09-19 11:03:21, Qian Cai wrote:
>>>>>>>> [...]
>>>>>>>>> While at it, it might be a good time to rethink the whole locking over there, as
>>>>>>>>> right now reading files under /sys/kernel/slab/ could trigger a possible
>>>>>>>>> deadlock anyway.
>>>>>>>>>
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>> [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
>>>>>>>>> [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
>>>>>>>>> [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
>>>>>>>>> [  442.469930][ T5224]        lock_acquire+0x31c/0x360
>>>>>>>>> [  442.474803][ T5224]        get_online_mems+0x54/0x150
>>>>>>>>> [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
>>>>>>>>> [  442.485072][ T5224]        total_objects_show+0x28/0x34
>>>>>>>>> [  442.490292][ T5224]        slab_attr_show+0x38/0x54
>>>>>>>>> [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
>>>>>>>>> [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
>>>>>>>>> [  442.505433][ T5224]        seq_read+0x30c/0x8a8
>>>>>>>>> [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
>>>>>>>>> [  442.515007][ T5224]        __vfs_read+0x88/0x20c
>>>>>>>>> [  442.519620][ T5224]        vfs_read+0xd8/0x10c
>>>>>>>>> [  442.524060][ T5224]        ksys_read+0xb0/0x120
>>>>>>>>> [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
>>>>>>>>> [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
>>>>>>>>> [  442.538768][ T5224]        el0_svc+0x8/0xc
>>>>>>>>
>>>>>>>> I believe the lock is not really needed here. We do not deallocate the
>>>>>>>> pgdat of a hot-removed node nor destroy the slab state because
>>>>>>>> existing slabs would prevent hotremove from continuing in the first place.
>>>>>>>>
>>>>>>>> There are likely details to be checked of course but the lock just seems
>>>>>>>> bogus.
>>>>>>>
>>>>>>> Check 03afc0e25f7f ("slab: get_online_mems for
>>>>>>> kmem_cache_{create,destroy,shrink}"). It actually talks about the races during
>>>>>>> memory as well as cpu hotplug, so it might even be that the cpu_hotplug_lock removal is
>>>>>>> problematic?
>>>>>>>
>>>>>>
>>>>>> Which removal are you referring to? get_online_mems() does not mess with
>>>>>> the cpu hotplug lock (and therefore this patch).
>>>>>
>>>>> The one in your patch. I suspect there might be races among the whole NUMA node
>>>>> hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
>>>>> hotplug: implement get/put_online_mems")
>>>>>
>>>>> "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
>>>>> online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
>>>>> which can be allocated or destroyed on cpu/mem hotplug."
>>>>
>>>> I still have to grasp that code but if the slub allocator really needs
>>>> a stable cpu mask then it should be using the explicit cpu hotplug
>>>> locking rather than rely on side effect of memory hotplug locking.
>>>>
>>>>> Both online_pages() and show_slab_objects() need to get a stable value of
>>>>> cpu/node online mask.
>>>>
>>>> Could you be more specific why online_pages() needs a stable cpu online
>>>> mask? I do not think that show_slab_objects is a real problem because a
>>>> potential race shouldn't be critical.
>>>
>>> build_all_zonelists()
>>>   __build_all_zonelists()
>>>     for_each_online_cpu(cpu)
>>>
>>
>> Two things:
>>
>> a) We currently always hold the device hotplug lock when onlining memory
>> and when onlining cpus (for CPUs at least via user space - we would have
>> to double check other call paths). So theoretically, that should guard
>> us from something like that already.
>>
>> b)
>>
>> commit 11cd8638c37f6c400cc472cc52b6eccb505aba6e
>> Author: Michal Hocko <mhocko@suse.com>
>> Date:   Wed Sep 6 16:20:34 2017 -0700
>>
>>     mm, page_alloc: remove stop_machine from build_all_zonelists
>>
>> Tells me:
>>
>> "Updates of the zonelists happen very seldom, basically only when a zone
>>  becomes populated during memory online or when it loses all the memory
>>  during offline.  A racing iteration over zonelists could either miss a
>>  zone or try to work on one zone twice.  Both of these are something we
>>  can live with occasionally because there will always be at least one
>>  zone visible so we are not likely to fail allocation too easily for
>>  example."
>>
>> Sounds like, if there were a race, we could live with it, if I am not
>> getting that totally wrong.
>>
> 
> What's the problem you are trying to solve? Why is it more important to live
> with races than to keep the code correct?

I am trying to understand, fix, cleanup and document the locking mess we
have in the memory hotplug code.

The cpu hotplug lock is one of those things nobody really has a clue why
it is still needed. It imposes a locking order (e.g., it has to be taken
before the memory hotplug lock), and we are taking the cpu hotplug lock
even when we do add_memory()/remove_memory(), not only when onlining pages.

So if we agree that we need it here, I'll add documentation - especially
to build_all_zonelists(). If we agree it can go, I'll add documentation
why we don't need it in build_all_zonelists().

I am not yet convinced that we need the lock here. As I said, we do hold
the device_hotplug_lock which all sysfs
/sys/devices/system/whatever/online modifications take, and Michal even
documented why we can live with very, very rare races (again, if they
are possible at all).

I'd like to hear what Michal thinks. If we do want the cpu hotplug lock,
we can at least restrict it to the call paths (e.g., online_pages())
where the lock is really needed and document that.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-25 18:20             ` Qian Cai
  2019-09-25 19:48               ` David Hildenbrand
@ 2019-09-26  7:26               ` Michal Hocko
  2019-09-26 11:19                 ` Qian Cai
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-09-26  7:26 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Wed 25-09-19 14:20:59, Qian Cai wrote:
> On Wed, 2019-09-25 at 19:48 +0200, Michal Hocko wrote:
> > On Wed 25-09-19 12:01:02, Qian Cai wrote:
> > > On Wed, 2019-09-25 at 09:02 +0200, David Hildenbrand wrote:
> > > > On 24.09.19 20:54, Qian Cai wrote:
> > > > > On Tue, 2019-09-24 at 17:11 +0200, Michal Hocko wrote:
> > > > > > On Tue 24-09-19 11:03:21, Qian Cai wrote:
> > > > > > [...]
> > > > > > > While at it, it might be a good time to rethink the whole locking over there, as
> > > > > > > right now reading files under /sys/kernel/slab/ could trigger a possible
> > > > > > > deadlock anyway.
> > > > > > > 
> > > > > > 
> > > > > > [...]
> > > > > > > [  442.452090][ T5224] -> #0 (mem_hotplug_lock.rw_sem){++++}:
> > > > > > > [  442.459748][ T5224]        validate_chain+0xd10/0x2bcc
> > > > > > > [  442.464883][ T5224]        __lock_acquire+0x7f4/0xb8c
> > > > > > > [  442.469930][ T5224]        lock_acquire+0x31c/0x360
> > > > > > > [  442.474803][ T5224]        get_online_mems+0x54/0x150
> > > > > > > [  442.479850][ T5224]        show_slab_objects+0x94/0x3a8
> > > > > > > [  442.485072][ T5224]        total_objects_show+0x28/0x34
> > > > > > > [  442.490292][ T5224]        slab_attr_show+0x38/0x54
> > > > > > > [  442.495166][ T5224]        sysfs_kf_seq_show+0x198/0x2d4
> > > > > > > [  442.500473][ T5224]        kernfs_seq_show+0xa4/0xcc
> > > > > > > [  442.505433][ T5224]        seq_read+0x30c/0x8a8
> > > > > > > [  442.509958][ T5224]        kernfs_fop_read+0xa8/0x314
> > > > > > > [  442.515007][ T5224]        __vfs_read+0x88/0x20c
> > > > > > > [  442.519620][ T5224]        vfs_read+0xd8/0x10c
> > > > > > > [  442.524060][ T5224]        ksys_read+0xb0/0x120
> > > > > > > [  442.528586][ T5224]        __arm64_sys_read+0x54/0x88
> > > > > > > [  442.533634][ T5224]        el0_svc_handler+0x170/0x240
> > > > > > > [  442.538768][ T5224]        el0_svc+0x8/0xc
> > > > > > 
> > > > > > I believe the lock is not really needed here. We do not deallocate the
> > > > > > pgdat of a hot-removed node nor destroy the slab state because
> > > > > > existing slabs would prevent hotremove from continuing in the first place.
> > > > > > 
> > > > > > There are likely details to be checked of course but the lock just seems
> > > > > > bogus.
> > > > > 
> > > > > Check 03afc0e25f7f ("slab: get_online_mems for
> > > > > kmem_cache_{create,destroy,shrink}"). It actually talks about the races during
> > > > > memory as well as cpu hotplug, so it might even be that the cpu_hotplug_lock removal is
> > > > > problematic?
> > > > > 
> > > > 
> > > > Which removal are you referring to? get_online_mems() does not mess with
> > > > the cpu hotplug lock (and therefore this patch).
> > > 
> > > The one in your patch. I suspect there might be races among the whole NUMA node
> > > hotplug, kmem_cache_create, and show_slab_objects(). See bfc8c90139eb ("mem-
> > > hotplug: implement get/put_online_mems")
> > > 
> > > "kmem_cache_{create,destroy,shrink} need to get a stable value of cpu/node
> > > online mask, because they init/destroy/access per-cpu/node kmem_cache parts,
> > > which can be allocated or destroyed on cpu/mem hotplug."
> > 
> > I still have to grasp that code but if the slub allocator really needs
> > a stable cpu mask then it should be using the explicit cpu hotplug
> > locking rather than rely on side effect of memory hotplug locking.
> > 
> > > Both online_pages() and show_slab_objects() need to get a stable value of
> > > cpu/node online mask.
> > 
> > Could you be more specific why online_pages() needs a stable cpu online
> > mask? I do not think that show_slab_objects is a real problem because a
> > potential race shouldn't be critical.
> 
> build_all_zonelists()
>   __build_all_zonelists()
>     for_each_online_cpu(cpu)

OK, this is using for_each_online_cpu but why is this a problem? Have
you checked what the code actually does? Let's say that online_pages is
racing with cpu hotplug. A new CPU appears/disappears from the online
mask while we are iterating it, right? Let's start with cpu offlining
case. We have two choices, either the cpu is still visible and we update
its local node configuration even though it will disappear shortly which
is ok because we are not touching any data that disappears (it's all
per-cpu). Case when the cpu is no longer there is not really
interesting. For the online case we might miss a cpu but that should be
tolerable because that is not any different from triggering the
online independently of the memory hotplug. So there has to be a hook
from that code path as well. If there is none then this is buggy
irrespective of the locking.

Makes sense?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-26  7:26                   ` David Hildenbrand
@ 2019-09-26  7:38                     ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2019-09-26  7:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qian Cai, linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Thu 26-09-19 09:26:13, David Hildenbrand wrote:
[...]
> I'd like to hear what Michal thinks. If we do want the cpu hotplug lock,
> we can at least restrict it to the call paths (e.g., online_pages())
> where the lock is really needed and document that.

Completely agreed. Conflating cpu and memory hotplug locks was a bad
decision. If there are places which need both they should better use
both lock explicitly.

Now, the reality might turn out to be more complicated due to lock nesting,
but hiding the cpu lock inside the mem hotplug lock is just not fixing that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-26  7:26               ` Michal Hocko
@ 2019-09-26 11:19                 ` Qian Cai
  2019-09-26 11:52                   ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2019-09-26 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner



> On Sep 26, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> OK, this is using for_each_online_cpu but why is this a problem? Have
> you checked what the code actually does? Let's say that online_pages is
> racing with cpu hotplug. A new CPU appears/disappears from the online
> mask while we are iterating it, right? Let's start with cpu offlining
> case. We have two choices, either the cpu is still visible and we update
> its local node configuration even though it will disappear shortly which
> is ok because we are not touching any data that disappears (it's all
> per-cpu). Case when the cpu is no longer there is not really
> interesting. For the online case we might miss a cpu but that should be
> tolerable because that is not any different from triggering the
> online independently of the memory hotplug. So there has to be a hook
> from that code path as well. If there is none then this is buggy
> irrespective of the locking.
> 
> Makes sense?

This sounds to me like it requires lots of auditing and testing. Also, someone who is more
familiar with CPU hotplug should review this patch. Personally, I am no fan of
operating on an incorrect CPU mask to begin with; things could go wrong really
quickly...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-26 11:19                 ` Qian Cai
@ 2019-09-26 11:52                   ` Michal Hocko
  2019-09-26 13:02                     ` Qian Cai
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-09-26 11:52 UTC (permalink / raw)
  To: Qian Cai
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Thu 26-09-19 07:19:27, Qian Cai wrote:
> 
> 
> > On Sep 26, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > OK, this is using for_each_online_cpu but why is this a problem? Have
> > you checked what the code actually does? Let's say that online_pages is
> > racing with cpu hotplug. A new CPU appears/disappears from the online
> > mask while we are iterating it, right? Let's start with cpu offlining
> > case. We have two choices, either the cpu is still visible and we update
> > its local node configuration even though it will disappear shortly which
> > is ok because we are not touching any data that disappears (it's all
> > per-cpu). Case when the cpu is no longer there is not really
> > interesting. For the online case we might miss a cpu but that should be
> > tolerable because that is not any different from triggering the
> > online independently of the memory hotplug. So there has to be a hook
> > from that code path as well. If there is none then this is buggy
> > irrespective of the locking.
> > 
> > Makes sense?
> 
> This sounds to me like it requires lots of auditing and testing. Also, someone who is more
> familiar with CPU hotplug should review this patch.

Thomas is on the CC list.

> Personally, I am no fan of
> operating on an incorrect CPU mask to begin with; things could go wrong really
> quickly...

Do you have any specific arguments? Just think of cpu and memory
hotplugs being independent operations. There is nothing really
inherently binding them together. If the cpu_online_mask really needs a
special treatment here, then I would like to hear about that. Handwaving
doesn't really help us.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-26 11:52                   ` Michal Hocko
@ 2019-09-26 13:02                     ` Qian Cai
  2019-09-26 13:14                       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2019-09-26 13:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Oscar Salvador, Pavel Tatashin, Dan Williams, Thomas Gleixner

On Thu, 2019-09-26 at 13:52 +0200, Michal Hocko wrote:
> On Thu 26-09-19 07:19:27, Qian Cai wrote:
> > 
> > 
> > > On Sep 26, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > OK, this is using for_each_online_cpu but why is this a problem? Have
> > > you checked what the code actually does? Let's say that online_pages is
> > > racing with cpu hotplug. A new CPU appears/disappears from the online
> > > mask while we are iterating it, right? Let's start with cpu offlining
> > > case. We have two choices, either the cpu is still visible and we update
> > > its local node configuration even though it will disappear shortly which
> > > is ok because we are not touching any data that disappears (it's all
> > > per-cpu). Case when the cpu is no longer there is not really
> > > interesting. For the online case we might miss a cpu but that should be
> > > tolerable because that is not any different from triggering the
> > > online independently of the memory hotplug. So there has to be a hook
> > > from that code path as well. If there is none then this is buggy
> > > irrespective of the locking.
> > > 
> > > Makes sense?
> > 
> > This sounds to me like it requires lots of auditing and testing. Also, someone who is more
> > familiar with CPU hotplug should review this patch.
> 
> Thomas is on the CC list.
> 
> > Personally, I am no fan of
> > operating on an incorrect CPU mask to begin with; things could go wrong really
> > quickly...
> 
> Do you have any specific arguments? Just think of cpu and memory
> hotplugs being independent operations. There is nothing really
> inherently binding them together. If the cpu_online_mask really needs a
> special treatment here, then I would like to hear about that. Handwaving
> doesn't really help us.

That is why I said it needs CPU hotplug experts to confirm things, including
whether CPU masks tolerate this kind of "abuse", or an in-depth analysis of each
call site that accesses CPU masks in both online_pages() and offline_pages(), as
well as, ideally, more testing data in those areas.

However, many kernel commits were merged with the expectation that people are
going to deal with the aftermath, so I am not going to insist.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-26 13:02                     ` Qian Cai
@ 2019-09-26 13:14                       ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-09-26 13:14 UTC (permalink / raw)
  To: Qian Cai, Michal Hocko
  Cc: linux-kernel, linux-mm, Andrew Morton, Oscar Salvador,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 26.09.19 15:02, Qian Cai wrote:
> On Thu, 2019-09-26 at 13:52 +0200, Michal Hocko wrote:
>> On Thu 26-09-19 07:19:27, Qian Cai wrote:
>>>
>>>
>>>> On Sep 26, 2019, at 3:26 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>
>>>> OK, this is using for_each_online_cpu but why is this a problem? Have
>>>> you checked what the code actually does? Let's say that online_pages is
>>>> racing with cpu hotplug. A new CPU appears/disappears from the online
>>>> mask while we are iterating it, right? Let's start with cpu offlining
>>>> case. We have two choices, either the cpu is still visible and we update
>>>> its local node configuration even though it will disappear shortly which
>>>> is ok because we are not touching any data that disappears (it's all
>>>> per-cpu). Case when the cpu is no longer there is not really
>>>> interesting. For the online case we might miss a cpu but that should be
>>>> tolerable because that is not any different from triggering the
>>>> online independently of the memory hotplug. So there has to be a hook
>>>> from that code path as well. If there is none then this is buggy
>>>> irrespective of the locking.
>>>>
>>>> Makes sense?
>>>
>>> This sounds to me like it requires lots of auditing and testing. Also, someone who is more
>>> familiar with CPU hotplug should review this patch.
>>
>> Thomas is on the CC list.
>>
>>> Personally, I am no fan of
>>> operating on an incorrect CPU mask to begin with; things could go wrong really
>>> quickly...
>>
>> Do you have any specific arguments? Just think of cpu and memory
>> hotplugs being independent operations. There is nothing really
>> inherently binding them together. If the cpu_online_mask really needs a
>> special treatment here, then I would like to hear about that. Handwaving
>> doesn't really help us.
> 
> That is why I said it needs CPU hotplug experts to confirm things, including
> whether CPU masks tolerate this kind of "abuse", or an in-depth analysis of each
> call site that accesses CPU masks in both online_pages() and offline_pages(), as
> well as, ideally, more testing data in those areas.
> 
> However, many kernel commits were merged with the expectation that people are
> going to deal with the aftermath, so I am not going to insist.
> 

I am going to add documentation to build_all_zonelists() regarding
locking and the details we discussed.

Of course, I'll do more testing, and as Michal suggested, we should let
this "mature" in linux-next for some time.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-09-24 14:36 [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock David Hildenbrand
  2019-09-24 14:48 ` Michal Hocko
  2019-09-24 15:03 ` Qian Cai
@ 2019-10-02 21:37 ` Qian Cai
  2019-10-04  7:42   ` David Hildenbrand
  2 siblings, 1 reply; 23+ messages in thread
From: Qian Cai @ 2019-10-02 21:37 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On Tue, 2019-09-24 at 16:36 +0200, David Hildenbrand wrote:
> Since commit 3f906ba23689 ("mm/memory-hotplug: switch locking to a percpu
> rwsem") we do a cpus_read_lock() in mem_hotplug_begin(). This was
> introduced to fix a potential deadlock between get_online_mems() and
> get_online_cpus() - the memory and cpu hotplug lock. The root issue was
> that build_all_zonelists() -> stop_machine() required the cpu hotplug lock:
>     The reason is that memory hotplug takes the memory hotplug lock and
>     then calls stop_machine() which calls get_online_cpus().  That's the
>     reverse lock order to get_online_cpus(); get_online_mems(); in
>     mm/slub_common.c
> 
> So memory hotplug never really required any cpu lock itself, only
> stop_machine() and lru_add_drain_all() required it. Back then,
> stop_machine_cpuslocked() and lru_add_drain_all_cpuslocked() were used
> as the cpu hotplug lock was now obtained in the caller.
> 
> Since commit 11cd8638c37f ("mm, page_alloc: remove stop_machine from build
> all_zonelists"), the stop_machine_cpuslocked() call is gone.
> build_all_zonelists() no longer requires the cpu lock and no
> longer makes use of stop_machine().
> 
> Since commit 9852a7212324 ("mm: drop hotplug lock from
> lru_add_drain_all()"), lru_add_drain_all() "Doesn't need any cpu hotplug
> locking because we do rely on per-cpu kworkers being shut down before our
> page_alloc_cpu_dead callback is executed on the offlined cpu.". The
> lru_add_drain_all_cpuslocked() variant was removed.
> 
> So there is nothing left that requires the cpu hotplug lock. The memory
> hotplug lock and the device hotplug lock are sufficient.

Actually, powerpc does,

arch_add_memory()
  resize_hpt_for_hotplug()
    pseries_lpar_resize_hpt()
      stop_machine_cpuslocked()

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> 
> RFC -> v1:
> - Reword and add more details why the cpu hotplug lock was needed here
>   in the first place, and why we no longer require it.
> 
> ---
>  mm/memory_hotplug.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c3e9aed6023f..5fa30f3010e1 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -88,14 +88,12 @@ __setup("memhp_default_state=", setup_memhp_default_state);
>  
>  void mem_hotplug_begin(void)
>  {
> -	cpus_read_lock();
>  	percpu_down_write(&mem_hotplug_lock);
>  }
>  
>  void mem_hotplug_done(void)
>  {
>  	percpu_up_write(&mem_hotplug_lock);
> -	cpus_read_unlock();
>  }
>  
>  u64 max_mem_size = U64_MAX;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock
  2019-10-02 21:37 ` Qian Cai
@ 2019-10-04  7:42   ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-10-04  7:42 UTC (permalink / raw)
  To: Qian Cai, linux-kernel
  Cc: linux-mm, Andrew Morton, Oscar Salvador, Michal Hocko,
	Pavel Tatashin, Dan Williams, Thomas Gleixner

On 02.10.19 23:37, Qian Cai wrote:
> On Tue, 2019-09-24 at 16:36 +0200, David Hildenbrand wrote:
>> Since commit 3f906ba23689 ("mm/memory-hotplug: switch locking to a percpu
>> rwsem") we do a cpus_read_lock() in mem_hotplug_begin(). This was
>> introduced to fix a potential deadlock between get_online_mems() and
>> get_online_cpus() - the memory and cpu hotplug lock. The root issue was
>> that build_all_zonelists() -> stop_machine() required the cpu hotplug lock:
>>     The reason is that memory hotplug takes the memory hotplug lock and
>>     then calls stop_machine() which calls get_online_cpus().  That's the
>>     reverse lock order to get_online_cpus(); get_online_mems(); in
>>     mm/slub_common.c
>>
>> So memory hotplug never really required any cpu lock itself, only
>> stop_machine() and lru_add_drain_all() required it. Back then,
>> stop_machine_cpuslocked() and lru_add_drain_all_cpuslocked() were used
>> as the cpu hotplug lock was now obtained in the caller.
>>
>> Since commit 11cd8638c37f ("mm, page_alloc: remove stop_machine from build
>> all_zonelists"), the stop_machine_cpuslocked() call is gone.
>> build_all_zonelists() no longer requires the cpu lock and no
>> longer makes use of stop_machine().
>>
>> Since commit 9852a7212324 ("mm: drop hotplug lock from
>> lru_add_drain_all()"), lru_add_drain_all() "Doesn't need any cpu hotplug
>> locking because we do rely on per-cpu kworkers being shut down before our
>> page_alloc_cpu_dead callback is executed on the offlined cpu.". The
>> lru_add_drain_all_cpuslocked() variant was removed.
>>
>> So there is nothing left that requires the cpu hotplug lock. The memory
>> hotplug lock and the device hotplug lock are sufficient.
> 
> Actually, powerpc does,
> 
> arch_add_memory()
>   resize_hpt_for_hotplug()
>     pseries_lpar_resize_hpt()
>       stop_machine_cpuslocked()
> 

Thanks for that observation. This will need some further thought.

Another proof that locking is messed up :)

Maybe we should start decoupling locking of the memory
onlining/offlining path (e.g., get_online_mems()) from the memory
adding/removing path (e.g., later something like get_present_mems()).
Then we can push down the cpu hotplug lock to the PPC path and use
stop_machine() directly.

Time to document properly which lock protects what :)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-10-04  7:42 UTC | newest]

Thread overview: 23+ messages
2019-09-24 14:36 [PATCH v1] mm/memory_hotplug: Don't take the cpu_hotplug_lock David Hildenbrand
2019-09-24 14:48 ` Michal Hocko
2019-09-24 15:01   ` David Hildenbrand
2019-09-24 15:03 ` Qian Cai
2019-09-24 15:11   ` Michal Hocko
2019-09-24 18:54     ` Qian Cai
2019-09-25  7:02       ` David Hildenbrand
2019-09-25 16:01         ` Qian Cai
2019-09-25 17:48           ` Michal Hocko
2019-09-25 18:20             ` Qian Cai
2019-09-25 19:48               ` David Hildenbrand
2019-09-25 20:32                 ` Qian Cai
2019-09-26  7:26                   ` David Hildenbrand
2019-09-26  7:38                     ` Michal Hocko
2019-09-26  7:26               ` Michal Hocko
2019-09-26 11:19                 ` Qian Cai
2019-09-26 11:52                   ` Michal Hocko
2019-09-26 13:02                     ` Qian Cai
2019-09-26 13:14                       ` David Hildenbrand
2019-09-25 10:03       ` Michal Hocko
2019-09-24 15:23   ` David Hildenbrand
2019-10-02 21:37 ` Qian Cai
2019-10-04  7:42   ` David Hildenbrand
