linux-kernel.vger.kernel.org archive mirror
From: Qian Cai <quic_qiancai@quicinc.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Nicolas Saenz Julienne <nsaenzju@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Michal Hocko <mhocko@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>, <kafai@fb.com>,
	<kpsingh@kernel.org>
Subject: Re: [PATCH 0/6] Drain remote per-cpu directly v3
Date: Thu, 19 May 2022 09:29:45 -0400	[thread overview]
Message-ID: <YoZGSd6yQL3EP8tk@qian> (raw)
In-Reply-To: <20220518171503.GQ1790663@paulmck-ThinkPad-P17-Gen-1>

On Wed, May 18, 2022 at 10:15:03AM -0700, Paul E. McKenney wrote:
> So does this python script somehow change the tracing state?  (It does
> not look to me like it does, but I could easily be missing something.)

No, I don't think so either. It pretty much just offlines memory sections
one at a time.
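
For reference, what it does is roughly the following (a minimal sketch, not
the exact script; block selection and error handling are simplified). Writing
0 to the per-block "online" file is what shows up as online_store() in the
trace further down:

  #!/usr/bin/env python3
  # Sketch: offline removable memory blocks one at a time by writing 0
  # to /sys/devices/system/memory/memoryN/online.
  import glob

  for block in sorted(glob.glob('/sys/devices/system/memory/memory[0-9]*')):
      try:
          with open(block + '/removable') as f:
              if f.read().strip() != '1':
                  continue
          with open(block + '/online', 'w') as f:
              f.write('0')
      except OSError:
          # Block already offline, not removable, or offlining failed.
          pass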

> Either way, is there something else waiting for these RCU flavors?
> (There should not be.)  Nevertheless, if so, there should be
> a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
> synchronize_rcu_tasks_trace() on some other blocked task's stack
> somewhere.

There are only three blocked tasks when this happens. kmemleak_scan() is
just a victim waiting for the lock taken by the stuck
offline_pages()->synchronize_rcu() task.

 task:kmemleak        state:D stack:25824 pid: 1033 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  schedule
  percpu_rwsem_wait
  __percpu_down_read
  percpu_down_read.constprop.0
  get_online_mems
  kmemleak_scan
  kmemleak_scan_thread
  kthread
  ret_from_fork

 task:cppc_fie        state:D stack:23472 pid: 1848 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  lockdep_recursion

 task:tee             state:D stack:24816 pid:16733 ppid: 16732 flags:0x0000020c
 Call trace:
  __switch_to
  __schedule
  schedule
  schedule_timeout
  __wait_for_common
  wait_for_completion
  __wait_rcu_gp
  synchronize_rcu
  lru_cache_disable
  __alloc_contig_migrate_range
  isolate_single_pageblock
  start_isolate_page_range
  offline_pages
  memory_subsys_offline
  device_offline
  online_store
  dev_attr_store
  sysfs_kf_write
  kernfs_fop_write_iter
  new_sync_write
  vfs_write
  ksys_write
  __arm64_sys_write
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 
> Or maybe something sleeps waiting for an RCU Tasks * callback to
> be invoked.  In that case (and in the above case, for that matter),
> at least one of these pointers would be non-NULL on some CPU:
> 
> 1.	rcu_tasks__percpu.cblist.head
> 2.	rcu_tasks_rude__percpu.cblist.head
> 3.	rcu_tasks_trace__percpu.cblist.head
> 
> The ->func field of the pointed-to structure contains a pointer to
> the callback function, which will help work out what is going on.
> (Most likely a wakeup being lost or not provided.)

What would be an easy way to find those out? I can't see anything
interesting in the output of sysrq-t.
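
The only idea so far is to poke at them with drgn, something like the sketch
below (run inside the drgn CLI where `prog` is predefined; assumes the
per-CPU symbols you named are visible via the kernel debug info):

  # Rough drgn sketch: check whether any CPU has a pending RCU Tasks
  # callback, and if so print its ->func.
  from drgn.helpers.linux.cpumask import for_each_possible_cpu
  from drgn.helpers.linux.percpu import per_cpu_ptr

  for name in ("rcu_tasks__percpu", "rcu_tasks_rude__percpu",
               "rcu_tasks_trace__percpu"):
      base = prog[name].address_of_()
      for cpu in for_each_possible_cpu(prog):
          head = per_cpu_ptr(base, cpu).cblist.head
          if head:
              # ->func of the pointed-to rcu_head names the callback.
              print(name, "cpu", cpu, head.func)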

> Alternatively, if your system has hundreds of thousands of tasks and
> you have attached BPF programs to short-lived socket structures and you
> don't yet have the workaround, then you can see hangs.  (I am working on a
> longer-term fix.)  In the short term, applying the workaround is the right
> thing to do.  (Adding a couple of the BPF guys on CC for their thoughts.)

The system is pretty much idle after a fresh reboot. The only workload is
to run the script.

Thread overview: 44+ messages
2022-05-12  8:50 [PATCH 0/6] Drain remote per-cpu directly v3 Mel Gorman
2022-05-12  8:50 ` [PATCH 1/6] mm/page_alloc: Add page->buddy_list and page->pcp_list Mel Gorman
2022-05-13 11:59   ` Nicolas Saenz Julienne
2022-05-19  9:36   ` Vlastimil Babka
2022-05-12  8:50 ` [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations Mel Gorman
2022-05-19  9:45   ` Vlastimil Babka
2022-05-12  8:50 ` [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper Mel Gorman
2022-05-13 12:01   ` Nicolas Saenz Julienne
2022-05-19  9:52   ` Vlastimil Babka
2022-05-23 16:09   ` Qais Yousef
2022-05-24 11:55     ` Mel Gorman
2022-05-25 11:23       ` Qais Yousef
2022-05-12  8:50 ` [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue Mel Gorman
2022-05-13 12:03   ` Nicolas Saenz Julienne
2022-05-19 10:57   ` Vlastimil Babka
2022-05-19 12:13     ` Mel Gorman
2022-05-19 12:26       ` Vlastimil Babka
2022-05-12  8:50 ` [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock Mel Gorman
2022-05-13 12:22   ` Nicolas Saenz Julienne
2022-05-12  8:50 ` [PATCH 6/6] mm/page_alloc: Remotely drain per-cpu lists Mel Gorman
2022-05-12 19:37   ` Andrew Morton
2022-05-13 15:04     ` Mel Gorman
2022-05-13 15:19       ` Nicolas Saenz Julienne
2022-05-13 18:23         ` Mel Gorman
2022-05-17 12:57           ` Mel Gorman
2022-05-12 19:43 ` [PATCH 0/6] Drain remote per-cpu directly v3 Andrew Morton
2022-05-13 14:23   ` Mel Gorman
2022-05-13 19:38     ` Andrew Morton
2022-05-16 10:53       ` Mel Gorman
2022-05-13 12:24 ` Nicolas Saenz Julienne
2022-05-17 23:35 ` Qian Cai
2022-05-18 12:51   ` Mel Gorman
2022-05-18 16:27     ` Qian Cai
2022-05-18 17:15       ` Paul E. McKenney
2022-05-19 13:29         ` Qian Cai [this message]
2022-05-19 19:15           ` Paul E. McKenney
2022-05-19 21:05             ` Qian Cai
2022-05-19 21:29               ` Paul E. McKenney
2022-05-18 17:26   ` Marcelo Tosatti
2022-05-18 17:44     ` Marcelo Tosatti
2022-05-18 18:01 ` Nicolas Saenz Julienne
2022-05-26 17:19 ` Qian Cai
2022-05-27  8:39   ` Mel Gorman
2022-05-27 12:58     ` Qian Cai
