On Mon, Aug 16, 2021 at 05:41:57PM -0400, Johannes Weiner wrote:
> On Mon, Aug 16, 2021 at 11:28:55AM +0800, Feng Tang wrote:
> > On Thu, Aug 12, 2021 at 11:19:10AM +0800, Feng Tang wrote:
> > > On Tue, Aug 10, 2021 at 07:59:53PM -1000, Linus Torvalds wrote:
> > [SNIP]
> >
> > > And it seems there is some cache false sharing when accessing the mem_cgroup
> > > member 'struct cgroup_subsys_state'. From the offsets (0x0 and 0x10 here)
> > > and the calling sites, the cache false sharing could happen between:
> > >
> > >     cgroup_rstat_updated (read memcg->css.cgroup, offset 0x0)
> > > and
> > >     get_mem_cgroup_from_mm
> > >       css_tryget(&memcg->css) (read/write memcg->css.refcnt, offset 0x10)
> > >
> > > (This could be wrong, as many of the functions are inlined and the
> > > exact calling site isn't shown)
>
> Thanks for digging more into this.
>
> The offset 0x0 access is new in the page instantiation path with the
> bisected patch, so that part makes sense. The new sequence is this:
>
>   shmem_add_to_page_cache()
>     mem_cgroup_charge()
>       get_mem_cgroup_from_mm()
>         css_tryget()                    # touches memcg->css.refcnt
>     xas_store()
>     __mod_lruvec_page_state()
>       __mod_lruvec_state()
>         __mod_memcg_lruvec_state()
>           __mod_memcg_state()
>             __this_cpu_add()
>             cgroup_rstat_updated()      # touches memcg->css.cgroup
>
> whereas before, __mod_memcg_state() would just do stuff inside memcg.

Yes, the perf record/report data also show these two as hotspots: one
takes about 6% of cpu cycles, the other about 10%.

> However, css.refcnt is a percpu-refcount. We should see a read-only
> lookup of the base pointer inside this cacheline, with the write
> occurring in percpu memory elsewhere. Even if it were in atomic/shared
> mode, which it shouldn't be for the root cgroup, the shared atomic_t
> is also located in an auxiliary allocation and shouldn't overlap with
> the cgroup pointer in any way.
>
> The css itself is embedded inside struct mem_cgroup, which does see
> modifications. But the closest of those is 3 cachelines down (struct
> page_counter memory), so that doesn't make sense, either.
>
> Does this theory require writes? Because I don't actually see any (hot
> ones, anyway) inside struct cgroup_subsys_state for this workload.

You are right: the access to 'css.refcnt' is a read, and false sharing
is a kind of interference between reads and writes. I had presumed it
was a global reference count and that the try_get was a write operation.

From the perf-c2c data, the in-cacheline hotspots are only 0x0 and 0x10;
if we extend the view to 2 cachelines, there is one more offset, 0x54
(css.flags), but I still can't figure out which member inside the
128-byte range is written frequently.

/* pahole info for cgroup_subsys_state */
struct cgroup_subsys_state {
	struct cgroup *            cgroup;               /*     0     8 */
	struct cgroup_subsys *     ss;                   /*     8     8 */
	struct percpu_ref          refcnt;               /*    16    16 */
	struct list_head           sibling;              /*    32    16 */
	struct list_head           children;             /*    48    16 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct list_head           rstat_css_node;       /*    64    16 */
	int                        id;                   /*    80     4 */
	unsigned int               flags;                /*    84     4 */
	u64                        serial_nr;            /*    88     8 */
	atomic_t                   online_cnt;           /*    96     4 */

	/* XXX 4 bytes hole, try to pack */

	struct work_struct         destroy_work;         /*   104    32 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */

Since the test runs imply this is cacheline related, and I'm not very
familiar with the mem_cgroup code, the original perf-c2c log is
attached, which may give more hints.
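
To double-check the 'adjacent cacheline prefetch' theory against the
layout above, here is a tiny userspace calculation (illustrative only;
the 64-byte line size and 128-byte prefetch-pair size are assumptions
about the test machine) of where the hot offsets reported by perf-c2c
land:

/* which 64B cacheline / 128B prefetch pair do the hot offsets hit? */
#include <stdio.h>

int main(void)
{
	const struct { const char *name; unsigned long off; } hot[] = {
		{ "css.cgroup", 0x00 },	/* read by cgroup_rstat_updated() */
		{ "css.refcnt", 0x10 },	/* read by css_tryget()           */
		{ "css.flags",  0x54 },	/* extra hotspot in 2-line view   */
	};

	for (unsigned long i = 0; i < sizeof(hot) / sizeof(hot[0]); i++)
		printf("%-12s offset 0x%02lx -> 64B line %lu, 128B pair %lu\n",
		       hot[i].name, hot[i].off, hot[i].off / 64, hot[i].off / 128);
	return 0;
}

All three offsets fall into the same 128-byte pair (pair 0), which is
consistent with the padding results quoted below.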
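
And going back to the percpu-refcount point: below is a minimal
userspace model of that fast path (illustrative only; the real code
lives in include/linux/percpu-refcount.h, and the struct and function
names here are made up). It shows why a tryget should only read the
base pointer stored in the css cacheline, while the counter increment
lands in separate per-CPU memory:

#include <stdio.h>

#define NR_FAKE_CPUS 4

/* stands in for real percpu memory, allocated far from the struct */
static unsigned long fake_percpu_area[NR_FAKE_CPUS];

struct fake_percpu_ref {
	unsigned long *percpu_count;	/* this pointer lives in the hot struct */
};

static void fake_ref_get(struct fake_percpu_ref *ref, int cpu)
{
	unsigned long *base = ref->percpu_count;	/* read-only access to the hot line */

	base[cpu]++;	/* the only write, and it lands elsewhere */
}

int main(void)
{
	struct fake_percpu_ref ref = { .percpu_count = fake_percpu_area };

	fake_ref_get(&ref, 0);
	fake_ref_get(&ref, 1);
	printf("cpu0=%lu cpu1=%lu\n", fake_percpu_area[0], fake_percpu_area[1]);
	return 0;
}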

Thanks,
Feng

> > > And to verify this, we did a test by adding padding between
> > > memcg->css.cgroup and memcg->css.refcnt to push them into 2
> > > different cache lines, and the performance is partly restored:
> > >
> > > dc26532aed0ab25c  2d146aa3aa842d7f5065802556b  73371bf27a8a8ea68df2fbf456b
> > > ----------------  ---------------------------  ---------------------------
> > >     65523232 ± 4%     -40.8%   38817332 ± 5%     -19.6%   52701654 ± 3%   vm-scalability.throughput
> > >
> > > We are still checking more, and will update if there is new data.
> >
> > Seems this is the second case to hit 'adjacent cacheline prefetch';
> > the first time we saw it, it was also related to mem_cgroup:
> > https://lore.kernel.org/lkml/20201125062445.GA51005@shbuild999.sh.intel.com/
> >
> > In the previous debug patch, 'css.cgroup' and 'css.refcnt' were
> > separated into 2 cache lines, which are still adjacent (2N and 2N+1)
> > cachelines. With more padding (128 bytes of padding in between), the
> > performance is restored, and is even better (test run 3 times):
> >
> > dc26532aed0ab25c  2d146aa3aa842d7f5065802556b  2e34d6daf5fbab0fb286dcdb3bc
> > ----------------  ---------------------------  ---------------------------
> >     65523232 ± 4%     -40.8%   38817332 ± 5%     +23.4%   80862243 ± 3%   vm-scalability.throughput
> >
> > The debug patch is:
> >
> > --- a/include/linux/cgroup-defs.h
> > +++ b/include/linux/cgroup-defs.h
> > @@ -142,6 +142,8 @@ struct cgroup_subsys_state {
> >  	/* PI: the cgroup subsystem that this css is attached to */
> >  	struct cgroup_subsys *ss;
> >  
> > +	unsigned long pad[16];
> > +
> >  	/* reference count - access via css_[try]get() and css_put() */
> >  	struct percpu_ref refcnt;
>
> We aren't particularly space-constrained in this structure, so padding
> should generally be acceptable here.
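
(Just to illustrate the direction, not a proposed patch: the same
separation could also be expressed with an alignment attribute instead
of a raw pad[] array. The userspace mock below only mimics the first
members of struct cgroup_subsys_state, and the 128-byte figure assumes
the adjacent-line prefetch granularity seen on the test box.)

#include <stdio.h>
#include <stddef.h>

struct mock_css {
	void *cgroup;				/* hot: read by rstat updates */
	void *ss;
	unsigned long refcnt[2] __attribute__((aligned(128)));
						/* hot: read by css_tryget() */
};

int main(void)
{
	printf("cgroup at %zu (128B pair %zu), refcnt at %zu (128B pair %zu)\n",
	       offsetof(struct mock_css, cgroup),
	       offsetof(struct mock_css, cgroup) / 128,
	       offsetof(struct mock_css, refcnt),
	       offsetof(struct mock_css, refcnt) / 128);
	return 0;
}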