Re: [PATCH 1/1] rcu/tree: support kfree_bulk() interface in kfree_rcu()

From: Joel Fernandes <joel@joelfernandes.org>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>, RCU <rcu@vger.kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Subject: Re: [PATCH 1/1] rcu/tree: support kfree_bulk() interface in kfree_rcu()
Date: Wed, 15 Jan 2020 17:53:50 -0500	[thread overview]
Message-ID: <20200115225350.GA246464@google.com> (raw)
In-Reply-To: <20200115131446.GA18417@pc636>

On Wed, Jan 15, 2020 at 02:14:46PM +0100, Uladzislau Rezki wrote:
> Hello, Joel, Paul.
> 
> Thank you for comments and testing!
> 
> > > 
> > > Nice improvement!
> > > 
> > > But rcuperf uses a single block size, which turns into kfree_bulk() using
> > > a single slab, which results in good locality of reference.  So I have to
> > 
> > You meant a "single cache" category when you say "single slab"? Just to
> > mention, the number of slabs (in a single cache) when a large number of
> > objects are allocated is more than 1 (not single). With current rcuperf, I
> > see 100s of slabs (each slab being one page) in the kmalloc-32 cache. Each
> > slab contains around 128 objects of type kfree_rcu (24 byte object aligned to
> > 32-byte slab object).
> > 
> I think that is about using different slab caches to break locality. It
> makes sense, IMHO, because usually the system make use of different slabs,
> because of different object sizes. From the other hand i guess there are
> test cases when only one slab gets used.

I was wondering about "locality". A cache can be split into many slabs. Only
the data on a page is local (contiguous). If there are a large number of
objects, then it goes to a new slab (on the same cache). At least on the
kmalloc slabs, there is only 1 slab per page. So for example, if on
kmalloc-32 slab, there are more than 128 objects, then it goes to a different
slab / page. So how is there still locality?

Further the slab (not sure about slub) doesn't seem to do anything at the
moment to take advantage of locality within a slab.

That said, I am fully supportive of your patch and see the same
improvements as well which are for the reasons you mentioned in the changelog.

> > > ask...  Is this performance result representative of production workloads?
> > 
> > I added more variation to allocation sizes to rcuperf (patch below) to distribute
> > allocations across 4 kmalloc slabs (32,64,96 and 128) and I see a signficant
> > improvement with Ulad's patch in SLAB in terms of completion time of the
> > test. Below are the results. With SLUB I see slightly higher memory
> > footprint, I have never used SLUB and not sure who is using it so I am not
> > too concerned since the degradation in memory footprint is only slight with
> > SLAB having the signifcant improvement.
> > 
> Nice patch! I think, it would be useful to have it in "rcuperf" tool with
> extra parameter like "different_obj_sizes".

cool, I posted something like this.

> > 2.25.0.rc1.283.g88dfdc4193-goog
> I also have done some tests with your patch on my Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs
> machine to simulate different slab usage:
> 
> dev.2020.01.10a branch
> 
> # Default, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [   83.762963] Total time taken by all kfree'ers: 53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB
> [   80.108401] Total time taken by all kfree'ers: 53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB
> [   76.622252] Total time taken by all kfree'ers: 53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB
> 
> # With the patch, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [   48.265008] Total time taken by all kfree'ers: 23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB
> [   53.263943] Total time taken by all kfree'ers: 23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB
> [   50.366440] Total time taken by all kfree'ers: 24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB
> 
> # Default, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [   81.818576] Total time taken by all kfree'ers: 51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB
> [   77.854866] Total time taken by all kfree'ers: 51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB
> [   76.329577] Total time taken by all kfree'ers: 51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB
> 
> # With the patch, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [   76.254485] Total time taken by all kfree'ers: 50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB
> [   75.891521] Total time taken by all kfree'ers: 50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB
> [   76.172573] Total time taken by all kfree'ers: 50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB
> 
> in case of CONFIG_SLAB there is double increase in performance but slightly higher memory usage.
> As for CONFIG_SLUB, i still see higher performance figures + lower memory usage with the patch.

Ok, testing today, our results are quite similar.

> 
> Apart of that, I have got the report from the "kernel test robot":
> <snip>
> [   13.957168] ------------[ cut here ]------------
> [   13.958256] ODEBUG: free active (active state 1) object type: rcu_head hint: 0x0
> [   13.962148] WARNING: CPU: 0 PID: 212 at lib/debugobjects.c:484 debug_print_object+0x95/0xd0
> [   13.964298] Modules linked in:
> [   13.964960] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 5.5.0-rc1-00136-g883a2cefc0684 #1
> [   13.966712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [   13.968528] Workqueue: events kfree_rcu_work
> [   13.969466] RIP: 0010:debug_print_object+0x95/0xd0
> [   13.970480] Code: d2 e8 2f 06 d6 ff 8b 43 10 4d 89 f1 4c 89 e6 8b 4b 14 48 c7 c7 88 73 be 82 4d 8b 45 00 48 8b 14 c5 a0 5f 6d 82 e8 7b 65 c6 ff <0f> 0b b9 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 98 b8 0c 83 e8
> [   13.974435] RSP: 0000:ffff888231677bf8 EFLAGS: 00010282
> [   13.975531] RAX: 0000000000000000 RBX: ffff88822d4200e0 RCX: 0000000000000000
> [   13.976730] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8306e028
> [   13.977568] RBP: ffff888231677c18 R08: 0000000000000000 R09: ffff888231670790
> [   13.978412] R10: ffff888231670000 R11: 0000000000000003 R12: ffffffff82bc5299
> [   13.979250] R13: ffffffff82e77360 R14: 0000000000000000 R15: dead000000000100
> [   13.980089] FS:  0000000000000000(0000) GS:ffffffff82e4f000(0000) knlGS:0000000000000000
> [   13.981069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.981746] CR2: 00007f1e913fc77c CR3: 0000000225ce9000 CR4: 00000000000006f0
> [   13.982587] Call Trace:
> [   13.982911]  __debug_check_no_obj_freed+0x19a/0x200
> [   13.983494]  debug_check_no_obj_freed+0x14/0x20
> [   13.984036]  free_pcp_prepare+0xee/0x1d0
> [   13.984541]  free_unref_page+0x1b/0x80
> [   13.984994]  __free_pages+0x19/0x20
> [   13.985503]  __free_pages+0x13/0x20
> [   13.985924]  slob_free_pages+0x7d/0x90
> [   13.986373]  slob_free+0x34f/0x530
> [   13.986784]  kfree+0x154/0x210
> [   13.987155]  __kmem_cache_free_bulk+0x44/0x60
> [   13.987673]  kmem_cache_free_bulk+0xe/0x10
> [   13.988163]  kfree_rcu_work+0x95/0x310
> [   13.989010]  ? kfree_rcu_work+0x64/0x310
> [   13.989884]  process_one_work+0x378/0x7c0
> [   13.990770]  worker_thread+0x40/0x600
> [   13.991587]  kthread+0x14e/0x170
> [   13.992344]  ? process_one_work+0x7c0/0x7c0
> [   13.993256]  ? kthread_create_on_node+0x70/0x70
> [   13.994246]  ret_from_fork+0x3a/0x50
> [   13.995039] ---[ end trace cdf242638b0e32a0 ]---
> [child0:632] trace_fd was -1
> <snip>
> 
> the trace happens when the kernel is built with CONFIG_DEBUG_OBJECTS_FREE
> and CONFIG_DEBUG_OBJECTS_RCU_HEAD. Basically it is not a problem of the patch
> itself or there is any bug there. It just does not pair with debug_rcu_head_queue(head)
> in the kfree_rcu_work() function, that is why the kernel thinks about freeing
> an active object that is not active in reality.
> 
> I will upload a V2 to fix that.

Oh good point. Thanks for fixing that.

thanks,

 - Joel