From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.4 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 552EFC33CB3 for ; Wed, 15 Jan 2020 22:53:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 20DE22187F for ; Wed, 15 Jan 2020 22:53:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=joelfernandes.org header.i=@joelfernandes.org header.b="ucxBhF9g" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730338AbgAOWxx (ORCPT ); Wed, 15 Jan 2020 17:53:53 -0500 Received: from mail-pg1-f196.google.com ([209.85.215.196]:45877 "EHLO mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730126AbgAOWxx (ORCPT ); Wed, 15 Jan 2020 17:53:53 -0500 Received: by mail-pg1-f196.google.com with SMTP id b9so8870356pgk.12 for ; Wed, 15 Jan 2020 14:53:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=GBzXufLzDJf0sHQ0gmQgUghK5kTG0Zw6fGrZMAl/WRE=; b=ucxBhF9gmKhxRRTHRgzyBJg6UBrPac7VbqHUe7y+eA2oWfMo7S1mHzCLfF5G53q+nw mhY1WlZWdY6b7/7VlGZNciC6Dp/hI6ujEroAPpL9n53b5QZTVNtgeVmsi+JrYbc0USMC OVlEuT7fy8WJSdmgJOcMneIzsoGx6Cb7SELBI= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=GBzXufLzDJf0sHQ0gmQgUghK5kTG0Zw6fGrZMAl/WRE=; b=j6pcpD8Oh7ZMPhctLnKg4qDyaEA1AxjSfNgqdaGcI8lal6ZcdPRBYSaPUVe4D9Tjw2 8dTPEP9IMxuuqTawzBYvo2jEromaWkw9v4CcPstX/3wohlhAK1i1/rY4ySdRjo/mtwyh j8K85Ub/Ailz2dAFfHtqwUODBFqXq1aFZkMmbWuWM+sxZyEyiRYRzn0iu1UwxsKsZdZv idKQjjbRc+416AxGOJDaUeOGzw8XB7ci6ExpiPefT6ChFWbMUWhbuJZy4cmSdNGg5uoE K8LmDdyOQvz1meqpVEZULUJ6UF0pHoQWZJhzeLc+T3s1DK2z+gSdqsN3aZ7O/0AOugpr claA== X-Gm-Message-State: APjAAAUX2bOoLajEst/pByocsHL05qck//PbasrXuT0lnUpwSXu4Y4lA KxgVK9ASSXcKogRTGnGd2vQ6+g== X-Google-Smtp-Source: APXvYqwCafHlO3bZd2DN42C9FNLol8CwUcxt9NFfWW2vdlC5Dt685s8SPAss2EZqyC090EuG1aF25w== X-Received: by 2002:a62:1a97:: with SMTP id a145mr35095028pfa.244.1579128832241; Wed, 15 Jan 2020 14:53:52 -0800 (PST) Received: from localhost ([2620:15c:6:12:9c46:e0da:efbf:69cc]) by smtp.gmail.com with ESMTPSA id 11sm23508325pfz.25.2020.01.15.14.53.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Jan 2020 14:53:51 -0800 (PST) Date: Wed, 15 Jan 2020 17:53:50 -0500 From: Joel Fernandes To: Uladzislau Rezki Cc: "Paul E. McKenney" , LKML , RCU , Steven Rostedt , Oleksiy Avramchenko Subject: Re: [PATCH 1/1] rcu/tree: support kfree_bulk() interface in kfree_rcu() Message-ID: <20200115225350.GA246464@google.com> References: <20191231122241.5702-1-urezki@gmail.com> <20200113190315.GA12543@paulmck-ThinkPad-P72> <20200114164937.GA50403@google.com> <20200115131446.GA18417@pc636> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200115131446.GA18417@pc636> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: rcu-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: rcu@vger.kernel.org On Wed, Jan 15, 2020 at 02:14:46PM +0100, Uladzislau Rezki wrote: > Hello, Joel, Paul. > > Thank you for comments and testing! > > > > > > > Nice improvement! > > > > > > But rcuperf uses a single block size, which turns into kfree_bulk() using > > > a single slab, which results in good locality of reference. So I have to > > > > You meant a "single cache" category when you say "single slab"? Just to > > mention, the number of slabs (in a single cache) when a large number of > > objects are allocated is more than 1 (not single). With current rcuperf, I > > see 100s of slabs (each slab being one page) in the kmalloc-32 cache. Each > > slab contains around 128 objects of type kfree_rcu (24 byte object aligned to > > 32-byte slab object). > > > I think that is about using different slab caches to break locality. It > makes sense, IMHO, because usually the system make use of different slabs, > because of different object sizes. From the other hand i guess there are > test cases when only one slab gets used. I was wondering about "locality". A cache can be split into many slabs. Only the data on a page is local (contiguous). If there are a large number of objects, then it goes to a new slab (on the same cache). At least on the kmalloc slabs, there is only 1 slab per page. So for example, if on kmalloc-32 slab, there are more than 128 objects, then it goes to a different slab / page. So how is there still locality? Further the slab (not sure about slub) doesn't seem to do anything at the moment to take advantage of locality within a slab. That said, I am fully supportive of your patch and see the same improvements as well which are for the reasons you mentioned in the changelog. > > > ask... Is this performance result representative of production workloads? > > > > I added more variation to allocation sizes to rcuperf (patch below) to distribute > > allocations across 4 kmalloc slabs (32,64,96 and 128) and I see a signficant > > improvement with Ulad's patch in SLAB in terms of completion time of the > > test. Below are the results. With SLUB I see slightly higher memory > > footprint, I have never used SLUB and not sure who is using it so I am not > > too concerned since the degradation in memory footprint is only slight with > > SLAB having the signifcant improvement. > > > Nice patch! I think, it would be useful to have it in "rcuperf" tool with > extra parameter like "different_obj_sizes". cool, I posted something like this. > > 2.25.0.rc1.283.g88dfdc4193-goog > I also have done some tests with your patch on my Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs > machine to simulate different slab usage: > > dev.2020.01.10a branch > > # Default, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes > [ 83.762963] Total time taken by all kfree'ers: 53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB > [ 80.108401] Total time taken by all kfree'ers: 53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB > [ 76.622252] Total time taken by all kfree'ers: 53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB > > # With the patch, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes > [ 48.265008] Total time taken by all kfree'ers: 23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB > [ 53.263943] Total time taken by all kfree'ers: 23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB > [ 50.366440] Total time taken by all kfree'ers: 24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB > > # Default, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes > [ 81.818576] Total time taken by all kfree'ers: 51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB > [ 77.854866] Total time taken by all kfree'ers: 51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB > [ 76.329577] Total time taken by all kfree'ers: 51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB > > # With the patch, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes > [ 76.254485] Total time taken by all kfree'ers: 50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB > [ 75.891521] Total time taken by all kfree'ers: 50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB > [ 76.172573] Total time taken by all kfree'ers: 50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB > > in case of CONFIG_SLAB there is double increase in performance but slightly higher memory usage. > As for CONFIG_SLUB, i still see higher performance figures + lower memory usage with the patch. Ok, testing today, our results are quite similar. > > Apart of that, I have got the report from the "kernel test robot": > > [ 13.957168] ------------[ cut here ]------------ > [ 13.958256] ODEBUG: free active (active state 1) object type: rcu_head hint: 0x0 > [ 13.962148] WARNING: CPU: 0 PID: 212 at lib/debugobjects.c:484 debug_print_object+0x95/0xd0 > [ 13.964298] Modules linked in: > [ 13.964960] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 5.5.0-rc1-00136-g883a2cefc0684 #1 > [ 13.966712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 > [ 13.968528] Workqueue: events kfree_rcu_work > [ 13.969466] RIP: 0010:debug_print_object+0x95/0xd0 > [ 13.970480] Code: d2 e8 2f 06 d6 ff 8b 43 10 4d 89 f1 4c 89 e6 8b 4b 14 48 c7 c7 88 73 be 82 4d 8b 45 00 48 8b 14 c5 a0 5f 6d 82 e8 7b 65 c6 ff <0f> 0b b9 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 98 b8 0c 83 e8 > [ 13.974435] RSP: 0000:ffff888231677bf8 EFLAGS: 00010282 > [ 13.975531] RAX: 0000000000000000 RBX: ffff88822d4200e0 RCX: 0000000000000000 > [ 13.976730] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8306e028 > [ 13.977568] RBP: ffff888231677c18 R08: 0000000000000000 R09: ffff888231670790 > [ 13.978412] R10: ffff888231670000 R11: 0000000000000003 R12: ffffffff82bc5299 > [ 13.979250] R13: ffffffff82e77360 R14: 0000000000000000 R15: dead000000000100 > [ 13.980089] FS: 0000000000000000(0000) GS:ffffffff82e4f000(0000) knlGS:0000000000000000 > [ 13.981069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 13.981746] CR2: 00007f1e913fc77c CR3: 0000000225ce9000 CR4: 00000000000006f0 > [ 13.982587] Call Trace: > [ 13.982911] __debug_check_no_obj_freed+0x19a/0x200 > [ 13.983494] debug_check_no_obj_freed+0x14/0x20 > [ 13.984036] free_pcp_prepare+0xee/0x1d0 > [ 13.984541] free_unref_page+0x1b/0x80 > [ 13.984994] __free_pages+0x19/0x20 > [ 13.985503] __free_pages+0x13/0x20 > [ 13.985924] slob_free_pages+0x7d/0x90 > [ 13.986373] slob_free+0x34f/0x530 > [ 13.986784] kfree+0x154/0x210 > [ 13.987155] __kmem_cache_free_bulk+0x44/0x60 > [ 13.987673] kmem_cache_free_bulk+0xe/0x10 > [ 13.988163] kfree_rcu_work+0x95/0x310 > [ 13.989010] ? kfree_rcu_work+0x64/0x310 > [ 13.989884] process_one_work+0x378/0x7c0 > [ 13.990770] worker_thread+0x40/0x600 > [ 13.991587] kthread+0x14e/0x170 > [ 13.992344] ? process_one_work+0x7c0/0x7c0 > [ 13.993256] ? kthread_create_on_node+0x70/0x70 > [ 13.994246] ret_from_fork+0x3a/0x50 > [ 13.995039] ---[ end trace cdf242638b0e32a0 ]--- > [child0:632] trace_fd was -1 > > > the trace happens when the kernel is built with CONFIG_DEBUG_OBJECTS_FREE > and CONFIG_DEBUG_OBJECTS_RCU_HEAD. Basically it is not a problem of the patch > itself or there is any bug there. It just does not pair with debug_rcu_head_queue(head) > in the kfree_rcu_work() function, that is why the kernel thinks about freeing > an active object that is not active in reality. > > I will upload a V2 to fix that. Oh good point. Thanks for fixing that. thanks, - Joel