From: "Simmons, James A." <simmonsja@ornl.gov>
To: NeilBrown, James Simmons
Cc: Oleg Drokin, Greg Kroah-Hartman, Linux Kernel Mailing List,
 Lustre Development List
Subject: Re: [lustre-devel] [PATCH 00/20] staging: lustre: convert to rhashtable
Date: Wed, 18 Apr 2018 21:56:44 +0000
Message-ID: <1524088604572.29145@ornl.gov>
In-Reply-To: <87y3hlqk3w.fsf@notabene.neil.brown.name>
References: <152348312863.12394.11915752362061083241.stgit@noble>
 <87y3hlqk3w.fsf@notabene.neil.brown.name>

>>> libcfs in lustre has a resizeable hashtable.
>>> Linux already has a resizeable hashtable, rhashtable, which is better
>>> in most metrics. See https://lwn.net/Articles/751374/ in a few days
>>> for an introduction to rhashtable.
>>
>> Thanks for starting this work. I was thinking about cleaning up the
>> libcfs hash, but your port to rhashtable is way better. How did you
>> gather metrics to see that rhashtable was better than the libcfs hash?
>
>Code inspection and reputation. It is hard to beat inlined lockless
>code for lookups. And rhashtable is heavily used in the network
>subsystem and they are very focused on latency there. I'm not sure that
>insertion is as fast as it can be (I have some thoughts on that) but I'm
>sure lookup will be better.
>I haven't done any performance testing myself, only correctness.
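For anyone on lustre-devel who hasn't used the API being discussed, it
looks roughly like this. This is only a sketch with made-up demo_* names,
not lustre's actual conversion; the lookup path is the inlined,
RCU-protected code referred to above.

#include <linux/rhashtable.h>

struct demo_obj {
	u64			cookie;		/* hash key */
	struct rhash_head	hash;		/* table linkage */
};

static const struct rhashtable_params demo_params = {
	.key_len	= sizeof(u64),
	.key_offset	= offsetof(struct demo_obj, cookie),
	.head_offset	= offsetof(struct demo_obj, hash),
	/* the option that has to stay off in one table until the
	 * rhashtable bug mentioned below is fixed */
	.automatic_shrinking = true,
};

static struct rhashtable demo_table;	/* rhashtable_init() at setup */

static int demo_add(struct demo_obj *obj)
{
	return rhashtable_insert_fast(&demo_table, &obj->hash, demo_params);
}

/* Lockless lookup. The caller needs its own guarantee (its own
 * rcu_read_lock() section, or a reference count) that the returned
 * object stays alive while it is used. */
static struct demo_obj *demo_find(u64 cookie)
{
	return rhashtable_lookup_fast(&demo_table, &cookie, demo_params);
}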
Sorry, this email never reached my infradead account, so I'm replying
from my work account. I was just wondering if numbers were gathered.
I'm curious how well this would scale on some HPC cluster. In any case
I can do a comparison with and without the patches on one of my test
clusters and share the numbers with you using real-world workloads.

>>> This series converts lustre to use rhashtable. This affects several
>>> different tables, and each is different in various ways.
>>>
>>> There are two outstanding issues. One is that a bug in rhashtable
>>> means that we cannot enable auto-shrinking in one of the tables. That
>>> is documented as appropriate and should be fixed soon.
>>>
>>> The other is that rhashtable has an atomic_t which counts the elements
>>> in a hash table. At least one table in lustre went to some trouble to
>>> avoid any table-wide atomics, so that could lead to a regression.
>>> I'm hoping that rhashtable can be enhanced with the option of a
>>> per-cpu counter, or similar.
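The kind of per-cpu counter being hoped for already exists in the kernel
as lib/percpu_counter. Purely as an illustration of the trade-off (this
is not something rhashtable offers today), replacing the atomic_t with
one would look something like:

#include <linux/percpu_counter.h>

static struct percpu_counter demo_nelems;

static int demo_counter_init(void)
{
	/* spreads the count over per-cpu variables, so inserts and
	 * removes on different CPUs no longer bounce one cache line */
	return percpu_counter_init(&demo_nelems, 0, GFP_KERNEL);
}

static void demo_inserted(void)
{
	percpu_counter_inc(&demo_nelems);
}

static void demo_removed(void)
{
	percpu_counter_dec(&demo_nelems);
}

static s64 demo_count_exact(void)
{
	/* the cost moves to readers: an exact count must sum all CPUs;
	 * percpu_counter_read() is cheap but approximate */
	return percpu_counter_sum(&demo_nelems);
}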
>> This doesn't sound quite ready to land just yet. This will have to do
>> some soak testing and a larger scope of tests to make sure no new
>> regressions happen. Believe me, I did work to make lustre work better
>> on tickless systems, which I'm preparing for the linux client, and
>> small changes could break things in interesting ways. I will port the
>> rhashtable change to the Intel development branch and get people more
>> familiar with the hash code to look at it.
>
>Whether it is "ready" or not probably depends on perspective and
>priorities. As I see it, getting lustre cleaned up and out of staging
>is a fairly high priority, and it will require a lot of code change.
>It is inevitable that regressions will slip in (some already have) and
>it is important to keep testing (the test suite is of great benefit, but
>is only part of the story of course). But to test the code, it needs to
>land. Testing the code in Intel's devel branch and then porting it
>across doesn't really prove much. For testing to be meaningful, it
>needs to be tested in a branch that is up-to-date with mainline and on
>track to be merged into mainline.
>
>I have no particular desire to rush this in, but I don't see any
>particular benefit in delaying it either.
>
>I guess I see staging as implicitly a 'devel' branch. You seem to be
>treating it a bit like a 'stable' branch - is that right?

So two years ago no one would touch the linux Lustre client due to it
being so broken. Then, after about a year of work, the client got into a
sort of working state, even to the point that actual sites are using it.
It is still broken, and guess who gets notified of the brokenness :-)
The good news is that people do actually test it, but if it regresses
too much we will lose our testing audience. Sadly it's a chicken-and-egg
problem. Yes, I want to see it leave staging, but I'd like the Lustre
client to be in good working order. If it leaves staging still broken
and no one uses it, then there is not much point.

So to understand why I work with both the Intel development branch and
the linux kernel version, I need to explain my test setup. I test at
different levels.

At level one I have my generic x86 nodes and x86 server back ends.
Pretty vanilla, and the scale is pretty small: 3 server nodes and a
couple of clients. In that environment I can easily test the upstream
client. This is not VMs but real hardware and a real storage back end.

Besides x86 client nodes for level-one testing I also have Power9 ppc
nodes, where it is also possible for me to test the upstream client
directly. In case you want to know: no, lustre doesn't work out of the
box on Power9. The IB stack needs fixing and we need to handle 64K pages
at the LNet layer. I have work that resolves those issues; the fixes
need to be pushed upstream.

Next I have an ARM client cluster to test with, which runs a newer
kernel that is "official". When we first got the cluster I stomped all
over it, but discovered I had to share it, and the other parties
involved were not too happy with my experimenting. The ARM system had
the same issues as the Power9, so now it works. Once the upstream client
is in better shape, I think I could get the other users of the system to
try it out. I have been told the ARM vendor has a strong interest in the
linux lustre client as well.

Once I'm done with that testing, I move to my personal Cray test
cluster. Sadly Cray has unique hardware, so if I use the latest vanilla
upstream kernel that unique hardware no longer works, which makes the
test bed pretty much useless. Since I have to work with an older kernel,
I back-port the linux kernel client work and run tests to make sure it
works. On that cluster I can run real HPC workloads.

Lastly, if the work shows itself stable at that point, I see if I can
put it on a Cray development system that users at my workplace test on.
If the users scream that things are broken, then I find out how they
broke it. General users are very creative at finding ways to break
things that we wouldn't think of.

>I think we have a better chance of being heard if we have "skin in the
>game" and have upstream code that would use this.

I agree. I just want a working baseline that users can work with :-)
For me, landing the rhashtable work is only the beginning. Its behavior
at very large scales needs to be examined to find any potential
bottlenecks.

Which brings me to my next point, since this is cross-posted to the
lustre-devel list. The largest test cluster I can get my hands on is 80
nodes. Would anyone be willing to donate test time on their test bed
clusters to help out in this work? It would be really awesome to test on
a many-thousand-node system. I will be at the Lustre User Group
Conference next week, so if you want to meet up with me to discuss
details, I would love to help you set up your test cluster so we can
really exercise these changes.