From: "Simmons, James A." <simmonsja@ornl.gov>
To: NeilBrown <neilb@suse.com>, James Simmons <jsimmons@infradead.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Lustre Development List <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [PATCH 00/20] staging: lustre: convert to rhashtable
Date: Wed, 18 Apr 2018 21:56:44 +0000
Message-ID: <1524088604572.29145@ornl.gov>
In-Reply-To: <87y3hlqk3w.fsf@notabene.neil.brown.name>


>>> libcfs in lustre has a resizeable hashtable.
>>> Linux already has a resizeable hashtable, rhashtable, which is better
>>> in most metrics. See https://lwn.net/Articles/751374/ in a few days
>>> for an introduction to rhashtable.
>>
>> Thanks for starting this work. I was thinking about cleaning up the libcfs
>> hash but your port to rhashtables is way better. How did you gather
>> metrics to see that rhashtable was better than libcfs hash?
>
>Code inspection and reputation.  It is hard to beat inlined lockless
>code for lookups.  And rhashtable is heavily used in the network
>subsystem and they are very focused on latency there.  I'm not sure that
>insertion is as fast as it can be (I have some thoughts on that) but I'm
>sure lookup will be better.
>I haven't done any performance testing myself, only correctness.

Sorry, this email never reached my infradead account, so I'm replying
from my work account.  I was just wondering whether any numbers had been gathered.
I'm curious how well this would scale on an HPC cluster. In any case,
I can run a comparison with and without the patches on one of my test
clusters and share the numbers with you, using real-world workloads.
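
For anyone on lustre-devel who hasn't looked at rhashtable yet, the shape of
the API being discussed is roughly the following. This is a minimal,
hypothetical sketch (the object type and all the "demo_*" names are made up,
not code from this series): the rhashtable_params describe the key and
linkage offsets once, and lookups are lockless under RCU.

#include <linux/rhashtable.h>

/* Hypothetical object type, only to illustrate the shape of the API. */
struct demo_obj {
	u32			key;
	struct rhash_head	node;	/* linkage owned by the rhashtable */
};

static const struct rhashtable_params demo_params = {
	.key_len	     = sizeof(u32),
	.key_offset	     = offsetof(struct demo_obj, key),
	.head_offset	     = offsetof(struct demo_obj, node),
	.automatic_shrinking = true,	/* the option affected by the bug noted below */
};

static struct rhashtable demo_ht;

static int demo_init(void)
{
	return rhashtable_init(&demo_ht, &demo_params);
}

static int demo_add(struct demo_obj *obj)
{
	return rhashtable_insert_fast(&demo_ht, &obj->node, demo_params);
}

static struct demo_obj *demo_find(u32 key)
{
	/*
	 * rhashtable_lookup_fast() takes and drops the RCU read lock itself;
	 * hold rcu_read_lock() around the call and the use of the result if
	 * the object must remain valid after this returns.
	 */
	return rhashtable_lookup_fast(&demo_ht, &key, demo_params);
}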

>>> This series converts lustre to use rhashtable.  This affects several
>>> different tables, and each is different in various ways.
>>>
>>> There are two outstanding issues.  One is that a bug in rhashtable
>>> means that we cannot enable auto-shrinking in one of the tables.  That
>>> is documented as appropriate and should be fixed soon.
>>>
>>> The other is that rhashtable has an atomic_t which counts the elements
>>> in a hash table.  At least one table in lustre went to some trouble to
>>> avoid any table-wide atomics, so that could lead to a regression.
>>> I'm hoping that rhashtable can be enhanced with the option of a
>>> per-cpu counter, or similar.
>>>
>>
>> This doesn't sound quite ready to land just yet. It will need some
>> soak testing and a larger scope of testing to make sure no new regressions
>> happen. Believe me, I did work to make lustre work better on tickless
>> systems, which I'm preparing for the linux client, and small changes could
>> break things in interesting ways. I will port the rhashtable change to the
>> Intel development branch and get people more familiar with the hash code
>> to look at it.
>
>Whether it is "ready" or not probably depends on perspective and
>priorities.  As I see it, getting lustre cleaned up and out of staging
>is a fairly high priority, and it will require a lot of code change.
>It is inevitable that regressions will slip in (some already have) and
>it is important to keep testing (the test suite is of great benefit, but
>is only part of the story of course).  But to test the code, it needs to
>land.  Testing the code in Intel's devel branch and then porting it
>across doesn't really prove much.  For testing to be meaningful, it
>needs to be tested in a branch that is up-to-date with mainline and on
>track to be merged into mainline.
>
>I have no particular desire to rush this in, but I don't see any
>particular benefit in delaying it either.
>
>I guess I see staging as implicitly a 'devel' branch.  You seem to be
>treating it a bit like a 'stable' branch - is that right?

So two years ago no one would touch the Linux Lustre client because it
was so broken. Then, after about a year of work, the client got into a
sort-of-working state, even to the point that actual sites are using it.
It is still broken, and guess who gets notified of the brokenness :-)
The good news is that people do actually test it, but if it regresses too much
we will lose our testing audience. Sadly, it's a chicken-and-egg problem. Yes, I
want to see it leave staging, but I'd like the Lustre client to be in good working
order. If it leaves staging still broken and no one uses it, then there is not
much point.

So, to understand why I work with both the Intel development branch and the
Linux kernel version, I need to explain my test setup.  I test at different levels.

At level one I have my generic x86 client nodes and x86 server back ends. Pretty
vanilla, and the scale is pretty small: three server nodes and a couple of clients.
In that environment I can easily test the upstream client. These are not VMs but
real hardware with a real storage back end.

Besides the x86 client nodes for level-one testing, I also have Power9 ppc nodes,
so I can test the upstream client directly there as well. In case you want to
know: no, Lustre does not work out of the box on Power9. The IB stack needs
fixing, and we need to handle 64K pages at the LNet layer. I have work that
resolves those issues; the fixes need to be pushed upstream.

Next I have an ARM client cluster to test with, which runs a newer kernel that
is "official".  When we first got the cluster I stomped all over it, but I discovered
I had to share it, and the other parties involved were not too happy with my
experimenting. The ARM system had the same issues as the Power9, so
now it works. Once the upstream client is in better shape, I think I could
get the other users of the system to try it out. I have been told the ARM
vendor has a strong interest in the Linux Lustre client as well.

Once I'm done with that testing, I move on to my personal Cray test cluster.
Sadly, Cray has unique hardware, so if I use the latest vanilla upstream kernel
that unique hardware no longer works, which makes the test bed pretty much
useless. Since I have to work with an older kernel, I back-port the Linux kernel
client work and run tests to make sure it works. On that cluster I can run real
HPC workloads.

Lastly, if the work has proven stable up to this point, I see if I can put it on a Cray
development system that users at my workplace test on. If the users scream that
things are broken, then I find out how they broke them. General users are very
creative at finding ways to break things that we wouldn't think of.

>I think we have a better chance of being heard if we have "skin in the
>game" and have upstream code that would use this.

I agree. I just want a working baseline that users can work with :-) For me,
landing the rhashtable work is only the beginning. Its behavior at very large scales
needs to be examined to find any potential bottlenecks.
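
On the table-wide atomic_t element count Neil mentioned above, the kind of
per-cpu counter he is hoping for could look something like the sketch below.
To be clear, this is only a hypothetical illustration built on the existing
percpu_counter API (the "counted_table" names are made up; it is not
rhashtable code and not part of this series): increments stay per-cpu and
cheap, and the exact total is only summed when someone actually asks for it.

#include <linux/percpu_counter.h>
#include <linux/gfp.h>

/* Hypothetical wrapper, only to show the cost model of a per-cpu count. */
struct counted_table {
	struct percpu_counter nelems;
};

static int counted_table_init(struct counted_table *t)
{
	return percpu_counter_init(&t->nelems, 0, GFP_KERNEL);
}

static void counted_table_note_insert(struct counted_table *t)
{
	/* Common case: touches only this CPU's counter, no shared cacheline. */
	percpu_counter_inc(&t->nelems);
}

static s64 counted_table_count(struct counted_table *t)
{
	/* Rare and comparatively expensive: sums every CPU's contribution. */
	return percpu_counter_sum(&t->nelems);
}

static void counted_table_destroy(struct counted_table *t)
{
	percpu_counter_destroy(&t->nelems);
}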

Which brings me to my next point, since this is cross-posted to the lustre-devel list.
The largest test cluster I can get my hands on is 80 nodes. Would anyone
be willing to donate test time on their test-bed clusters to help out with this work? It would
be really awesome to test on a system of many thousands of nodes. I will be at the Lustre
User Group conference next week, so if you want to meet up with me to discuss details,
I would love to help you set up your test cluster so we can really exercise these changes.

