* [Lustre-devel] replacing Lustre pings with LNet Peer Health
@ 2011-05-12 14:57 Nic Henke
  2011-05-12 17:27 ` Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Nic Henke @ 2011-05-12 14:57 UTC
  To: lustre-devel

Just floating an idea... I'd much appreciate any feedback.

Given bug 12471 where the ptlrpc pinger traffic on a large system can 
approach the ridiculous (2.6M pings every 75s for 160 OSTs and 16K 
clients), I'd like to consider getting rid of the pings entirely.
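
For scale, the quoted figure works out to roughly one ping per client per OST per interval. A quick back-of-the-envelope check, assuming 16K means 16 * 1024 clients and that every client pings every OST once per 75 s interval (both are assumptions, not statements from the bug report):

```python
# Back-of-the-envelope check of the ping volume quoted above.
# Assumptions: 16K means 16 * 1024 clients, and each client pings
# every OST once per 75 s interval.
OSTS = 160
CLIENTS = 16 * 1024
PING_INTERVAL_S = 75

pings_per_interval = OSTS * CLIENTS
pings_per_second = pings_per_interval / PING_INTERVAL_S

print(f"{pings_per_interval:,} pings per interval")
print(f"{pings_per_second:,.0f} pings per second across the system")
```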

The idea would be to extend the approach in the attached patch, which adds
an upper layer callback to lnet_notify() that signals a peer going down or
up. The ptlrpc pinger code would then be changed to record the 'down'
event for an import/export and start an eviction timer, counted from when
the LNet peer was last_alive. If the node comes 'up' before the timer
expires, there is no eviction. The eviction code would then operate only
on nodes with 'down' events and trust that the rest are all ok and
functional.
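
A minimal sketch of that eviction logic, under stated assumptions: the class and method names below are hypothetical and do not exist in Lustre or LNet, and times are plain seconds. It only shows the bookkeeping (record 'down', cancel on 'up', evict on expiry), not how the callback would be registered.

```python
# Hypothetical sketch of the proposed eviction logic; none of these names
# exist in Lustre or LNet. Times are plain seconds for illustration.
from dataclasses import dataclass, field


@dataclass
class DownEventTracker:
    eviction_timeout: float
    # peer id -> time the peer was last known alive
    down_since: dict = field(default_factory=dict)

    def on_notify(self, peer: str, alive: bool, last_alive: float) -> None:
        """Stand-in for the upper layer callback fired from lnet_notify()."""
        if alive:
            # Peer came back before the timer expired: cancel the eviction.
            self.down_since.pop(peer, None)
        else:
            # Timer starts at last_alive, not at the time of the notification.
            self.down_since.setdefault(peer, last_alive)

    def peers_to_evict(self, now: float) -> list:
        """Only peers with a pending 'down' event are ever considered."""
        expired = [p for p, t in self.down_since.items()
                   if now - t >= self.eviction_timeout]
        for p in expired:
            del self.down_since[p]
        return expired


tracker = DownEventTracker(eviction_timeout=100.0)
tracker.on_notify("client-a", alive=False, last_alive=10.0)
tracker.on_notify("client-b", alive=False, last_alive=20.0)
tracker.on_notify("client-b", alive=True, last_alive=50.0)  # recovered in time
evicted = tracker.peers_to_evict(now=120.0)
print(evicted)
```

Peers that never generate a 'down' event never enter the table, which is where the saving over periodic pings would come from.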

Eric - I know this doesn't get us that far down the road toward your new
health network, but it does solve a near term issue with pinger rates on
large systems.

Issues...

- lacks "proof" that peer nodes' ptlrpc queues are moving forward, but
I'm not really sure that is all that important in terms of pinger evictions.

- LNet peer health is a bit "weird" in that it requires an upper layer 
sending a packet to trigger a node moving back to 'up'. We would need to 
address this for proper LNet peer health as it is.

- Might need some beefing up of the standard LNDs to ensure we have good 
peer health data.

Thoughts ?

Nic
-------------- next part --------------
A non-text attachment was scrubbed...
Name: register_notify.diff
Type: text/x-patch
Size: 6030 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20110512/286e205d/attachment.bin>


* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 14:57 [Lustre-devel] replacing Lustre pings with LNet Peer Health Nic Henke
@ 2011-05-12 17:27 ` Andreas Dilger
  2011-05-17 14:27   ` Nic Henke
  2011-05-12 17:37 ` Christopher J. Morrone
  2011-05-17 22:53 ` Isaac Huang
  2 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2011-05-12 17:27 UTC
  To: lustre-devel

On May 12, 2011, at 08:57, Nic Henke wrote:
> Just floating an idea... I'd much appreciate any feedback
> 
> Given bug 12471 where the ptlrpc pinger traffic on a large system can approach the ridiculous (2.6M pings every 75s for 160 OSTs and 16K clients), I'd like to consider getting rid of the pings entirely.
> 
> The idea would be to extend the idea in the attached patch where we add an upper layer callback for lnet_notify() signaling a peer going down or up. The ptlrpc pinger code would then be changed to record the 'down' event for an import/export which would then start an eviction timer that started when the LNet peer was last_alive. If the node comes 'up' before the timer expires, no eviction. The eviction code would then only operate on nodes with 'down' events and trust that the rest are all ok and functional.

One issue is that the Lustre OBD_PING RPC is not just detecting peer death.  It is also reporting the last_committed value to the RPC stack, so that clients can discard RPCs that were committed on the server.  It is also signalling to the server that this client is still alive, so that it doesn't get evicted.  If there are LNET routers in a system, the LNET peer health will only report the health of the routers, and not of the clients or servers behind the routers, so this isn't going to result in a working Lustre filesystem...
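
To illustrate the last_committed point, here is a minimal sketch of how a client might trim the requests it keeps for replay once a reply reports the server's last_committed value. The function and field names are illustrative assumptions, not the actual ptlrpc data structures.

```python
# Hypothetical sketch: how a client might use last_committed to trim its
# replay list. Names are illustrative, not the ptlrpc data structures.
def prune_replay_list(replay_list, last_committed):
    """Keep only requests the server has not yet committed."""
    return [req for req in replay_list if req["transno"] > last_committed]


replay_list = [
    {"transno": 101, "op": "setattr"},
    {"transno": 102, "op": "write"},
    {"transno": 103, "op": "write"},
]

# A ping reply (or any other reply) reports the server's last_committed.
remaining = prune_replay_list(replay_list, last_committed=102)
print([req["transno"] for req in remaining])
```

On an otherwise idle client, the ping reply may be the only message that carries this value, so removing pings would leave the replay list untrimmed.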

> Eric - I know this doesn't get us that far down the road toward your new health network, but does solve a near term issue with pinger rates on large systems.

There would need to be at least some of the health network implemented in order to "pass through" the peer health on the routers, and also to broadcast some of the data, like last_rcvd.

> Issues...
> 
> - lacks "proof" that peer nodes ptlrpc queues are moving forward, but not really sure that is all that important in terms of pinger evictions.
> 
> - LNet peer health is a bit "weird" in that it requires an upper layer sending a packet to trigger a node moving back to 'up'. We would need to address this for proper LNet peer health as it is.
> 
> - Might need some beefing up of the standard LNDs to ensure we have good peer health data.
> 
> Thoughts ?
> 
> Nic


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.


* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 14:57 [Lustre-devel] replacing Lustre pings with LNet Peer Health Nic Henke
  2011-05-12 17:27 ` Andreas Dilger
@ 2011-05-12 17:37 ` Christopher J. Morrone
  2011-05-15  7:44   ` Alexey Lyashkov
  2011-05-17 14:30   ` Nic Henke
  2011-05-17 22:53 ` Isaac Huang
  2 siblings, 2 replies; 7+ messages in thread
From: Christopher J. Morrone @ 2011-05-12 17:37 UTC
  To: lustre-devel

I think Eric's approach is the only sane way I've heard to reduce pings.

Here are some issues that I see with this:

1)  For your solution to work, you require that the lnet layer take on 
pinging duties.  Usually the network, be it IB, TCP, whatever, will not 
provide any active notification of a peer failure.  To notice that a 
peer has died, the lnet LND must, you guessed it, ping.

Usually the LNDs try to be smart.  They only generate their own pings if 
no traffic has been sent to the peer in a certain period of time.  So 
once you eliminate the higher-level pings, they will partly be replaced 
by lower-level pings.

2)  Doesn't work in a routed environment.  Would need a health network 
for clients behind routers to learn that a server has died, and vice versa.
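
The idle-triggered keepalive described in point 1 can be sketched as follows. This is a hypothetical illustration, not actual LND code; the class name and the 30 s interval are assumptions.

```python
# Hypothetical sketch of an idle-triggered keepalive; not actual LND code.
class IdleKeepalive:
    def __init__(self, keepalive_interval: float):
        self.keepalive_interval = keepalive_interval
        self.last_tx = {}  # peer id -> time of last message sent

    def record_send(self, peer: str, now: float) -> None:
        """Any message sent to the peer counts as activity."""
        self.last_tx[peer] = now

    def peers_needing_ping(self, now: float) -> list:
        """Peers idle for at least one interval need an explicit ping."""
        return sorted(p for p, t in self.last_tx.items()
                      if now - t >= self.keepalive_interval)


ka = IdleKeepalive(keepalive_interval=30.0)
ka.record_send("oss1", now=0.0)
ka.record_send("oss2", now=0.0)
ka.record_send("oss1", now=25.0)  # regular traffic keeps oss1 fresh
idle = ka.peers_needing_ping(now=40.0)
print(idle)
```

With the higher-level pings gone, more peers would sit idle long enough to land in this list, which is why the lower-level pings partly replace them.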



* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 17:37 ` Christopher J. Morrone
@ 2011-05-15  7:44   ` Alexey Lyashkov
  2011-05-17 14:30   ` Nic Henke
  1 sibling, 0 replies; 7+ messages in thread
From: Alexey Lyashkov @ 2011-05-15  7:44 UTC
  To: lustre-devel

One problem.

The LNet layer can report that a node is alive while one or more ptlrpc services on that node are dead (for example, after hitting an LBUG).
But yes, generating an LNet event when a node dies would be useful for reducing the time it takes to detect request timeouts.


On May 12, 2011, at 21:37, Christopher J. Morrone wrote:

> I think Eric's approach is the only sane way I've heard to reduce pings.
> 
> Here are some issues that I see with this:
> 
> 1)  For your solution to work, you require that the lnet layer take on 
> pinging duties.  Usually the network, be it IB, TCP, whatever, will not 
> provide any active notification of a peer failure.  To notice that a 
> peer has died, the lnet LND must, you guessed it, ping.
> 
> Usually the LNDs try to be smart.  They only generate their own pings if 
> no traffic has been sent to the peer in a certain period of time.  So 
> once you eliminate the higher-level pings, they will partly be replaced 
> by lower-level pings.
> 
> 2)  Doesn't work in a routed environment.  Would need a health network 
> for clients behind routers to learn that a server has died, and vice versa.

--------------------------------------------
Alexey Lyashkov
alexey_lyashkov at xyratex.com




______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
 
Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
 
Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
 
The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
 


* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 17:27 ` Andreas Dilger
@ 2011-05-17 14:27   ` Nic Henke
  0 siblings, 0 replies; 7+ messages in thread
From: Nic Henke @ 2011-05-17 14:27 UTC
  To: lustre-devel

On 05/12/2011 12:27 PM, Andreas Dilger wrote:
> On May 12, 2011, at 08:57, Nic Henke wrote:
>> Just floating an idea... I'd much appreciate any feedback
>>

> One issue is that the Lustre OBD_PING RPC is not just detecting peer
> death.  It is also reporting the last_committed value to the RPC
> stack, so that clients can discard RPCs that were committed on the
> server.  It is also signalling to the server that this client is
> still alive, so that it doesn't get evicted.  If there are LNET
> routers in a system, the LNET peer health will only report the health
> of the routers, and not of the clients or servers behind the routers,
> so this isn't going to result in a working Lustre filesystem...
>

Good point, I had missed this. Pesky "working" filesystems...

>> Eric - I know this doesn't get us that far down the road toward
>> your new health network, but does solve a near term issue with
>> pinger rates on large systems.
>
> There would need to be at least some of the health network
> implemented in order to "pass through" the peer health on the
> routers, and also to broadcast some of the data, like last_rcvd.

Yeah, not sure how I thinko'd the LNet Router case. We'd need to add 
.lnd_notify into the LNDs and have them broadcast the failures at the 
router level. Not exactly ideal, and I think the use of lnd_notify has 
been dropped in favor of the newer LNet Peer Health.

Cheers,
Nic


* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 17:37 ` Christopher J. Morrone
  2011-05-15  7:44   ` Alexey Lyashkov
@ 2011-05-17 14:30   ` Nic Henke
  1 sibling, 0 replies; 7+ messages in thread
From: Nic Henke @ 2011-05-17 14:30 UTC
  To: lustre-devel

On 05/12/2011 12:37 PM, Christopher J. Morrone wrote:
> I think Eric's approach is the only sane way I've heard to reduce pings.
>
> Here are some issues that I see with this:
>
> 1)  For your solution to work, you require that the lnet layer take on
> pinging duties.  Usually the network, be it IB, TCP, whatever, will not
> provide any active notification of a peer failure.  To notice that a
> peer has died, the lnet LND must, you guessed it, ping.
>

Correct. I had assumed the LNDs would or could be doing the pinging. At 
worst it'd be done on a per-peer basis and not per-import, reducing the 
traffic somewhat. It'd also reduce the number of layers that need to be 
involved in the message RX, providing some CPU usage benefit.
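
As a rough illustration of the per-peer versus per-import difference, the arithmetic below reuses the OST and client counts from the original post. The number of OSTs per server is a made-up value, since the thread does not give one.

```python
# Illustrative arithmetic only. OSTS and CLIENTS come from the thread;
# OSTS_PER_SERVER is a made-up value, the thread does not give one.
OSTS = 160
CLIENTS = 16 * 1024
OSTS_PER_SERVER = 4  # assumption

servers = OSTS // OSTS_PER_SERVER
per_import = CLIENTS * OSTS    # one ping per client per OST
per_peer = CLIENTS * servers   # one ping per client per server node

print(f"per-import: {per_import:,}")
print(f"per-peer:   {per_peer:,}")
print(f"reduction:  {per_import // per_peer}x")
```

The reduction factor is simply the number of targets per server node, so the saving depends entirely on how the servers are laid out.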

> Usually the LNDs try to be smart.  They only generate their own pings if
> no traffic has been sent to the peer in a certain period of time.  So
> once you eliminate the higher-level pings, they will partly be replaced
> by lower-level pings.

Correct, and I thought that sufficient to provide reasonable notification.

Given the LNet router case, I think this idea is a bit DOA... unless I 
find some sort of non-gross magic :-)

Cheers,
Nic


* [Lustre-devel] replacing Lustre pings with LNet Peer Health
  2011-05-12 14:57 [Lustre-devel] replacing Lustre pings with LNet Peer Health Nic Henke
  2011-05-12 17:27 ` Andreas Dilger
  2011-05-12 17:37 ` Christopher J. Morrone
@ 2011-05-17 22:53 ` Isaac Huang
  2 siblings, 0 replies; 7+ messages in thread
From: Isaac Huang @ 2011-05-17 22:53 UTC
  To: lustre-devel

On Thu, May 12, 2011 at 09:57:41AM -0500, Nic Henke wrote:
> ......
> Issues...
> 
> - lacks "proof" that peer nodes ptlrpc queues are moving forward,
> but not really sure that is all that important in terms of pinger
> evictions.
> 
> - LNet peer health is a bit "weird" in that it requires an upper
> layer sending a packet to trigger a node moving back to 'up'. We
> would need to address this for proper LNet peer health as it is.

The idea was that if the upper layer has no interest in sending a peer a
message, LNet does not care whether that peer has come "up" again. But
care must be taken that a message from the upper layer is not dropped
when it is destined for a peer that appears "dead" but whose state LNet
is not sure of, i.e. the death news is too old and we haven't tried to
get an update yet. All of this is so that unnecessary pings can be avoided.
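
The send-time decision described above can be sketched as follows. The function name, parameters, and freshness threshold are illustrative assumptions, not LNet's implementation.

```python
# Hypothetical sketch of the send-time decision described above; the names
# and the freshness threshold are illustrative, not LNet's implementation.
def should_drop(peer_alive: bool, last_news: float, now: float,
                news_max_age: float) -> bool:
    """Drop only when the peer is marked dead AND that news is still fresh."""
    if peer_alive:
        return False
    news_is_stale = (now - last_news) > news_max_age
    # Stale death news: let the message go out so the send itself
    # acts as the probe that can move the peer back to 'up'.
    return not news_is_stale


fresh_death = should_drop(peer_alive=False, last_news=95.0, now=100.0,
                          news_max_age=30.0)
stale_death = should_drop(peer_alive=False, last_news=10.0, now=100.0,
                          news_max_age=30.0)
print(fresh_death, stale_death)
```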

This is also why the router pinger can't be replaced by Peer Health:
without an active router pinger, no more messages would be sent to a
dead router, so nothing would ever move it back to "up".

As others have pointed out, Peer Health is not end-to-end.
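
A toy model of why link-level health is not end-to-end, assuming a hypothetical three-node topology (client1, router1, oss1) in which each node only tracks the health of its directly connected neighbours:

```python
# Hypothetical three-node topology: client1 <-> router1 <-> oss1.
# Link-level peer health only covers directly connected neighbours.
DIRECT_PEERS = {
    "client1": {"router1"},
    "router1": {"client1", "oss1"},
    "oss1": {"router1"},
}


def visible_failures(node: str, failed: set) -> set:
    """Failures a node can learn about from its own link-level health."""
    return DIRECT_PEERS[node] & failed


server_down = visible_failures("client1", {"oss1"})     # server dies behind the router
router_down = visible_failures("client1", {"router1"})  # next hop dies
print(server_down, router_down)
```

In this model the client sees the router fail but learns nothing when the server behind it fails, which is the gap a health network would have to fill.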

Thanks,
Isaac


Thread overview: 7+ messages
2011-05-12 14:57 [Lustre-devel] replacing Lustre pings with LNet Peer Health Nic Henke
2011-05-12 17:27 ` Andreas Dilger
2011-05-17 14:27   ` Nic Henke
2011-05-12 17:37 ` Christopher J. Morrone
2011-05-15  7:44   ` Alexey Lyashkov
2011-05-17 14:30   ` Nic Henke
2011-05-17 22:53 ` Isaac Huang
