* [RFC] Alternative design for fast register physical memory
@ 2022-05-24 22:28 Bob Pearson
  2022-05-26  6:01 ` lizhijian
  2022-05-27 12:42 ` Jason Gunthorpe
  0 siblings, 2 replies; 4+ messages in thread
From: Bob Pearson @ 2022-05-24 22:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun, linux-rdma, Hack,
	Jenny (Ft. Collins),
	Frank Zago

Jason,

There has been a lot of chatter about FMRs in rxe recently, and I am also trying to help out the folks
who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
two, with separate rkeys and memory maps, so that apps can pipeline the preparation of IO with doing IO.

However, I am convinced that the current design only works by accident, when it works at all. The thing that
really makes a hash of it is retries. Unfortunately, the documentation on all this is almost non-existent. Lustre
(actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue,
with a mixture of local and remote writes accessing these MRs interleaved with REG_MR and INVALIDATE_MR local
work requests. When a packet gets dropped from a WQE deep in the send queue, the result is that nothing works at all.

We have a workaround, fencing all the local operations, which more or less works but will have bad performance.
The maps used in FMRs have fairly short lifetimes, but definitely longer than we can support today. I am
trying to work out the semantics of everything.

IBA view of FMRs:

verb: ib_alloc_mr(pd, max_num_sg)			- creates empty MR object
	roughly Allocate L_Key

verb: ib_dereg_mr(mr)					- destroys MR object

verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- builds a map for MR
	roughly (Re)Register Physical Memory Region

verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey

send wr: IB_WR_REG_MR(mr, key)				- moves MR from FREE to VALID and updates
	roughly Fast Register Physical Memory Region	  key portion of l/rkey to key

send_wr: IB_WR_LOCAL_INV(invalidate_rkey)		- invalidate local MR moves MR to FREE

send_wr: IB_SEND_WITH_INV(invalidate_rkey)		- invalidate remote MR moves MR to FREE


To make this all recoverable in the face of errors, let there be more than one map present for an
FMR, indexed by the key portion of the l/rkeys.

Alternative view of FMRs:

verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
							  with l/rkey = [index, key] with index
							  fixed and key some initial value.

verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey

verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
							  or by re-using an INVALID map. Maps are
							  all the same size (max_num_sg). The
							  key (index) of this map is the current
							  key from l/rkey. The initial state of
							  the map is FREE. (and thus not usable
							  until a REG_MR work request is used.)

verb: ib_dereg_mr(mr)					- free all maps and the MR.

send_wr: IB_WR_REG_MR(mr, key)				- Find mr->map[key] and change its state
							  to VALID. Associate QP with map since
							  it will be hard to manage multiple QPs
							  trying to use the same MR at the same
							  time with differing states. Fail if the
							  map is not FREE. A map with that key must
							  have been created by ib_map_mr_sg with
							  the same key previously. Check the current
							  number of VALID maps and if this exceeds
							  a limit pause the send queue until there
							  is room to reg another MR.

send_wr: IB_WR_LOCAL_INV (execute)			- Look up the map with the same index and key.
							  Change its state to FREE and dissociate it
							  from the QP. Leave the map contents the same.
			 (complete)			- When the local invalidate operation is
							  completed (after all previous send queue WQEs
							  have completed) change its state to INVALID
							  and place map resources on a free list or
							  free memory.

send_wr: IB_SEND_WITH_INV				- same except at remote end.

retry:							- If a retry occurs for a send queue, back up
							  the requester to the first incomplete PSN.
							  Restore all maps which were VALID at that
							  PSN to VALID. This will require maintaining
							  a list of valid maps at the boundary of
							  completed and un-completed WQEs.

Arrival of RDMA packet					- Look up the MR from the index and the map
							  from the key and, if the state is VALID,
							  carry out the memory copy.


This is an improvement over the current state. At the moment we have only two maps: one for making new
ones and one for doing IO. There is no room to back up, but the retry logic assumes that
you can, which is false. This can be fixed easily by forcing all local operations to be fenced,
which is what we are doing at the moment at HPE, but that can insert long delays between every new FMR instance.
By allowing three maps and then fencing, we can back up one broken IO operation without too much of a delay.

Even if you have a clean network, the current design of the retransmit timer, which is never cleared and which
can fire frequently, can make a mess of the MB-sized IOs used for storage.

Bob

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] Alternative design for fast register physical memory
  2022-05-24 22:28 [RFC] Alternative design for fast register physical memory Bob Pearson
@ 2022-05-26  6:01 ` lizhijian
  2022-05-27 12:42 ` Jason Gunthorpe
  1 sibling, 0 replies; 4+ messages in thread
From: lizhijian @ 2022-05-26  6:01 UTC (permalink / raw)
  To: Bob Pearson, Jason Gunthorpe, Zhu Yanjun, linux-rdma, Hack,
	Jenny (Ft. Collins),
	Frank Zago



On 25/05/2022 06:28, Bob Pearson wrote:
> Jason,
>
> There has been a lot of chatter on the FMRs in rxe recently and I am also trying to help out at home with
> the folks who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
> two with separate rkeys and memory maps so that apps can pipeline the preparation of IO and doing IO.
>
> However, I am convinced that the current design only works by accident when it works. The thing that really
> makes a hash of it is retries. Unfortunately the documentation on all this is almost non existent. Lustre
> (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue
> with a mixture of local and remote writes accessing these MRs interleaved with REG_MRs and INVALIDATE_MR local
> work requests. When a packet gets dropped from a WQE deep in the send queue the result is nothing works at all.
>
> We have a work around by fencing all the local operations which more or less works but will have bad performance.
> The maps used in FMRs have fairly short lifetimes but definitely longer than we can support today. I am
> trying to work out the semantics of everything.
>
> IBA view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- creates empty MR object
> 	roughly Allocate L_Key
>
> verb: ib_dereg_mr(mr)					- destroys MR object
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- builds a map for MR
> 	roughly (Re)Register Physical Memory Region
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> send wr: IB_WR_REG_MR(mr, key)				- moves MR from FREE to VALID and updates
> 	roughly Fast Register Physical Memory Region	  key portion of l/rkey to key
>
> send_wr: IB_WR_LOCAL_INV(invalidate_rkey)		- invalidate local MR moves MR to FREE
>
> send_wr: IB_SEND_WITH_INV(invalidate_rkey)		- invalidate remote MR moves MR to FREE
>
>
> To make this all recoverable in the face of errors let there be more than one map present for an
> FMR indexed by the key portion of the l/rkeys.
>
> Alternative view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
> 							  with l/rkey = [index, key] with index
> 							  fixed and key some initial value.
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
> 							  or by re-using an INVALID map. Maps are
> 							  all the same size (max_num_sg). The
> 							  key (index) of this map is the current
> 							  key from l/rkey. The initial state of
> 							  the map is FREE. (and thus not usable
> 							  until a REG_MR work request is used.)
>
> verb: ib_dereg_mr(mr)					- free all maps and the MR.
>
> send_wr: IB_WR_REG_MR(mr, key)				- Find mr->map[key] and change its state
> 							  to VALID. Associate QP with map since
> 							  it will be hard to manage multiple QPs
> 							  trying to use the same MR at the same
> 							  time with differing states. Fail if the
> 							  map is not FREE. A map with that key must
> 							  have been created by ib_map_mr_sg with
> 							  the same key previously. Check the current
> 							  number of VALID maps and if this exceeds
> 							  a limit pause the send queue until there
> 							  is room to reg another MR.
>
> send_wr: IB_WR_LOCAL_INV (execute)			- Lookup a map with the same index and key
> 							  Change its state to FREE and dissociate
> 							  from QP. Leave map contents the same.
> 			 (complete)			- When the local invalidate operation is
> 							  completed (after all previous send queue WQEs
> 							  have completed) change its state to INVALID
> 							  and place map resources on a free list or
> 							  free memory.
>
> send_wr: IB_SEND_WITH_INV				- same except at remote end.
>
> retry:							- if a retry occurs for a send queue. Back up
> 							  the requester to the first incomplete PSN.
> 							  Change the state of all maps which were
> 							  VALID at that PSN back to VALID. This will
> 							  require maintaining a list of valid maps
> 							  at the boundary of completed and un-completed
> 							  WQEs.
>
> Arrival of RDMA packet					  Lookup MR from index and map from key and if
> 							  the state is VALID carry out the memory copy.
>
>
> This is an improvement over the current state. At the moment we have only two maps one for making new
> ones and one for doing IO. There is no room to back up but at the moment the retry logic assumes that
> you can which is false. This can be fixed easily by forcing all local operations to be fenced
> which is what we are doing at the moment at HPE. This can insert long delays between every new FMR instance.
> By allowing three maps and then fencing we can back up one broken IO operation without too much of a delay.

Hi Bob

I think I have understood almost all of your approach except the *retry/back up* part, which I cannot yet fully picture.
It sounds good to me.
But I think the *retry* part should be a new feature on top of our existing bug reports about FMRs, which all of the following were trying to fix:
https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
https://lore.kernel.org/all/dfba7eb7-8467-59b5-2c2a-071ed1e4949f@gmail.com/T/
https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@linux.dev/t/

I'm convinced that this approach can help with this bug. Shall we focus on fixing the known FMR bugs above first, and then improve the *retry* feature?

Thanks
Zhijian


>
> Even if you have a clean network the current design of the retransmit timer which is never cleared and which
> can fire frequently can make a mess of MB sized IOs used for storage.
>
> Bob


* Re: [RFC] Alternative design for fast register physical memory
  2022-05-24 22:28 [RFC] Alternative design for fast register physical memory Bob Pearson
  2022-05-26  6:01 ` lizhijian
@ 2022-05-27 12:42 ` Jason Gunthorpe
  2022-06-08 11:23   ` lizhijian
  1 sibling, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2022-05-27 12:42 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Zhu Yanjun, linux-rdma, Hack, Jenny (Ft. Collins), Frank Zago

On Tue, May 24, 2022 at 05:28:00PM -0500, Bob Pearson wrote:

> We have a work around by fencing all the local operations which more
> or less works but will have bad performance.  The maps used in FMRs
> have fairly short lifetimes but definitely longer than we can
> support today. I am trying to work out the semantics of everything.

IBTA specifies the fence requirements; I thought we decided RXE, or
maybe even Lustre, wasn't following the spec?

> To make this all recoverable in the face of errors let there be more
> than one map present for an FMR indexed by the key portion of the
> l/rkeys.

Real HW doesn't have more than one map; this seems like the wrong
direction.

As we discussed, there is something wrong with how rxe is processing
its queues; it isn't following the IBTA-defined behaviors in the
exceptional cases.

> 
> Alternative view of FMRs:
> 
> verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
> 							  with l/rkey = [index, key] with index
> 							  fixed and key some initial value.
> 
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
> 
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
> 							  or by re-using an INVALID map. Maps are
> 							  all the same size (max_num_sg). The
> 							  key (index) of this map is the current
> 							  key from l/rkey. The initial state of
> 							  the map is FREE. (and thus not usable
> 							  until a REG_MR work request is used.)

More than one map is nonsense; real HW has a single map, and an MR
object is that single map.

> This is an improvement over the current state. At the moment we have
> only two maps one for making new ones and one for doing IO. There is
> no room to back up but at the moment the retry logic assumes that
> you can which is false. This can be fixed easily by forcing all
> local operations to be fenced which is what we are doing at the
> moment at HPE. This can insert long delays between every new FMR
> instance.  By allowing three maps and then fencing we can back up
> one broken IO operation without too much of a delay.

IMHO you need to go back to one map and fix the queue processing
logic to be spec compliant.

Jason


* Re: [RFC] Alternative design for fast register physical memory
  2022-05-27 12:42 ` Jason Gunthorpe
@ 2022-06-08 11:23   ` lizhijian
  0 siblings, 0 replies; 4+ messages in thread
From: lizhijian @ 2022-06-08 11:23 UTC (permalink / raw)
  To: Bob Pearson
  Cc: Zhu Yanjun, linux-rdma, Hack, Jenny (Ft. Collins),
	Frank Zago, Jason Gunthorpe

Hey Bob

Sorry to disturb you. Are you going to revert to a single map, as Jason suggested?


Thanks
Zhijian


On 27/05/2022 20:42, Jason Gunthorpe wrote:
> On Tue, May 24, 2022 at 05:28:00PM -0500, Bob Pearson wrote:
>
>> We have a work around by fencing all the local operations which more
>> or less works but will have bad performance.  The maps used in FMRs
>> have fairly short lifetimes but definitely longer than we can
>> support today. I am trying to work out the semantics of everything.
> IBTA specifies the fence requirements, I thought we decided RXE or
> maybe even lustre wasn't following the spec?
>
>> To make this all recoverable in the face of errors let there be more
>> than one map present for an FMR indexed by the key portion of the
>> l/rkeys.
> Real HW doesn't have more than one map, this seems like the wrong
> direction.
>
> As we discussed, there is something wrong with how rxe is processing
> its queues; it isn't following the IBTA-defined behaviors in the
> exceptional cases.
>
>> Alternative view of FMRs:
>>
>> verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
>> 							  with l/rkey = [index, key] with index
>> 							  fixed and key some initial value.
>>
>> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>>
>> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
>> 							  or by re-using an INVALID map. Maps are
>> 							  all the same size (max_num_sg). The
>> 							  key (index) of this map is the current
>> 							  key from l/rkey. The initial state of
>> 							  the map is FREE. (and thus not usable
>> 							  until a REG_MR work request is used.)
> More than one map is nonsense; real HW has a single map, and an MR object is that
> single map.
>
>> This is an improvement over the current state. At the moment we have
>> only two maps one for making new ones and one for doing IO. There is
>> no room to back up but at the moment the retry logic assumes that
>> you can which is false. This can be fixed easily by forcing all
>> local operations to be fenced which is what we are doing at the
>> moment at HPE. This can insert long delays between every new FMR
>> instance.  By allowing three maps and then fencing we can back up
>> one broken IO operation without too much of a delay.
> IMHO you need to go back to one map and fix the queue processing
> logic to be spec compliant.
>
> Jason


end of thread, other threads:[~2022-06-08 11:23 UTC | newest]
