* Is the Scratchpad Implementation Using a LUT Standard?
@ 2017-10-10 12:57 Doug Meyer
  2017-10-10 14:17 ` Allen Hubbe
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Meyer @ 2017-10-10 12:57 UTC (permalink / raw)
  To: linux-ntb



Gents,

As I continue to gain understanding about the NTB code, I am wondering 
about the use and requirements surrounding scratchpads, at least as I see 
things in the Switchtec code.

In there, I see that a LUT (LUT0) is used, the size of the LUTs apparently 
being hard-coded to 64 KiB, for a shared memory window (struct shared_mw) 
which contains an array of 128 u32 for the scratchpad. Also, it appears 
that this is what is used both to determine link status and to pass memory 
address/size information between hosts (ports, peers). I apologize if I 
have that wrong. Please correct me.

The goal here is just to gain understanding... learn about required APIs vs 
philosophical decisions vs convenience, etc.

My question is whether this is a fixture of the NTB architecture, or a 
convenience to support something else (the latter being the actual 
requirement)?

In particular, and most importantly, I'm wondering about struct 
shared_mw. Could the Switchtec message registers have been used?

What do people think about how this technique scales when there are more 
than two peers? Obviously LUTs are a precious resource, and a LUT per peer 
shared_mw is expensive. A LUT broken up into many shared_mw is a 
possibility, though there is always risk of trashing stuff.

Also, if the LUTs need to be much larger (for application use), then a 
large chunk of the BAR space could be used for a relatively small structure.

I'd love to hear anyone's thoughts on this.

As an aside, I'm curious why the LUT size is 64 KiB? Was this just a nice 
number as a starting point?

Thanks again, folks.

Blessings,
Doug


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-10 12:57 Is the Scratchpad Implementation Using a LUT Standard? Doug Meyer
@ 2017-10-10 14:17 ` Allen Hubbe
  2017-10-10 15:52   ` Doug Meyer
  0 siblings, 1 reply; 12+ messages in thread
From: Allen Hubbe @ 2017-10-10 14:17 UTC (permalink / raw)
  To: 'Doug Meyer', 'linux-ntb'

From: Doug Meyer
> Gents,
> 
> As I continue to gain understanding about the NTB code, I am wondering about the use and requirements
> surrounding scratchpads, at least as I see things in the Switchtec code.
> 
> In there, I see that a LUT (LUT0) is used, the size of the LUTs apparently being hard-coded to 64 KiB,
> for a shared memory window (struct shared_mw) which contains an array of 128 u32 for the scratchpad.
> Also, it appears that this is what is used both to determine link status and to pass memory
> address/size information between hosts (ports, peers). I apologize if I have that wrong. Please
> correct me.

The struct shared_mw is unique to the Switchtec driver, and your interpretation of its mechanism matches my understanding.

Other drivers determine link state from the hardware, and only expose scratchpads if they are implemented in hardware.

> The goal here is just to gain understanding... learn about required APIs vs philosophical decisions vs
> convenience, etc.
> 
> My questions are whether this is a fixture of the NTB architecture, or if this is a convenience to
> support something else (the latter being a requirement)?
> 
> In particular,
> 
> But most importantly, I'm wondering about struct shared_mw. Could the Switchtec message registers have
> been used?

About a year ago when Serge joined the team, we spent a while trying to unify the message and scratchpad api.  At the same time, there is a preference to keep ntb.h very light and expose the hardware functionality as directly as possible.  We decided to split the apis for scratchpads and message registers.  For hardware that supports message registers, it should expose those via the message api.

What this currently implies is that the next layer up driver needs to work with either scratchpads or message registers.  If a driver only works with spads, then it is not portable.  I would like there to be some library code added to the common ntb bus driver to help with that.  Serge is currently making changes to the ntb_transport driver to support multi-port and message registers on IDT.  It may only work for the transport driver at first, but I have some hope that it could be transformed into library code.

> What do people think about how this technique scales when there are more than two peers? Obviously

That limitation was stated upfront with that driver submission.  The Switchtec driver only works with two nodes for now.

> LUTs are a precious resource, and a LUT per peer shared_mw is expensive. A LUT broken up into many
> shared_mw is a possibility, though there is always risk of trashing stuff.
> Also, if the LUTs need to be much larger (for application use), then a large chunk of the BAR space
> could be used for a relatively small structure.
> I'd love to hear anyone's thoughts on this.
> 
> As an aside, I'm curious why the LUT size is 64 KiB? Was this just a nice number as a starting point?
> 
> Thanks again, folks.
> 
> Blessings,
> Doug



* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-10 14:17 ` Allen Hubbe
@ 2017-10-10 15:52   ` Doug Meyer
  2017-10-11 16:19     ` lsgunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Meyer @ 2017-10-10 15:52 UTC (permalink / raw)
  To: linux-ntb



Dear Allen,

Thank you for your reply and additional information.

Two things: first, I've added a verbose comment to your words about 
messages and scratchpads below, because it dovetails with a discussion I 
had over here yesterday. Second, I need to pursue the crux of my initial 
post further, which is mostly focused on the bottom part (extension of 
Switchtec).

On Tuesday, October 10, 2017 at 7:17:58 AM UTC-7, Allen Hubbe wrote:
>
> From: Doug Meyer 
> > Gents, 
> > 
> [... elided by dmeyer]
> > 
> > But most importantly, I'm wondering about struct shared_mw. Could the 
> > Switchtec message registers have been used? 
>
> About a year ago when Serge joined the team, we spent a while trying to 
> unify the message and scratchpad api.  At the same time, there is a 
> preference to keep ntb.h very light and expose the hardware functionality 
> as directly as possible.  We decided to split the apis for scratchpads and 
> message registers.  For hardware that supports message registers, it should 
> expose those via the message api.  


> What this currently implies is that the next layer up driver needs to work 
> with either scratchpads or message registers.  If a driver only works with 
> spads, then it is not portable.  I would like there to be some library code 
> added to the common ntb bus driver to help with that.  Serge is currently 
> making changes to the ntb_transport driver to support multi-port and 
> message registers on IDT.  It may only work for the transport driver at 
> first, but I have some hope that it could be transformed into library code. 
>

This is a very interesting couple of paragraphs. And my reply here could 
be painful for you and the team because I'm so new to this...

Yesterday I had a discussion about where the switch resources would be 
managed/tracked, versus where the resource requests were coming from. In 
some ways it looks like an MVC architecture (I could really be wrong here). 
To expand on that, ntb.ko exports the abstracted View of the switch as well 
as the Control API. Clients of ntb.ko (e.g. ntb_transport) use that and 
export their own View and Control APIs. In this case, the switch's Model is 
in ntb.ko, with the hardware-specific plug-ins supporting that Model 
directly by any means necessary.

What I *think* that implies is that ntb.c manages the model's resources, 
but the hardware plug-in needs enough sophistication about switch 
hardware resource management strategies (not just message registers and 
scratchpads, but LUTs, direct-mapped windows, doorbells, etc.), knowledge 
of what can and can't be dynamically configured in the switch, and of what 
has been reserved during probe/init for infrastructure, so that it has a 
decent arsenal of ways to try to fulfill whatever is being asked from above.

Given that: whether the configuration is truly static (resources allocated 
and fixed at start-of-day and never changed), or dynamic with hot-plug 
events, application or other fabric management configuring and freeing 
shared memory, perhaps opening and closing windows to large NVMe memory 
space, or who knows what, it is the layer below ntb.ko that has to try to 
fulfill each ask, but the layer above ntb.ko that manages the overall PCIe 
fabric presented by the Views on all the connected peers.

The ramifications of this would be that ntb.h (by that I think you mean the 
upward-facing API for all ntb.ko clients) is the View and Control 
abstraction, and it can hopefully stay thin, with the cost being that ntb.c 
and the hardware plug-ins probably get "fat" because they need to not just 
manage the Model, but pick up all the slack between the abstracted API and 
the hardware.

Am I even remotely close?

Also, regarding your portability requirement: on one hand, it sounds like 
messages and scratchpads have distinct properties that make it advantageous 
to expose them separately in ntb.h, and yet the portability requirement 
seems to want to abstract them as a single interface. Can you offer up any 
further thoughts on this, please?

> > What do people think about how this technique scales when there are more 
> than two peers? Obviously 
>
> That limitation was stated upfront with that driver submission.  The 
> Switchtec driver only works with two nodes for now. 
>

So this section of my email is where I am really hoping to gain 
understanding...

I realize that the Switchtec driver has the two-peer limitation, and I 
apologize that I did not make it clear that the context here is my work to 
remove that limitation (in increments). My current task is to move 
switchtec and ntb-hw-switchtec (by hook or by crook) to support up to four 
peers. 

I can achieve this in a brute-force way, which I may have to do with the 
time allotted for the task, but my hope is to understand what I assume is a 
lot of good thought that went into the current design. By gaining 
understanding, I may be able to choose a near-term approach which is not 
entirely throw-away code, while also having understanding to be careful to 
retain parts of the design which you all have deemed essential (or at least 
preferred). In addition, perhaps some thought had/has already been given to 
how the current design would be extended, and so this design was chosen. I 
don't know, so I'm asking.

This is the only way I know to begin engaging on this. If there is a better 
or preferred approach, please let me know.

All of the following observations were my attempt to have folks give me 
some understanding if this approach is necessary, sufficient, or something 
else, as well as to see what thought has been given to scaling up the 
peers. I should have been explicit (sorry). I'm guessing y'all have thought 
about where you'd like to see the design go and where you see issues.

For now, I'm likely to merely create an array of shared_mw based on 
partition number and restrict the Switchtec configuration to use only the 
first four partitions. But if somebody says "I was thinking that this would 
actually be best done as Y, and it's probably little more work than your 
near-term hack" then I'm sure going to listen!

> > LUTs are a precious resource, and a LUT per peer shared_mw is expensive. 
> > A LUT broken up into many shared_mw is a possibility, though there is 
> > always risk of trashing stuff. 
> > Also, if the LUTs need to be much larger (for application use), then a 
> > large chunk of the BAR space could be used for a relatively small structure. 
> > I'd love to hear anyone's thoughts on this. 
> > 
> > As an aside, I'm curious why the LUT size is 64 KiB? Was this just a 
> > nice number as a starting point? 
> > 
> > Thanks again, folks. 
>

Thanks again (and again).

Blessings,
Doug



* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-10 15:52   ` Doug Meyer
@ 2017-10-11 16:19     ` lsgunthorpe
  2017-10-11 17:41       ` Serge Semin
  0 siblings, 1 reply; 12+ messages in thread
From: lsgunthorpe @ 2017-10-11 16:19 UTC (permalink / raw)
  To: linux-ntb





> Obviously LUTs are a precious resource, and a LUT per peer shared_mw is 
> expensive. A LUT broken up into many shared_mw is a possibility, though 
> there is always risk of trashing stuff.


Well, as far as I know, switchtec is the only hardware with LUTs. But they 
aren't that precious a resource. If I remember correctly, switchtec 
supports up to 128 of them and up to 48 separate partitions so you could 
easily have 2 LUTs per peer and still have plenty left over to do other 
things with. And as far as we know now, there is no other use for LUT 
windows. Thus, I would say, using one per peer is perfectly acceptable.
 
On Tuesday, October 10, 2017 at 9:52:06 AM UTC-6, Doug Meyer wrote:
>
> Also, regarding your portability requirement. On one hand, it sounds like 
> messages and scratchpads have distinct properties that make it advantageous 
> to expose separately in ntb.h, and yet the portability requirement seems to 
> want to abstract them as a single interface. Can you offer up any further 
> thoughts on this, please?
>

Yes, scratchpads and messages are distinct enough that you can't shoe-horn 
messages into the scratchpad api (I tried this a long time ago). In the 
end, drivers will really just need some interface to say "transfer this 
information to peer X" and another interface for the peer to receive the 
data. I agree with Allen in that we need a library to accomplish this based 
on what the hardware provides through the NTB api. As it currently is 
there's a lot of duplication in ntb_transport and ntb_perf for this. 


> For now, I'm likely to merely create an array of shared_mw based on 
> partition number and restrict the Switchtec configuration to use only the 
> first four partitions. But if somebody says "I was thinking that this would 
> actually be best done as Y, and it's probably little more work than your 
> near-term hack" then I'm sure going to listen!
>

This was my long term plan too: create one shared_mw per peer. Though, I 
see no reason to restrict the implementation to 4 partitions. Creating an 
N-way mapping shouldn't be much harder than a 4-way mapping.

Thanks for your work on this,

Logan

[-- Attachment #1.2: Type: text/html, Size: 2811 bytes --]


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 16:19     ` lsgunthorpe
@ 2017-10-11 17:41       ` Serge Semin
  2017-10-11 18:03         ` Logan Gunthorpe
  2017-10-11 18:23         ` D Meyer
  0 siblings, 2 replies; 12+ messages in thread
From: Serge Semin @ 2017-10-11 17:41 UTC (permalink / raw)
  To: lsgunthorpe; +Cc: linux-ntb

On Wed, Oct 11, 2017 at 09:19:20AM -0700, lsgunthorpe@gmail.com wrote:
> 
> 
> > Obviously LUTs are a precious resource, and a LUT per peer shared_mw is 
> > expensive. A LUT broken up into many shared_mw is a possibility, though 
> > there is always risk of trashing stuff.
> 
> 
> Well, as far as I know, switchtec is the only hardware with LUTs. But they 
> aren't that precious a resource. If I remember correctly, switchtec 
> supports up to 128 of them and up to 48 separate partitions so you could 
> easily have 2 LUTs per peer and still have plenty left over to do other 
> things with. And as far as we know now, there is no other use for LUT 
> windows. Thus, I would say, using one per peer is perfectly acceptable.
>  

You are mistaken: the IDT PCIe switch has LUTs. There are 24 of them
available per NTB in the current hardware. All of them can be used by the
IDT NTB driver, as can all the MWs with direct address translation.
Additionally, it is only true to speak of just one LUT entry per peer if
you have just two NTB ports per device. What if you have eight or even more
of them? In that case you'd need to reserve as many LUT entries as there
are peers to connect to, and that reservation doesn't seem so inexpensive.
As far as I can see from the Switchtec brief documents, there can be up to
48 NTBs per device. So in that case we'd lose 48 MWs just to have some
stupid scratchpads, which is, according to your numbers, a third of all the
LUT MWs available to a port.

> On Tuesday, October 10, 2017 at 9:52:06 AM UTC-6, Doug Meyer wrote:
> >
> > Also, regarding your portability requirement. On one hand, it sounds like 
> > messages and scratchpads have distinct properties that make it advantageous 
> > to expose separately in ntb.h, and yet the portability requirement seems to 
> > want to abstract them as a single interface. Can you offer up any further 
> > thoughts on this, please?
> >
> 
> Yes, scratchpads and messages are distinct enough that you can't shoe-horn 
> messages into the scratchpad api (I tried this a long time ago). In the 
> end, drivers will really just need some interface to say "transfer this 
> information to peer X" and another interface for the peer to receive the 
> data. I agree with Allen in that we need a library to accomplish this based 
> on what the hardware provides through the NTB api. As it currently is 
> there's a lot of duplication in ntb_transport and ntb_perf for this. 
> 

Such a library can be partly spun off from my implementation of the
ntb_perf driver, which is currently at the debug stage on Dave's Intel
hardware (see the patchset in the mailing list or here:
https://github.com/fancer/ntb).
There is a service subsystem in there which encapsulates the NTB message
and NTB scratchpad usage to set up the memory windows.
Everyone who promised to implement such a library can use the new ntb_perf
driver as a reference, obviously once we have finally finished debugging it.
And yes, there is no good way to simulate NTB scratchpads using NTB
messaging; I tried it in my first attempt to develop the IDT NTB driver.
The interfaces are too different. And I'd say we shouldn't do it, since
they are hardware specifics which must be reflected by the NTB API.

> 
> > For now, I'm likely to merely create an array of shared_mw based on 
> > partition number and restrict the Switchtec configuration to use only the 
> > first four partitions. But if somebody says "I was thinking that this would 
> > actually be best done as Y, and it's probably little more work than your 
> > near-term hack" then I'm sure going to listen!
> >
> 
> This was my long term plan too: create one shared_mw per peer. Though, I 
> see no reason to restrict the implementation to 4 partitions. Creating an 
> N-way mapping shouldn't be much harder than a 4-way mapping.
> 

As I said before, it is not much harder to create an N-port Switchtec
hardware driver than to make it work for just four ports. You'll need the
port descriptor array in any case. I did the same in the IDT NTB hardware
driver; you can check it out in the code.

Regards,
-Sergey

> Thanks for your work on this,
> 
> Logan
> 



* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 17:41       ` Serge Semin
@ 2017-10-11 18:03         ` Logan Gunthorpe
  2017-10-12 18:31           ` D Meyer
  2017-10-11 18:23         ` D Meyer
  1 sibling, 1 reply; 12+ messages in thread
From: Logan Gunthorpe @ 2017-10-11 18:03 UTC (permalink / raw)
  To: Serge Semin; +Cc: linux-ntb

On 11/10/17 11:41 AM, Serge Semin wrote:

> You are mistaken. IDT PCIe switch have LUT. There are 24 of them available at
> the current hardware for each NTB. All of them are available to be used over
> IDT NTB driver as well as all MWs with direct address translation.
> Additionally It is true to say about just one LUT entry usage per peer, only if
> you got just two NTB-ports available per device. What if you got eight or even more
> of them? In this case you'd need to have as many LUT entries reserved as there are
> peers available to have connections to. In this case reservation doesn't seem so
> inexpensive. As far as I can see from the switchtec brief documents, there can be
> up to 48 NTBs per device. So in this case we'd loose 48 MWs just to have some
> stupid scratchpads, which is, according to your words, a third part of all the LUT
> MWs available to a port.
Ah, fair, I wasn't aware the IDT had LUT support. Using up a third of the 
LUT windows does not seem like a problem at all considering we have no 
other use for them.

Note: the switchtec code also used the shared_mw for link management, not 
just providing scratchpads. But once the upper layers are reworked to not 
require scratchpads I'd be fine with removing them from the shared_mw.

Logan


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 17:41       ` Serge Semin
  2017-10-11 18:03         ` Logan Gunthorpe
@ 2017-10-11 18:23         ` D Meyer
  2017-10-11 18:44           ` Logan Gunthorpe
  1 sibling, 1 reply; 12+ messages in thread
From: D Meyer @ 2017-10-11 18:23 UTC (permalink / raw)
  To: Serge Semin; +Cc: lsgunthorpe, linux-ntb

On Wed, Oct 11, 2017 at 10:41 AM, Serge Semin <fancer.lancer@gmail.com> wrote:
> On Wed, Oct 11, 2017 at 09:19:20AM -0700, lsgunthorpe@gmail.com <lsgunthorpe@gmail.com> wrote:
>>
>>
>> > Obviously LUTs are a precious resource, and a LUT per peer shared_mw is
>> > expensive. A LUT broken up into many shared_mw is a possibility, though
>> > there is always risk of trashing stuff.
>>
>>
>> Well, as far as I know, switchtec is the only hardware with LUTs. But they
>> aren't that precious a resource. If I remember correctly, switchtec
>> supports up to 128 of them and up to 48 separate partitions so you could
>> easily have 2 LUTs per peer and still have plenty left over to do other
>> things with. And as far as we know now, there is no other use for LUT
>> windows. Thus, I would say, using one per peer is perfectly acceptable.
>>
>
> You are mistaken. IDT PCIe switch have LUT. There are 24 of them available at
> the current hardware for each NTB. All of them are available to be used over
> IDT NTB driver as well as all MWs with direct address translation.
> Additionally It is true to say about just one LUT entry usage per peer, only if
> you got just two NTB-ports available per device. What if you got eight or even more
> of them? In this case you'd need to have as many LUT entries reserved as there are
> peers available to have connections to. In this case reservation doesn't seem so
> inexpensive. As far as I can see from the switchtec brief documents, there can be
> up to 48 NTBs per device. So in this case we'd loose 48 MWs just to have some
> stupid scratchpads, which is, according to your words, a third part of all the LUT
> MWs available to a port.

Gents,

First, thanks to Logan and Serge for lots of great information. Y'all
manage to always remind me I'm drinking from a firehose! ;-)

Serge is thinking along the same lines as I am.

Regarding Switchtec, the 96xG3 part has 512 LUTs per Stack, and a stack
can have up to eight NTBs. So for that chip, if I'm starting to grasp
this stuff, a machine with 48 NTBs would mean that on a single Stack,
8 (hosts) x 47 (peers) = 376 LUTs would be used up so that each of the 8
hosts has a shared_mw for every peer in its own LUT.

Also, the current hard-coded LUT size is 64 KiB, but that can't remain
fixed if we want flexibility. If the LUTs were, say, 16 MiB, then the
shared_mw LUTs would use up a massive part of the BAR space.

Oh... A quick side question: for Switchtec, I see that the number of
available LUTs for a BAR is read out of the chip and then rounded down
to a power of two... I'm curious why it's rounded down?

Blessings,
Doug


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 18:23         ` D Meyer
@ 2017-10-11 18:44           ` Logan Gunthorpe
  2017-10-12 18:05             ` D Meyer
  0 siblings, 1 reply; 12+ messages in thread
From: Logan Gunthorpe @ 2017-10-11 18:44 UTC (permalink / raw)
  To: D Meyer, Serge Semin; +Cc: linux-ntb

On 11/10/17 12:23 PM, D Meyer wrote:
> Regarding Switchtec, the 96xG3 part has 512 LUTs per Stack and a stack
> can have up to eight NTBs. So for that chip, if I'm starting to grasp
> this stuff, a machine with 48 NTBs would mean that on a single Stack,
> 8 (hosts) x 47 (peers) = 376 LUTs would be used up for each of the 8
> to have a shared_mw in its own LUT.
Yes, there are a bunch of annoying restrictions like that, but your 
example looks correct. There are lots of LUTs to play around with. The 
bigger restriction is the direct windows, of which you only have 2 per 
port. Creating a network in ntb_transport (et al) to communicate across 
48 partitions is going to be a very hard problem to solve.
> Also, the current hard-coded LUT size is 64 KiB, but that can't remain
> that way to have flexibility. If the LUTs were, say 16 MiB, then the
> shred_mw LUTs use up a massive part of the BAR space.
Yes, all LUTs must be the same size. Plus the LUT space comes before the 
direct window space in the BAR. So the alignment (and therefore maximum 
size) of the direct window space depends on the size and the number of 
LUTs. I believe I chose 32 64k LUTs so that the direct window aligns to 
2MB. I originally had the size set to 4k, but this limited the alignment 
of the direct window.

So the tradeoffs are: if you increase it you waste memory for LUTs that 
don't need the extra space, and if you decrease it you limit the size of 
the direct window.
> Oh... A quick side question: For swithtec, I see that the number of
> available LUTs for a BAR is read out of the chip and then rounded down
> to a power of two... I'm curious why it's rounded down?
This has to do with the alignment I mentioned above. If the number of 
LUTs is not a power of two, the direct window won't be nicely aligned 
and you get some annoying restrictions on its size.

Logan


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 18:44           ` Logan Gunthorpe
@ 2017-10-12 18:05             ` D Meyer
  2017-10-12 18:33               ` Logan Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: D Meyer @ 2017-10-12 18:05 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Serge Semin, linux-ntb

Dear Logan,

Thanks so much for the very, very helpful reply!

On Wed, Oct 11, 2017 at 11:44 AM, Logan Gunthorpe <lsgunthorpe@gmail.com> wrote:
> On 11/10/17 12:23 PM, D Meyer wrote:
>>
>> Regarding Switchtec, the 96xG3 part has 512 LUTs per Stack and a stack
>> can have up to eight NTBs. So for that chip, if I'm starting to grasp
>> this stuff, a machine with 48 NTBs would mean that on a single Stack,
>> 8 (hosts) x 47 (peers) = 376 LUTs would be used up for each of the 8
>> to have a shared_mw in its own LUT.
>
> Yes, there are a bunch of annoying restrictions like that, but your example
> looks correct.
> There are lots of LUTs to play around with. The bigger restriction is the
> direct windows, of
> which you only have 2 per port. Creating a network in ntb_transport (et al)
> to communicate across
> 48 partitions is going to be a very hard problem to solve.

Regarding your comment about using direct windows for ntb_transport, I
don't understand why ntb_transport has to use a direct window.
Couldn't it (or some other client) use one or more LUTs? I don't see a
restriction/difference in the Switchtec specification that would
prevent that from being done.

It looks like the current default max direct memory window size is 2
MiB, which could be a LUT (if that were the LUT size). Having a LUT
size of 2 MiB doesn't seem like any sort of reach at all... in
previous hardware that I worked with, we had BARs that were very, very
large. Having 32 LUTs * 2 MiB isn't big when you consider that GPGPU
configurations are using BARs larger than 4 GiB.

Over a year ago, PCI-SIG proposed changing the BAR maximum to 2^63 to
permit complete access to the entire space; they noted:
> Currently, limiting the resizable BARs to 512 GB means resources are either: a) simply not allocated and left out of the system,
> or b) forced to report a smaller aperture in order to be allocated, but that aperture size is not optimal in all uses of the product
> and may cause software to need to “bank” or otherwise move the aperture at runtime.

Imagine the case where many different clients are using different
capabilities in the fabric: NVMe, GPUs, ntb_transport and its kin,
etc. As far as I can tell, LUTs are a great way to be able to allocate
chunks of BAR space and set up mappings on an as-needed basis.

Also, as far as I understand the Switchtec part, I can reconfigure the
LUT number and size as well as the direct window size (as long as
nothing is currently using them). That's nice too, though it would be
awesome to eventually have a switch that supports richer features that
give greater flexibility as well as dynamic reconfiguration of memory
windows (ya gotta dream!).

Blessings,
Doug


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-11 18:03         ` Logan Gunthorpe
@ 2017-10-12 18:31           ` D Meyer
  2017-10-12 18:38             ` Logan Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: D Meyer @ 2017-10-12 18:31 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Serge Semin, linux-ntb

On Wed, Oct 11, 2017 at 11:03 AM, Logan Gunthorpe <lsgunthorpe@gmail.com> wrote:
> On 11/10/17 11:41 AM, Serge Semin wrote:
>
>> You are mistaken. IDT PCIe switch have LUT. There are 24 of them available
>> at
>> the current hardware for each NTB. All of them are available to be used
>> over
>> IDT NTB driver as well as all MWs with direct address translation.
>> Additionally It is true to say about just one LUT entry usage per peer,
>> only if
>> you got just two NTB-ports available per device. What if you got eight or
>> even more
>> of them? In this case you'd need to have as many LUT entries reserved as
>> there are
>> peers available to have connections to. In this case reservation doesn't
>> seem so
>> inexpensive. As far as I can see from the switchtec brief documents, there
>> can be
>> up to 48 NTBs per device. So in this case we'd loose 48 MWs just to have
>> some
>> stupid scratchpads, which is, according to your words, a third part of all
>> the LUT
>> MWs available to a port.
>
> Ah, fair, I wasn't aware the IDT had LUT support. Using up a third of
> the LUT windows does not seem like a problem at all considering we have
> no other use for them.
> Note: the switchtec code also used the shared_mw for link management,
> not just for providing scratchpads. But once the upper layers are
> reworked to not require scratchpads I'd be fine with removing them from
> the shared_mw.

Dear Logan, everyone,

It seems like there are LOTs of uses for LUTs.

Imagine accessing PiB of storage with memory semantics (early last
year PCI-SIG noted a vendor selling an endpoint capable of 2 TB), a
sea of GPGPU cards, etc. in a heterogeneous cloud computing
environment, and wanting to control access to windows of that storage
used by multiple applications running on one or more hosts. A small
number of windows is not sufficient. LUTs provide a good method for
establishing long-lasting as well as ephemeral translations and, in
conjunction with an IOMMU, PCIe's ACS, various QoS features, etc.,
could provide a good start on data protection and client performance.

This is much bigger than ntb_transport: think networks with NVMe
drives and GPU endpoints. Fun!

LUTs aren't an afterthought. I think they're really valuable.

Blessings,
Doug


* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-12 18:05             ` D Meyer
@ 2017-10-12 18:33               ` Logan Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Logan Gunthorpe @ 2017-10-12 18:33 UTC (permalink / raw)
  To: D Meyer; +Cc: Serge Semin, linux-ntb

On 12/10/17 12:05 PM, D Meyer wrote:
> Regarding your comment about using direct windows for ntb_transport, I
> don't understand why ntb_transport has to use a direct window.
> Couldn't it (or some other client) use one or more LUTs? I don't see a
> restriction/difference in the Switchtec specification that would
> prevent that from being done.
Yes, true. Though for performance you'd then need the LUT windows to be
larger.
> It looks like the current default max direct memory window size is 2
> MiB, which could be a LUT (If that was the LUT size). Having a LUT
> size of 2 MiB doesn't seem like any sort of reach at all... in
> previous hardware that I worked with, we had BARs that were very, very
> large. Having 32 LUTs * 2 MiB isn't big when you consider that GPGPU
> configurations are using BARs larger than 4 GiB.
True. You just get tripped up slightly by all the LUTs having to be the
same size.

> Imagine the case where many different clients are using different
> capabilities in the fabric. NVMe, GPUs, ntb_transport and it's kin,
> etc. As far as I can tell, LUTs are a great way to be able to allocate
> chunks of BAR space and set up mappings on an as-needed basis.
Yeah, we are a _long_ way off from having multiple clients use the
resources. Also, you'll find most kernel developers (especially the big
names) will argue against making decisions based on an imagined future.
Your presumptions will more than likely be wrong and you'll have wasted
everyone's time. Code should be written for today's needs; if someone
comes up with some crazy use for NTB, it's their responsibility to
figure out how to change the code to handle it and to justify the extra
complexity to the community.

> Also, as far as I understand the Switchtec part, I can reconfigure the
> LUT number and size as well as the direct window size (as long as
> nothing is currently using them). That's nice too, though it would be
> awesome to eventually have a switch that supports richer features that
> give greater flexibility as well as dynamic reconfiguration of memory
> windows (ya gotta dream!).

Yeah, I also ran up against a bunch of gotchas when dealing with LUT
configuration. There are a number of things that could make it nicer
for software developers, but I wouldn't hold my breath waiting for them
to happen.

Logan



* Re: Is the Scratchpad Implementation Using a LUT Standard?
  2017-10-12 18:31           ` D Meyer
@ 2017-10-12 18:38             ` Logan Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Logan Gunthorpe @ 2017-10-12 18:38 UTC (permalink / raw)
  To: D Meyer; +Cc: Serge Semin, linux-ntb

On 12/10/17 12:31 PM, D Meyer wrote:
> Imagine accessing PiB of storage that uses memory semantics (early
> last year PCI-SIG noted a vendor selling an endpoint capable of 2 TB),
> a sea of GPGPU cards, etc. in a heterogeneous cloud computing
> environment, and wanting to control access to windows of that stuff
> being used by multiple applications running in one or more hosts. A
> small number of windows are not sufficient. LUTs provide a good method
> for being able to establish long-lasting as well as ephemeral
> translations and, in conjunction with IOMMU, PCIe's ACS, etc. as well
> as various QoS features could provide a good start on data protection
> and client performance.
>
> This much bigger than ntb_transport  to support networks with some
> NVMe drives and GPU endpoints. Fun!
>
> LUTs aren't an afterthought. I think they're really valuable.
See my comment in my previous email. Until there's some published code
(preferably upstream) that uses them, it's not worth assuming what
those uses might be and giving yourself arbitrary restrictions that may
never come to pass. The market may never even want what you are
imagining.

Logan




end of thread, other threads:[~2017-10-12 18:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-10 12:57 Is the Scratchpad Implementation Using a LUT Standard? Doug Meyer
2017-10-10 14:17 ` Allen Hubbe
2017-10-10 15:52   ` Doug Meyer
2017-10-11 16:19     ` lsgunthorpe
2017-10-11 17:41       ` Serge Semin
2017-10-11 18:03         ` Logan Gunthorpe
2017-10-12 18:31           ` D Meyer
2017-10-12 18:38             ` Logan Gunthorpe
2017-10-11 18:23         ` D Meyer
2017-10-11 18:44           ` Logan Gunthorpe
2017-10-12 18:05             ` D Meyer
2017-10-12 18:33               ` Logan Gunthorpe
