* Memory management facing a 400Gpbs network link
@ 2019-02-12 18:25 Christopher Lameter
  2019-02-15 16:34 ` Jerome Glisse
  2019-02-19 12:26 ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-02-12 18:25 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm

400G Infiniband will become available this year. This means that the data
ingest speeds can be higher than the bandwidth of the processor
interacting with its own memory.

For example, a single hardware thread is limited to 20 Gbytes/sec whereas
the network interface provides 50 Gbytes/sec. These rates can currently
only be obtained with pinned memory.
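
For concreteness, a minimal userspace sketch (not from the original
proposal) of what "pinned memory" means here: the receive buffer is
registered with the HCA through libibverbs, which pins its pages so the
NIC can DMA into them at line rate. It assumes libibverbs and an
RDMA-capable device are present; error handling is trimmed.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0])
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd)
        return 1;

    size_t len = 1UL << 30;            /* 1 GiB ingest buffer */
    void *buf;
    if (posix_memalign(&buf, 4096, len))
        return 1;

    /* Registration pins the pages and gives the HCA a DMA-able mapping;
     * this is the step that currently cannot be avoided at these rates. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }

    printf("registered %zu bytes, lkey=0x%x\n", len, mr->lkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}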

How can we evolve the memory management subsystem to operate at higher
speeds with more of the comforts of paging and system calls that we are
used to?

It is likely that these speeds will increase further, and since the lead
processor vendor seems to be caught in a management-induced corporate
suicide attempt we will not likely see any progress on the processors from
there. The straightforward solution would be to use the high-speed fabric
technology for the internal busses (doh!). Alternative processors are
likely to show up in 2019 and 2020, but those will take a long time to
mature.

So what does the future hold and how do we scale up our HPC systems given
these problems?








* Re: Memory management facing a 400Gpbs network link
  2019-02-12 18:25 Memory management facing a 400Gpbs network link Christopher Lameter
@ 2019-02-15 16:34 ` Jerome Glisse
  2019-02-19 12:26 ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Jerome Glisse @ 2019-02-15 16:34 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: lsf-pc, linux-mm

On Tue, Feb 12, 2019 at 06:25:50PM +0000, Christopher Lameter wrote:
> 400G Infiniband will become available this year. This means that the data
> ingest speeds can be higher than the bandwidth of the processor
> interacting with its own memory.
> 
> For example a single hardware thread is limited to 20Gbyte/sec whereas the
> network interface provides 50Gbytes/sec. These rates can only be obtained
> currently with pinned memory.
> 
> How can we evolve the memory management subsystem to operate at higher
> speeds with more the comforts of paging and system calls that we are used
> to?

A couple of questions. This more than saturates PCIe, right? We are
talking 400 Gbps, i.e. ~50 GBytes/s. PCIe 4 is ~32 GBytes/s with x16, so
is this some kind of weird hardware that has two PCIe bridges and can sit
on two different PCIe root complexes? I have heard this idea floating
around as a way to get more bandwidth without having to wait for PCIe 5...
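
As a back-of-envelope check of the numbers above (illustrative only; the
128b/130b encoding figure for PCIe 4.0 is my assumption, not something
stated in the thread):

#include <stdio.h>

int main(void)
{
    /* PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 16 lanes */
    double pcie4_x16_GBps = 16.0 * (128.0 / 130.0) * 16.0 / 8.0;  /* ~31.5 */
    /* 400 Gb/s link, raw bit rate */
    double link400_GBps = 400.0 / 8.0;                            /* 50.0  */

    printf("PCIe 4.0 x16 : ~%.1f GB/s\n", pcie4_x16_GBps);
    printf("400Gb/s link : ~%.1f GB/s\n", link400_GBps);
    return 0;
}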

Regarding memory management, what will be the target memory? Page cache
or private anonymous? Or a mix of both?


More to the point, my feeling is that we want something like a page cache
for dma (and for pages in the page cache we would like to be able to also
keep track of device references). So when there is no memory pressure we
can try to use as much memory as possible, not only for the page cache but
also for a dma cache/pool. This might mean that we will need to rebalance
the page cache and the dma cache/pool depending on a knob set by the
admin.

Obviously when you run out of memory, pressure will degrade the performance.
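
Purely to illustrate the rebalancing idea (nothing like this exists in the
kernel today; the structure and the knob below are hypothetical), the
split could look roughly like this:

#include <stdio.h>

struct mem_budget {
    unsigned long reclaimable_pages;  /* pages the kernel could hand out */
    unsigned int  dma_cache_percent;  /* hypothetical admin knob, 0-100  */
};

/* Split reclaimable memory between the page cache and a dma cache/pool
 * according to the admin-set ratio. */
static void rebalance(const struct mem_budget *b,
                      unsigned long *page_cache_target,
                      unsigned long *dma_cache_target)
{
    *dma_cache_target  = b->reclaimable_pages * b->dma_cache_percent / 100;
    *page_cache_target = b->reclaimable_pages - *dma_cache_target;
}

int main(void)
{
    struct mem_budget b = { 1UL << 20, 25 };  /* 4 GiB worth of 4K pages */
    unsigned long pc, dma;

    rebalance(&b, &pc, &dma);
    printf("page cache target: %lu pages, dma cache target: %lu pages\n",
           pc, dma);
    return 0;
}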


> 
> It is likely that these speeds with increase further and since the lead
> processor vendor seems to be caught in a management induced corporate
> suicide attempt we will not likely see any process on the processors from
> there. The straightforward solution would be to use the high speed tech
> for fabrics for the internal busses (doh!). Alternate processors are
> likely to show up in 2019 and 2020 but those will take a long time to
> mature.
> 
> So what does the future hold and how do we scale up our HPC systems given
> these problems?

I think peer to peer will also be a big part here, for instance RDMA to/from
GPU memory, which completely bypass the main memory. Some HPC people talks
about even having GPU and CPU run almost unrelated workload and thus trying
to isolate them from one another.

Cheers,
Jérôme



* Re: Memory management facing a 400Gpbs network link
  2019-02-12 18:25 Memory management facing a 400Gpbs network link Christopher Lameter
  2019-02-15 16:34 ` Jerome Glisse
@ 2019-02-19 12:26 ` Michal Hocko
  2019-02-19 14:21   ` Christopher Lameter
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-02-19 12:26 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: lsf-pc, linux-mm

On Tue 12-02-19 18:25:50, Cristopher Lameter wrote:
> 400G Infiniband will become available this year. This means that the data
> ingest speeds can be higher than the bandwidth of the processor
> interacting with its own memory.
> 
> For example a single hardware thread is limited to 20Gbyte/sec whereas the
> network interface provides 50Gbytes/sec. These rates can only be obtained
> currently with pinned memory.
> 
> How can we evolve the memory management subsystem to operate at higher
> speeds with more the comforts of paging and system calls that we are used
> to?

Realistically, is there anything we _can_ do when the HW is the
bottleneck?
-- 
Michal Hocko
SUSE Labs



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 12:26 ` Michal Hocko
@ 2019-02-19 14:21   ` Christopher Lameter
  2019-02-19 17:36     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Christopher Lameter @ 2019-02-19 14:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, 19 Feb 2019, Michal Hocko wrote:

> On Tue 12-02-19 18:25:50, Cristopher Lameter wrote:
> > 400G Infiniband will become available this year. This means that the data
> > ingest speeds can be higher than the bandwidth of the processor
> > interacting with its own memory.
> >
> > For example a single hardware thread is limited to 20Gbyte/sec whereas the
> > network interface provides 50Gbytes/sec. These rates can only be obtained
> > currently with pinned memory.
> >
> > How can we evolve the memory management subsystem to operate at higher
> > speeds with more the comforts of paging and system calls that we are used
> > to?
>
> Realistically, is there anything we _can_ do when the HW is the
> bottleneck?

Well, the hardware is one problem. The problem that a single core cannot
handle the full memory bandwidth can be solved by spreading the processing
of the data across multiple processors. So should the memory management
subsystem be aware of that? How do we load balance between cores so that
we can handle the full bandwidth?

The other is that the memory needs to be pinned, and all sorts of special
measures and tuning need to be applied to make this actually work. Is
there any way to simplify this?

Also, the need for page pinning becomes a problem since the majority of
the memory of a system would need to be pinned. At that point, isn't the
application effectively doing the memory management itself?



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 14:21   ` Christopher Lameter
@ 2019-02-19 17:36     ` Michal Hocko
  2019-02-19 18:21       ` Christopher Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-02-19 17:36 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: lsf-pc, linux-mm

On Tue 19-02-19 14:21:50, Cristopher Lameter wrote:
> On Tue, 19 Feb 2019, Michal Hocko wrote:
> 
> > On Tue 12-02-19 18:25:50, Cristopher Lameter wrote:
> > > 400G Infiniband will become available this year. This means that the data
> > > ingest speeds can be higher than the bandwidth of the processor
> > > interacting with its own memory.
> > >
> > > For example a single hardware thread is limited to 20Gbyte/sec whereas the
> > > network interface provides 50Gbytes/sec. These rates can only be obtained
> > > currently with pinned memory.
> > >
> > > How can we evolve the memory management subsystem to operate at higher
> > > speeds with more the comforts of paging and system calls that we are used
> > > to?
> >
> > Realistically, is there anything we _can_ do when the HW is the
> > bottleneck?
> 
> Well the hardware is one problem. The problem that a single core cannot
> handle the full memory bandwidth can be solved by spreading the
> processing of the data to multiple processors. So I think the memory
> subsystem could be aware of that? How do we load balance between cores so
> that we can handle the full bandwidth?

Isn't that something that people already do from userspace?

> The other is that the memory needs to be pinned and all sorts of special
> measures and tuning needs to be done to make this actually work. Is there
> any way to simplify this?
> 
> Also the need for page pinning becomes a problem since the majority of the
> memory of a system would need to be pinned. Actually the application seems
> to be doing the memory management then?

I am sorry, but this still sounds too vague. There are certainly
possibilities to handle part of the MM functionality in userspace.
But why should we discuss that at the MM track? Do you envision any
in-kernel changes that would be needed?
-- 
Michal Hocko
SUSE Labs



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 17:36     ` Michal Hocko
@ 2019-02-19 18:21       ` Christopher Lameter
  2019-02-19 18:42         ` Alexander Duyck
  2019-02-19 19:13         ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-02-19 18:21 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, 19 Feb 2019, Michal Hocko wrote:

> > Well the hardware is one problem. The problem that a single core cannot
> > handle the full memory bandwidth can be solved by spreading the
> > processing of the data to multiple processors. So I think the memory
> > subsystem could be aware of that? How do we load balance between cores so
> > that we can handle the full bandwidth?
>
> Isn't that something that poeple already do from userspace?

Yes. We can certainly do a lot from userspace manually, but this is hard
and involves working around memory management to some extent. The higher
the I/O bandwidth becomes, the less useful memory management becomes.

Can we improve the situation? A VM based on 2M pages has been repeatedly
discussed, for example.

Or some kind of memory management extension that allows working with large
contiguous blocks of memory? These are problematic in their own right
because large contiguous blocks may not be obtainable due to
fragmentation, hence the need to reboot the system if the load changes.
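
To illustrate the large-contiguous-block side of this, a minimal sketch of
allocating 2M-backed memory from userspace today. It assumes the admin has
reserved hugetlb pages (e.g. via /proc/sys/vm/nr_hugepages); the mmap()
simply fails otherwise, which is exactly the kind of fragility being
discussed:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL << 21;   /* 64 x 2 MiB = 128 MiB */

    /* MAP_HUGETLB uses the default huge page size (2 MiB on x86-64) */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* no reserved huge pages available */
        return 1;
    }

    memset(buf, 0, len);   /* touch the pages */
    munmap(buf, len);
    return 0;
}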

> > The other is that the memory needs to be pinned and all sorts of special
> > measures and tuning needs to be done to make this actually work. Is there
> > any way to simplify this?
> >
> > Also the need for page pinning becomes a problem since the majority of the
> > memory of a system would need to be pinned. Actually the application seems
> > to be doing the memory management then?
>
> I am sorry but this still sounds too vague. There are certainly
> possibilities to handle part the MM functionality in the userspace.
> But why should we discuss that at the MM track. Do you envision any
> in kernel changes that would be needed?

Without adapting to these trends, memory management may become just a
part of the system that is mainly useful for running executables, handling
configuration files etc., but not for handling the data going through the
system.

We end up with data fully bypassing the kernel. It is difficult to handle
things that way.

Sorry, this is fuzzy. I wonder if there are solutions to these issues
other than the ones I know of. The solutions mostly mean going directly to
the hardware, because the performance is just not available if the kernel
is involved. If that is unavoidable then we need clean APIs to be able to
carve out memory for these needs.

I can make this more concrete by listing some of the approaches that I am
seeing?

F.e.

A 400G NIC has the ability to route traffic to certain endpoints on
specific cores. Thus traffic volume can be segmented into multiple
streams that are able to be handled by single cores. However, many
data streams (video, audio) have implicit ordering constraints between
packets.




* Re: Memory management facing a 400Gpbs network link
  2019-02-19 18:21       ` Christopher Lameter
@ 2019-02-19 18:42         ` Alexander Duyck
  2019-02-19 19:13         ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Alexander Duyck @ 2019-02-19 18:42 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: Michal Hocko, lsf-pc, linux-mm

On Tue, Feb 19, 2019 at 10:21 AM Christopher Lameter <cl@linux.com> wrote:
>
> On Tue, 19 Feb 2019, Michal Hocko wrote:
>
> > > Well the hardware is one problem. The problem that a single core cannot
> > > handle the full memory bandwidth can be solved by spreading the
> > > processing of the data to multiple processors. So I think the memory
> > > subsystem could be aware of that? How do we load balance between cores so
> > > that we can handle the full bandwidth?
> >
> > Isn't that something that poeple already do from userspace?
>
> Yes. We can certainly do a lot from userspace manually but this is hard
> and involves working around memory management to some extend. The higher
> the I/O bandwidth become the more memory management becomes not that
> useful anymore.
>
> Can we improve the situation? A 2M VM was repeatedly discussed f.e.
>
> Or some kind of memory management extension that allows working with large
> contiguous blocks of memory? Which are problematic in their own
> because large contiguous blocks may not be obtainable due to
> fragmentation. Therefore the need to reboot the system if the
> load changes.
>
> > > The other is that the memory needs to be pinned and all sorts of special
> > > measures and tuning needs to be done to make this actually work. Is there
> > > any way to simplify this?
> > >
> > > Also the need for page pinning becomes a problem since the majority of the
> > > memory of a system would need to be pinned. Actually the application seems
> > > to be doing the memory management then?
> >
> > I am sorry but this still sounds too vague. There are certainly
> > possibilities to handle part the MM functionality in the userspace.
> > But why should we discuss that at the MM track. Do you envision any
> > in kernel changes that would be needed?
>
> Without adapting to these trends memory management may become just a
> part of the system that is mainly useful for running executables, handling
> configuration files etc but not for handling the data going through the
> system.
>
> We end up with data fully bypassing the kernel. Its difficult to handle
> that way.
>
> Sorry this is fuzzy. I wonder if there are other solutions than those
> that I know of for these issues. The solutions mostly mean going directly
> to hardware because the performance is just not available if the kernel is
> involved. If that is unavoidable then we need clean APIs to be able to
> carve out memory for these needs.
>
> I can make this more concrete by listing some of the approaches that I am
> seeing?
>
> F.e.
>
> A 400G NIC has the ability to route traffic to certain endpoints on
> specific cores. Thus traffic volume can be segmented into multiple
> streams that are able to be handled by single cores. However, many
> data streams (video, audio) have implicit ordering constraints between
> packets.

What is the likelihood of a single data stream consuming the full
bandwidth of a 400G NIC though? As far as splitting up the work goes, most
devices have a means of hashing the packet headers and then splitting up
the work based on flows, called Receive Side Scaling (RSS). That is the
standard way of distributing the work for most NICs.
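
To make the RSS point concrete, a toy sketch of the idea. Real NICs use a
Toeplitz hash over the packet 5-tuple with a programmable key and an
indirection table; the trivial mix below only shows that packets of the
same flow always land on the same queue/core, which preserves per-flow
ordering while spreading distinct flows across cores:

#include <stdio.h>
#include <stdint.h>

struct flow {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* Toy stand-in for the NIC's RSS hash + indirection table lookup. */
static unsigned int queue_for_flow(const struct flow *f, unsigned int nqueues)
{
    uint32_t h = f->saddr ^ f->daddr;

    h ^= ((uint32_t)f->sport << 16) | f->dport;
    h ^= f->proto;
    h *= 0x9e3779b1u;          /* cheap avalanche step */
    return h % nqueues;
}

int main(void)
{
    struct flow a = { 0x0a000001, 0x0a000002, 40000, 4791, 17 };
    struct flow b = { 0x0a000003, 0x0a000002, 40001, 4791, 17 };

    printf("flow a -> rx queue %u\n", queue_for_flow(&a, 16));
    printf("flow b -> rx queue %u\n", queue_for_flow(&b, 16));
    return 0;
}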



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 18:21       ` Christopher Lameter
  2019-02-19 18:42         ` Alexander Duyck
@ 2019-02-19 19:13         ` Michal Hocko
  2019-02-19 20:46           ` Christopher Lameter
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-02-19 19:13 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: lsf-pc, linux-mm

On Tue 19-02-19 18:21:29, Cristopher Lameter wrote:
[...]
> I can make this more concrete by listing some of the approaches that I am
> seeing?

Yes, please. We should have a more specific topic, otherwise I am not
sure a very vague discussion would be of any use.
-- 
Michal Hocko
SUSE Labs



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 19:13         ` Michal Hocko
@ 2019-02-19 20:46           ` Christopher Lameter
  2019-02-20  8:31             ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Christopher Lameter @ 2019-02-19 20:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Tue, 19 Feb 2019, Michal Hocko wrote:

> On Tue 19-02-19 18:21:29, Cristopher Lameter wrote:
> [...]
> > I can make this more concrete by listing some of the approaches that I am
> > seeing?
>
> Yes, please. We should have a more specific topic otherwise I am not
> sure a very vague discussion would be any useful.

I don't like the existing approaches but I can present them?



* Re: Memory management facing a 400Gpbs network link
  2019-02-19 20:46           ` Christopher Lameter
@ 2019-02-20  8:31             ` Michal Hocko
  2019-02-21 18:15               ` Christopher Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-02-20  8:31 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: lsf-pc, linux-mm

On Tue 19-02-19 20:46:34, Cristopher Lameter wrote:
> On Tue, 19 Feb 2019, Michal Hocko wrote:
> 
> > On Tue 19-02-19 18:21:29, Cristopher Lameter wrote:
> > [...]
> > > I can make this more concrete by listing some of the approaches that I am
> > > seeing?
> >
> > Yes, please. We should have a more specific topic otherwise I am not
> > sure a very vague discussion would be any useful.
> 
> I dont like the existing approaches but I can present them?

Please give us at least some rough outline so that we can evaluate a
general interest and see how/whether to schedule such a topic.
-- 
Michal Hocko
SUSE Labs



* Re: Memory management facing a 400Gpbs network link
  2019-02-20  8:31             ` Michal Hocko
@ 2019-02-21 18:15               ` Christopher Lameter
  2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
  2019-02-21 20:13                 ` Jerome Glisse
  0 siblings, 2 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-02-21 18:15 UTC (permalink / raw)
  To: Michal Hocko; +Cc: lsf-pc, linux-mm

On Wed, 20 Feb 2019, Michal Hocko wrote:

> > I dont like the existing approaches but I can present them?
>
> Please give us at least some rough outline so that we can evaluate a
> general interest and see how/whether to schedule such a topic.

OK, I am fuzzy on this one too. Let's give this another shot:

In the HPC world we often have to bypass operating system mechanisms for
full speed. Usually this has been done through accelerators in the network
card, through sharing memory between multiple systems (with NUMA being a
special case of this), or with devices that provide some specialized
memory access. There is a whole issue here with pinned memory access (I
think that is handled in another session at the MM summit).

The intent was typically to bring the data into the system so that an
application can act on it. However, with interconnect speeds increasing to
the point that they may even exceed those of the internal busses on
contemporary platforms, that may have to change, since the processor and
the system as a whole are no longer able to handle the inbound data
stream. This is partially due to I/O bus speeds no longer increasing.

The solutions to this issue coming from some vendors fall mostly into the
following categories:

A) Provide preprocessing in the NIC.

   This can compress data, modify it and direct it to certain cores of
   the system. Preprocessing may also allow multiple hosts to share one
   NIC (which makes sense since a single host may no longer be able to
   handle the data).

B) Provide fast memory in the NIC

   Since the NIC is at its capacity limits when it comes to pushing data
   from the NIC into memory, the obvious solution is not to go to main
   memory but to provide faster on-NIC memory that can then be accessed
   from the host as needed. But now the applications create I/O
   bottlenecks when accessing their data, or they need to implement
   complicated transfer mechanisms to retrieve and store data in the NIC
   memory.

C) Direct passthrough to other devices

   The host I/O bus is used, or another enhanced bus is provided, to
   reach other system components without the constraints imposed by the
   OS or hardware. This means for example that a NIC can directly write
   to an NVMe storage device (e.g. NVMeoF). A NIC can directly exchange
   data with another NIC. In an extreme case a hardware-addressable
   global data fabric exists that is shared between multiple systems, and
   the devices can share memory areas with one another. In the
   ultra-extreme case the bypass even uses the memory channels, since
   non-volatile memory (essentially a storage device) is now supported
   that way.

All of this leads to the development of numerous specialized accelerators
and special mechanisms to access memory on such devices. We already see a
proliferation of various remote memory schemes (HMM, PCI device memory,
etc.).
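
One example of the ad-hoc mechanisms this leads to: device memory (on-NIC
or on-GPU) exposed as a PCI BAR and mapped straight into a process via
sysfs. The device path below is a placeholder, and this is a sketch rather
than a recommendation:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    /* placeholder PCI address and BAR; pick a real device on a real box */
    const char *bar = "/sys/bus/pci/devices/0000:81:00.0/resource2";
    size_t len = 1UL << 20;    /* map the first 1 MiB of the BAR */

    int fd = open(bar, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* This mapping is device memory: no page cache, no reclaim, no NUMA
     * placement; it is essentially invisible to the MM subsystem. */
    volatile uint32_t *dev = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (dev == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("first word of BAR: 0x%x\n", dev[0]);

    munmap((void *)dev, len);
    close(fd);
    return 0;
}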

So how does memory work in the systems of the future? It seems that we
may need some new way of tracking memory that is remote on some device,
in addition to the classic NUMA nodes? Or can we change the existing NUMA
schemes to cover these use cases?

We need some consistent and hopefully vendor-neutral way to work with
memory, I think.





----- Old proposal


400G Infiniband will become available this year. This means that the data
ingest speeds can be higher than the bandwidth of the processor
interacting with its own memory.

For example, a single hardware thread is limited to 20 Gbytes/sec whereas
the network interface provides 50 Gbytes/sec. These rates can currently
only be obtained with pinned memory.

How can we evolve the memory management subsystem to operate at higher
speeds with more of the comforts of paging and system calls that we are
used to?

It is likely that these speeds will increase further, and since the lead
processor vendor seems to be caught in a management-induced corporate
suicide attempt we will not likely see any progress on the processors from
there. The straightforward solution would be to use the high-speed fabric
technology for the internal busses (doh!). Alternative processors are
likely to show up in 2019 and 2020, but those will take a long time to
mature.

So what does the future hold and how do we scale up our HPC systems given
these problems?



* Re: [Lsf-pc] Memory management facing a 400Gpbs network link
  2019-02-21 18:15               ` Christopher Lameter
@ 2019-02-21 18:24                 ` Rik van Riel
  2019-02-21 18:47                   ` Christopher Lameter
  2019-02-21 20:13                 ` Jerome Glisse
  1 sibling, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2019-02-21 18:24 UTC (permalink / raw)
  To: Christopher Lameter, Michal Hocko; +Cc: linux-mm, lsf-pc


On Thu, 2019-02-21 at 18:15 +0000, Christopher Lameter wrote:

> B) Provide fast memory in the NIC
> 
>    Since the NIC is at capacity limits when it comes to pushing data
>    from the NIC into memory the obvious solution is to not go to main
>    memory but provide faster on NIC memory that can then be accessed
>    from the host as needed. Now the applications creates I/O
> bottlenecks
>    when accessing their data or they need to implement complicated
>    transfer mechanisms to retrieve and store data onto the NIC
> memory.

Don't Intel and AMD both have High Bandwidth Memory
available?

Is it possible to place your network buffer in HBM,
and process the data from there?

-- 
All Rights Reversed.



* Re: [Lsf-pc] Memory management facing a 400Gpbs network link
  2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
@ 2019-02-21 18:47                   ` Christopher Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christopher Lameter @ 2019-02-21 18:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Michal Hocko, linux-mm, lsf-pc

On Thu, 21 Feb 2019, Rik van Riel wrote:

> On Thu, 2019-02-21 at 18:15 +0000, Christopher Lameter wrote:
>
> > B) Provide fast memory in the NIC
> >
> >    Since the NIC is at capacity limits when it comes to pushing data
> >    from the NIC into memory the obvious solution is to not go to main
> >    memory but provide faster on NIC memory that can then be accessed
> >    from the host as needed. Now the applications creates I/O
> > bottlenecks
> >    when accessing their data or they need to implement complicated
> >    transfer mechanisms to retrieve and store data onto the NIC
> > memory.
>
> Don't Intel and AMD both have High Bandwidth Memory
> available?

Well, that is another problem that I omitted from the new revision.

Yes, but that memory is special, with different performance
characteristics, and it is often also represented as another NUMA node.

> Is it possible to place your network buffer in HBM,
> and process the data from there?

OK, but there is still the I/O bottleneck. So you can either have the HBM
on the host processor (the Xeon Phi solution) in a special NUMA node, or
you put the HBM onto the NIC and address it via PCIe from the host
processor (which means slower access for the host but fast writes from
the network).
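
For illustration, this is roughly what "HBM as another NUMA node" looks
like to an application today: a sketch using libnuma (link with -lnuma);
the node number is an assumption about one particular box:

#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    int hbm_node = 1;            /* assumption: HBM shows up as node 1 */
    size_t len = 256UL << 20;    /* 256 MiB receive buffer */

    void *buf = numa_alloc_onnode(len, hbm_node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* The application, not the kernel, decides that this buffer belongs
     * in the fast memory; that is the manual placement burden above. */
    numa_free(buf, len);
    return 0;
}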



* Re: Memory management facing a 400Gpbs network link
  2019-02-21 18:15               ` Christopher Lameter
  2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
@ 2019-02-21 20:13                 ` Jerome Glisse
  1 sibling, 0 replies; 14+ messages in thread
From: Jerome Glisse @ 2019-02-21 20:13 UTC (permalink / raw)
  To: Christopher Lameter; +Cc: Michal Hocko, lsf-pc, linux-mm

On Thu, Feb 21, 2019 at 06:15:14PM +0000, Christopher Lameter wrote:
> On Wed, 20 Feb 2019, Michal Hocko wrote:
> 
> > > I dont like the existing approaches but I can present them?
> >
> > Please give us at least some rough outline so that we can evaluate a
> > general interest and see how/whether to schedule such a topic.
> 
> Ok. I am fuzzy on this one too. Lets give this another shot:
> 
> In the HPC world we often have to bypass operating system mechanisms for
> full speed. Usually this has been through accellerators in the network
> card, in sharing memory between multiple systems (with NUMA being a
> special case of this) or with devices that provide some specialized memory
> access. There is a whole issue here with pinned memory access (I think
> that is handled in another session at the MM summit)
> 
> The intend was typically to bring the data into the system so that an
> application can act on it. However, with the increasing speeds of the
> interconnect that may even be faster than the internal busses on
> contemporary platforms that may have to change since the processor and the
> system as a whole is no longer able to handle the inbound data stream.
> This is partially due to the I/O bus speeds no longer increasing.
> 
> The solutions to this issue coming from some vendors are falling
> mostly into the following categories:
> 
> A) Provide preprocessing in the NIC.
> 
>    This can compress data, modify it and direct it to certain cores of
>    the system. Preprocessing may allow multiple hosts to use one NIC
>    (Makes sense since a single host may no longer be able to handle the
>    data).
> 
> B) Provide fast memory in the NIC
> 
>    Since the NIC is at capacity limits when it comes to pushing data
>    from the NIC into memory the obvious solution is to not go to main
>    memory but provide faster on NIC memory that can then be accessed
>    from the host as needed. Now the applications creates I/O bottlenecks
>    when accessing their data or they need to implement complicated
>    transfer mechanisms to retrieve and store data onto the NIC memory.
> 
> C) Direct passthrough to other devices
> 
>    The host I/O bus is used or another enhanced bus is provided to reach
>    other system components without the constraints imposed by the OS or
>    hardware. This means for example that a NIC can directly write to an
>    NVME storage device (f.e. NVMEoF). A NIC can directly exchange data with
>    another NIC. In an extreme case a hardware addressable global data fabric
>    exists that is shared between multiple systems and the devices can
>    share memory areas with one another. In the ultra extreme case there
>    is a bypass  even using the memory channels since non volatile memory
>    (a storage device essentially) is now  supported that way.
> 
> All of this leads to the development of numerous specialized accellerators
> and special mechamisms to access memory on such devices. We already see a
> proliferation of various remote memory schemes (HMM, PCI device memory
> etc)
> 
> So how does memory work in the systems of the future? It seems that we may
> need some new way of tracking memory that is remote on some device in
> additional to the classic NUMA nodes? Or can we change the existing NUMA
> schemes to cover these use cases?
> 
> We need some consistent and hopefully vendor neutral way to work with
> memory I think.

Note that I proposed a topic about that [1]. NUMA is really hard to work
with for device memory, and memory that might not be cache coherent or
might not support atomic operations is not a good idea to report as
regular NUMA, as existing applications might start using such memory
unaware of all its peculiarities.

Anyway, it is definitely a topic I believe we need to discuss, and I
intend to present the problem from the GPU/accelerator point of view (as
today these are the hardware with sizeable fast local memory).

Cheers,
Jérôme

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1904033.html


> 
> 
> 
> 
> 
> ----- Old proposal
> 
> 
> 00G Infiniband will become available this year. This means that the data
> ingest speeds can be higher than the bandwidth of the processor
> interacting with its own memory.
> 
> For example a single hardware thread is limited to 20Gbyte/sec whereas the
> network interface provides 50Gbytes/sec. These rates can only be obtained
> currently with pinned memory.
> 
> How can we evolve the memory management subsystem to operate at higher
> speeds with more the comforts of paging and system calls that we are used
> to?
> 
> It is likely that these speeds with increase further and since the lead
> processor vendor seems to be caught in a management induced corporate
> suicide attempt we will not likely see any process on the processors from
> there. The straightforward solution would be to use the high speed tech
> for fabrics for the internal busses (doh!). Alternate processors are
> likely to show up in 2019 and 2020 but those will take a long time to
> mature.
> 
> So what does the future hold and how do we scale up our HPC systems given
> these problems?
> 




Thread overview: 14+ messages
2019-02-12 18:25 Memory management facing a 400Gpbs network link Christopher Lameter
2019-02-15 16:34 ` Jerome Glisse
2019-02-19 12:26 ` Michal Hocko
2019-02-19 14:21   ` Christopher Lameter
2019-02-19 17:36     ` Michal Hocko
2019-02-19 18:21       ` Christopher Lameter
2019-02-19 18:42         ` Alexander Duyck
2019-02-19 19:13         ` Michal Hocko
2019-02-19 20:46           ` Christopher Lameter
2019-02-20  8:31             ` Michal Hocko
2019-02-21 18:15               ` Christopher Lameter
2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
2019-02-21 18:47                   ` Christopher Lameter
2019-02-21 20:13                 ` Jerome Glisse
