From: Jerome Glisse <jglisse@redhat.com>
To: Christopher Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org
Subject: Re: Memory management facing a 400Gbps network link
Date: Thu, 21 Feb 2019 15:13:09 -0500	[thread overview]
Message-ID: <20190221201309.GA5201@redhat.com> (raw)
In-Reply-To: <010001691144c94b-c935fd1d-9c90-40a5-9763-2c05ef0df7f4-000000@email.amazonses.com>

On Thu, Feb 21, 2019 at 06:15:14PM +0000, Christopher Lameter wrote:
> On Wed, 20 Feb 2019, Michal Hocko wrote:
> 
> > > I don't like the existing approaches but I can present them?
> >
> > Please give us at least some rough outline so that we can evaluate a
> > general interest and see how/whether to schedule such a topic.
> 
> Ok. I am fuzzy on this one too. Let's give this another shot:
> 
> In the HPC world we often have to bypass operating system mechanisms for
> full speed. Usually this has been done through accelerators in the network
> card, through memory shared between multiple systems (with NUMA being a
> special case of this), or through devices that provide some specialized
> memory access. There is a whole issue here with pinned memory access
> (I think that is handled in another session at the MM summit).
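
For reference, the usual bypass pattern pins and registers the buffer with
the NIC up front. A minimal libibverbs sketch of that pattern, purely for
illustration and assuming an already opened protection domain pd:

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Illustrative helper: set up a receive buffer the NIC can DMA into
 * directly.  ibv_reg_mr() pins the pages for the lifetime of the
 * memory region, which is exactly the pinned-memory issue mentioned
 * above. */
static struct ibv_mr *register_rx_buffer(struct ibv_pd *pd, size_t len)
{
        void *buf;

        if (posix_memalign(&buf, 4096, len))
                return NULL;

        /* The pages backing 'buf' stay pinned and unmovable until
         * ibv_dereg_mr() is called. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}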
> 
> The intent was typically to bring the data into the system so that an
> application can act on it. However, with interconnect speeds that may
> even exceed those of the internal busses on contemporary platforms, that
> may have to change: the processor and the system as a whole are no
> longer able to handle the inbound data stream. This is partially due to
> I/O bus speeds no longer increasing.
> 
> The solutions to this issue coming from some vendors fall mostly into
> the following categories:
> 
> A) Provide preprocessing in the NIC.
> 
>    This can compress data, modify it, and direct it to certain cores of
>    the system. Preprocessing may also allow multiple hosts to share one
>    NIC (which makes sense, since a single host may no longer be able to
>    handle the data).
> 
> B) Provide fast memory in the NIC
> 
>    Since the NIC is at its capacity limit when it comes to pushing data
>    into memory, the obvious solution is not to go to main memory at all
>    but to provide faster on-NIC memory that can then be accessed from
>    the host as needed. Applications then either create I/O bottlenecks
>    when accessing their data, or they need to implement complicated
>    transfer mechanisms to retrieve data from and store data onto the
>    NIC memory (a userspace sketch of this access pattern follows the
>    list below).
> 
> C) Direct passthrough to other devices
> 
>    The host I/O bus is used, or another enhanced bus is provided, to
>    reach other system components without the constraints imposed by the
>    OS or host hardware. This means, for example, that a NIC can write
>    directly to an NVMe storage device (e.g. NVMe-oF), or that one NIC
>    can exchange data directly with another NIC. In an extreme case a
>    hardware-addressable global data fabric is shared between multiple
>    systems, and the devices can share memory areas with one another. In
>    the ultra-extreme case the bypass even goes through the memory
>    channels, since non-volatile memory (essentially a storage device)
>    is now supported that way.
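
To make (B) and (C) a bit more concrete, here is a rough userspace
sketch, purely illustrative: the PCI address, BAR size and access pattern
are made-up placeholders, and the peer-to-peer registration only works if
the verbs provider and the drivers involved actually support it.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>
#include <infiniband/verbs.h>

/* All names and numbers below are made up for illustration. */
#define BAR_PATH "/sys/bus/pci/devices/0000:3b:00.0/resource0"
#define BAR_SIZE (16UL << 20)   /* assume a 16 MiB window */

/* (B): map the device's on-board memory into the host address space.
 * CPU loads and stores then go over the I/O bus straight to the
 * device; the data never lands in host DRAM. */
static void *map_device_memory(void)
{
        int fd = open(BAR_PATH, O_RDWR | O_SYNC);
        void *p;

        if (fd < 0)
                return NULL;
        p = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
}

/* (C): if (and only if) the verbs provider supports registering peer
 * device memory, the same mapping could be handed to a NIC so that it
 * DMAs straight into the device, bypassing host memory entirely. */
static struct ibv_mr *register_peer_memory(struct ibv_pd *pd, void *bar)
{
        return ibv_reg_mr(pd, bar, BAR_SIZE,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}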
> 
> All of this leads to the development of numerous specialized
> accelerators and special mechanisms to access memory on such devices.
> We already see a proliferation of various remote memory schemes (HMM,
> PCI device memory, etc.).
> 
> So how does memory work in the systems of the future? It seems that we
> may need some new way of tracking memory that is remote on some device,
> in addition to the classic NUMA nodes. Or can we change the existing
> NUMA schemes to cover these use cases?
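
If such device memory were simply exposed as another NUMA node, the
existing mempolicy interface would presumably be how userspace targets
it. A hypothetical sketch (the node number is made up, and this is not a
claim that the kernel should expose the memory this way):

#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>

/* Hypothetical: suppose the device memory showed up as NUMA node 2.
 * Existing tooling (numactl, libnuma, mbind()) would then "just work",
 * which is both the attraction and the danger discussed here. */
static void *alloc_on_device_node(size_t len, int device_node)
{
        unsigned long nodemask = 1UL << device_node;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;

        /* Bind the range strictly to that node; page faults are then
         * satisfied from the node's memory (if the kernel exposes it). */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
                munmap(p, len);
                return NULL;
        }
        return p;
}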
> 
> I think we need some consistent and hopefully vendor-neutral way to
> work with this memory.

Note that I proposed a topic about this [1]. NUMA is really hard to work
with for device memory, and memory that might not be cache coherent or
might not support atomic operations is not a good idea to report as a
regular NUMA node, as existing applications might start using such memory
unaware of all its peculiarities.
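
To illustrate the awareness problem (this is just a sketch, not a
proposal): about the only thing an unmodified application can do today
is ask after the fact which node its pages landed on, for instance with
move_pages(); nothing tells it that a node is not cache coherent or
lacks atomics.

#include <numaif.h>

/* Query which NUMA node one page of 'buf' currently lives on.  With
 * the 'nodes' argument NULL, move_pages() only reports placement and
 * does not migrate anything. */
static int node_of_page(void *buf)
{
        void *pages[1] = { buf };
        int status[1] = { -1 };

        if (move_pages(0 /* self */, 1, pages, NULL, status, 0))
                return -1;
        return status[0];
}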

Anyway, it is definitely a topic I believe we need to discuss, and I
intend to present the problem from the GPU/accelerator point of view (as
today these are the hardware with sizeable fast local memory).

Cheers,
Jérôme

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1904033.html


> 
> ----- Old proposal
> 
> 
> 400Gb Infiniband will become available this year. This means that data
> ingest speeds can be higher than the bandwidth of the processor
> interacting with its own memory.
> 
> For example, a single hardware thread is limited to about 20 GByte/sec,
> whereas the network interface provides 50 GByte/sec. These rates can
> currently only be obtained with pinned memory.
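
Back of the envelope, from the numbers quoted above: 400 Gbit/s is
50 GByte/s of payload, so at roughly 20 GByte/s per hardware thread it
already takes two to three threads just to copy or touch the incoming
stream once, before any protocol or application work is done.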
> 
> How can we evolve the memory management subsystem to operate at these
> higher speeds while keeping more of the comforts of paging and system
> calls that we are used to?
> 
> It is likely that these speeds will increase further, and since the lead
> processor vendor seems to be caught in a management-induced corporate
> suicide attempt, we will not likely see much progress on the processors
> from there. The straightforward solution would be to use the high-speed
> fabric technology for the internal busses as well (doh!). Alternative
> processors are likely to show up in 2019 and 2020, but those will take
> a long time to mature.
> 
> So what does the future hold and how do we scale up our HPC systems given
> these problems?
> 


Thread overview: 14+ messages
2019-02-12 18:25 Memory management facing a 400Gbps network link Christopher Lameter
2019-02-15 16:34 ` Jerome Glisse
2019-02-19 12:26 ` Michal Hocko
2019-02-19 14:21   ` Christopher Lameter
2019-02-19 17:36     ` Michal Hocko
2019-02-19 18:21       ` Christopher Lameter
2019-02-19 18:42         ` Alexander Duyck
2019-02-19 19:13         ` Michal Hocko
2019-02-19 20:46           ` Christopher Lameter
2019-02-20  8:31             ` Michal Hocko
2019-02-21 18:15               ` Christopher Lameter
2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
2019-02-21 18:47                   ` Christopher Lameter
2019-02-21 20:13                 ` Jerome Glisse [this message]
