From: Alexander Duyck <alexander.duyck@gmail.com>
To: Christopher Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	lsf-pc@lists.linux-foundation.org,  linux-mm <linux-mm@kvack.org>
Subject: Re: Memory management facing a 400Gbps network link
Date: Tue, 19 Feb 2019 10:42:25 -0800	[thread overview]
Message-ID: <CAKgT0UevknPT5HoQMrGW9Y8Ohpf=9G7tvMwWxYEhiz2fKHS+aQ@mail.gmail.com> (raw)
In-Reply-To: <0100016906fdc80b-4471de43-3f22-45ec-8f77-f2ff1b76d9fe-000000@email.amazonses.com>

On Tue, Feb 19, 2019 at 10:21 AM Christopher Lameter <cl@linux.com> wrote:
>
> On Tue, 19 Feb 2019, Michal Hocko wrote:
>
> > > Well the hardware is one problem. The problem that a single core cannot
> > > handle the full memory bandwidth can be solved by spreading the
> > > processing of the data to multiple processors. So I think the memory
> > > subsystem could be aware of that? How do we load balance between cores so
> > > that we can handle the full bandwidth?
> >
> > Isn't that something that people already do from userspace?
>
> Yes. We can certainly do a lot from userspace manually, but this is hard
> and involves working around memory management to some extent. The higher
> the I/O bandwidth becomes, the less useful memory management becomes.
>
> Can we improve the situation? A VM with a 2M base page size, for
> example, has been discussed repeatedly.
>
> Or some kind of memory management extension that allows working with
> large contiguous blocks of memory? These are problematic in their own
> right, because large contiguous blocks may not be obtainable due to
> fragmentation; hence the need to reboot the system if the load changes.
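
For concreteness, here is a minimal userspace sketch (an illustration, not
a proposal) of what working with large contiguous blocks looks like today:
mapping a buffer backed by 2MB huge pages. The region size is an arbitrary
assumption, and the mapping simply fails when no huge pages are available;
growing the pool at runtime is exactly where fragmentation bites.

/*
 * Minimal sketch: back a buffer with 2MB huge pages via mmap(MAP_HUGETLB).
 * Assumes huge pages were reserved beforehand (e.g. vm.nr_hugepages);
 * the region size is an arbitrary illustrative choice.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL * 1024 * 1024)
#define REGION_LEN      (8 * HPAGE_SIZE)        /* eight 2MB huge pages */

int main(void)
{
        void *buf = mmap(NULL, REGION_LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (buf == MAP_FAILED) {
                /* No huge pages left in the pool (or none reserved at all). */
                perror("mmap(MAP_HUGETLB)");
                return EXIT_FAILURE;
        }

        ((char *)buf)[0] = 1;   /* touch the region to fault a huge page in */

        munmap(buf, REGION_LEN);
        return EXIT_SUCCESS;
}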
>
> > > The other is that the memory needs to be pinned, and all sorts of
> > > special measures and tuning need to be applied to make this actually
> > > work. Is there any way to simplify this?
> > >
> > > The need for page pinning also becomes a problem, since the majority
> > > of the memory of the system would need to be pinned. At that point the
> > > application is effectively doing the memory management itself, isn't it?
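
To make the pinning concern concrete, a minimal sketch of the userspace
side: locking a large receive buffer so it stays resident before handing
it to the device. The 1GB size is an arbitrary assumption, and mlock()
only keeps pages in RAM; the actual DMA pinning is done in the kernel when
the buffer is registered with the NIC or RDMA library.

/*
 * Minimal sketch: lock a large buffer in RAM before registering it with a
 * device. Locking most of system memory this way is the scaling problem
 * discussed above.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 30;         /* 1GB of receive buffers */
        void *buf = malloc(len);

        if (!buf) {
                perror("malloc");
                return EXIT_FAILURE;
        }

        if (mlock(buf, len)) {          /* limited by RLIMIT_MEMLOCK */
                perror("mlock");
                free(buf);
                return EXIT_FAILURE;
        }

        /* ... register buf with the device here, then run the data path ... */

        munlock(buf, len);
        free(buf);
        return EXIT_SUCCESS;
}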
> >
> > I am sorry but this still sounds too vague. There are certainly
> > possibilities to handle part of the MM functionality in userspace.
> > But why should we discuss that at the MM track? Do you envision any
> > in-kernel changes that would be needed?
>
> Without adapting to these trends, memory management may become just a
> part of the system that is mainly useful for running executables,
> handling configuration files and so on, but not for handling the data
> going through the system.
>
> We end up with data fully bypassing the kernel. It is difficult to
> handle things that way.
>
> Sorry, this is fuzzy. I wonder if there are solutions other than the
> ones I know of for these issues. The solutions I know of mostly mean
> going directly to the hardware, because the required performance is just
> not achievable if the kernel is involved. If that is unavoidable, then
> we need clean APIs to be able to carve out memory for these needs.
>
> Let me make this more concrete by listing some of the approaches that I
> am seeing.
>
> For example:
>
> A 400G NIC has the ability to route traffic to certain endpoints on
> specific cores. The traffic volume can thus be segmented into multiple
> streams that can each be handled by a single core. However, many data
> streams (video, audio) have implicit ordering constraints between
> packets.

What is the likelihood of a single data stream consuming the full
bandwidth of a 400G NIC, though? As for splitting up the work, most
devices can hash the packet headers and distribute the work by flow, a
mechanism called Receive Side Scaling (RSS). That is the standard way
most NICs spread receive processing across cores.
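
To illustrate the mechanism, a minimal sketch of RSS-style flow steering:
hash the 5-tuple and use the result to pick a per-core receive queue. The
hash function and queue count below are simplified assumptions; real NICs
use a Toeplitz hash with a programmable key and indirection table. Because
every packet of a flow hashes to the same queue, ordering within a flow is
preserved.

/*
 * Minimal sketch of RSS-style flow steering. All packets of one flow map
 * to the same queue (and hence the same core), so per-flow ordering is
 * preserved while different flows spread across cores.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_RX_QUEUES 16        /* assumed number of per-core RX queues */

struct flow_tuple {
        uint32_t saddr, daddr;  /* IPv4 source/destination address */
        uint16_t sport, dport;  /* L4 source/destination port */
        uint8_t  proto;         /* L4 protocol */
};

/* Toy hash for illustration, not the Toeplitz hash real NICs implement. */
static uint32_t flow_hash(const struct flow_tuple *t)
{
        uint32_t h = t->saddr ^ t->daddr;

        h ^= ((uint32_t)t->sport << 16) | t->dport;
        h ^= t->proto;
        h *= 0x9e3779b1u;       /* multiply to mix the bits */
        return h;
}

static unsigned int rss_queue(const struct flow_tuple *t)
{
        return flow_hash(t) % NUM_RX_QUEUES;
}

int main(void)
{
        struct flow_tuple t = { 0x0a000001, 0x0a000002, 12345, 443, 6 };

        printf("flow -> rx queue %u\n", rss_queue(&t));
        return 0;
}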



Thread overview: 14+ messages
2019-02-12 18:25 Memory management facing a 400Gbps network link Christopher Lameter
2019-02-15 16:34 ` Jerome Glisse
2019-02-19 12:26 ` Michal Hocko
2019-02-19 14:21   ` Christopher Lameter
2019-02-19 17:36     ` Michal Hocko
2019-02-19 18:21       ` Christopher Lameter
2019-02-19 18:42         ` Alexander Duyck [this message]
2019-02-19 19:13         ` Michal Hocko
2019-02-19 20:46           ` Christopher Lameter
2019-02-20  8:31             ` Michal Hocko
2019-02-21 18:15               ` Christopher Lameter
2019-02-21 18:24                 ` [Lsf-pc] " Rik van Riel
2019-02-21 18:47                   ` Christopher Lameter
2019-02-21 20:13                 ` Jerome Glisse
