All of lore.kernel.org
 help / color / mirror / Atom feed
From: Song Bao Hua <song.bao.hua@hisilicon.com>
To: Robin Murphy <robin.murphy@arm.com>,
	"linux@armlinux.org.uk" <linux@armlinux.org.uk>,
	"hch@lst.de" <hch@lst.de>,
	"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>,
	"dagum@barrel.engr.sgi.com" <dagum@barrel.engr.sgi.com>,
	"ralf@oss.sgi.com" <ralf@oss.sgi.com>,
	"grundler@cup.hp.com" <grundler@cup.hp.com>,
	"Jay.Estabrook@compaq.com" <Jay.Estabrook@compaq.com>,
	"sailer@ife.ee.ethz.ch" <sailer@ife.ee.ethz.ch>,
	"andrea@suse.de" <andrea@suse.de>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
	"davidm@hpl.hp.com" <davidm@hpl.hp.com>
Cc: "iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	Linuxarm <linuxarm@huawei.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: RE: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
Date: Fri, 15 May 2020 22:45:47 +0000	[thread overview]
Message-ID: <B926444035E5E2439431908E3842AFD249FA4D@DGGEMI525-MBS.china.huawei.com> (raw)
In-Reply-To: <36d67d68-4381-c7a7-dcf1-6383bd9ae0ad@arm.com>

> Subject: Re: Constantly map and unmap of streaming DMA buffers with
> IOMMU backend might cause serious performance problem
> 
> On 2020-05-15 09:19, Song Bao Hua wrote:
> [ snip... nice analysis, but ultimately it's still "doing stuff has more overhead
> than not doing stuff" ]
> 
> > I am thinking several possible ways on decreasing or removing the latency of
> DMA map/unmap for every single DMA transfer. Meanwhile, "non-strict" as an
> existing option with possible safety issues, I won't discuss it in this mail.
> 
> But passthrough and non-strict mode *specifically exist* for the cases where
> performance is the most important concern - streaming DMA with an IOMMU
> in the middle has an unavoidable tradeoff between performance and isolation,
> so dismissing that out of hand is not a good way to start making this
> argument.

I do understand there is a tradeoff between performance and isolation. However, users might ask for performance while supporting isolation. 
In passthrough mode, the whole memory might be accessible by DMA. In non-strict mode, a buffer could be still mapped in IOMMU when users have returned it to buddy and the buffer has even been allocated by another user. 

> 
> > 1. provide bounce coherent buffers for streaming buffers.
> > As the coherent buffers keep the status of mapping, we can remove the
> overhead of map and unmap for each single DMA operations. However, this
> solution requires memory copy between stream buffers and bounce buffers.
> Thus it will work only if copy is faster than map/unmap. Meanwhile, it will
> consume much more memory bandwidth.
> 
> I'm struggling to understand how that would work, can you explain it in more
> detail?

lower-layer drivers maintain some reusable coherent buffers.
For TX path, drivers copy streaming buffer to coherent buffer, then do DMA;
For RX path, drivers do DMA in coherent buffer, then copy to streaming buffer.

> 
> > 2.make upper-layer kernel components aware of the pain of iommu
> > map/unmap upper-layer fs, mm, networks can somehow let the lower-layer
> drivers know the end of the life cycle of sg buffers. In zswap case, I have seen
> zswap always use the same 2 pages as the destination buffers to save
> compressed page, but the compressor driver still has to constantly map and
> unmap those same two pages for every single compression since zswap and zip
> drivers are working in two completely different software layers.
> >
> > I am thinking some way as below, upper-layer kernel code can call:
> > sg_init_table(&sg...);
> > sg_mark_reusable(&sg....);
> > .... /* use the buffer many times */
> > ....
> > sg_mark_stop_reuse(&sg);
> >
> > After that, if low level drivers see "reusable" flag, it will realize the buffer can
> be used multiple times and will not do map/unmap every time. it means
> upper-layer components will further use the buffers and the same buffers will
> probably be given to lower-layer drivers for new DMA transfer later. When
> upper-layer code sets " stop_reuse", lower-layer driver will unmap the sg
> buffers, possibly by providing a unmap-callback to upper-layer components.
> For zswap case, I have seen the same buffers are always re-used and zip driver
> maps and unmaps it again and again. Shortly after the buffer is unmapped, it
> will be mapped in the next transmission, almost without any time gap
> between unmap and map. In case zswap can set the "reusable" flag, zip driver
> will save a lot of time.
> > Meanwhile, for the safety of buffers, lower-layer drivers need to make certain
> the buffers have already been unmapped in iommu before those buffers go
> back to buddy for other users.
> 
> That sounds like it would only have benefit in a very small set of specific
> circumstances, and would be very difficult to generalise to buffers that are
> mapped via dma_map_page() or dma_map_single().

Yes, indeed. Hopefully the small set of specific circumstances will encourage more upper-layer consumers to reuse buffers, then the "reusable" flag can extend to more common cases, such as page and single buffer.

> Furthermore, a high-level API that affects a low-level driver's interpretation of
> mid-layer API calls without the mid-layer's knowledge sounds like a hideous
> abomination of anti-design. If a mid-layer API lends itself to inefficiency at the
> lower level, it would seem a lot cleaner and more robust to extend *that* API
> for stateful buffer reuse.

Absolutely agree. I didn't say the method is elegant. For this moment, maybe "reuse" can get started from a small case like zswap. After some while, it is possible more users are encouraged to do some optimization for buffer reuse, understanding the suffering of lower-layer drivers. Then those performance problems might be solved case by case.
On the other hand, it is always the freedom of upper-layer code to indicate "reuse" or not. If they don't say anything about reuse, lower-layer drivers can simply do map and unmap.

> Failing that, it might possibly be appropriate to approach this at the driver
> level - many of the cleverer network drivers already implement buffer pools to
> recycle mapped SKBs internally, couldn't the "zip driver" simply try doing
> something like that for itself?

are the buffer pools for RX path? For TX path, buffers come from upper-layer so network drivers can't do anything for recycling SKBs?

> 
> Robin.

Thanks
Barry
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

WARNING: multiple messages have this Message-ID
From: Song Bao Hua <song.bao.hua@hisilicon.com>
To: Robin Murphy <robin.murphy@arm.com>,
	"linux@armlinux.org.uk" <linux@armlinux.org.uk>,
	"hch@lst.de" <hch@lst.de>,
	"m.szyprowski@samsung.com" <m.szyprowski@samsung.com>,
	"dagum@barrel.engr.sgi.com" <dagum@barrel.engr.sgi.com>,
	"ralf@oss.sgi.com" <ralf@oss.sgi.com>,
	"grundler@cup.hp.com" <grundler@cup.hp.com>,
	"Jay.Estabrook@compaq.com" <Jay.Estabrook@compaq.com>,
	"sailer@ife.ee.ethz.ch" <sailer@ife.ee.ethz.ch>,
	"andrea@suse.de" <andrea@suse.de>,
	"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
	"davidm@hpl.hp.com" <davidm@hpl.hp.com>
Cc: "iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	Linuxarm <linuxarm@huawei.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: RE: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
Date: Fri, 15 May 2020 22:45:47 +0000	[thread overview]
Message-ID: <B926444035E5E2439431908E3842AFD249FA4D@DGGEMI525-MBS.china.huawei.com> (raw)
In-Reply-To: <36d67d68-4381-c7a7-dcf1-6383bd9ae0ad@arm.com>

> Subject: Re: Constantly map and unmap of streaming DMA buffers with
> IOMMU backend might cause serious performance problem
> 
> On 2020-05-15 09:19, Song Bao Hua wrote:
> [ snip... nice analysis, but ultimately it's still "doing stuff has more overhead
> than not doing stuff" ]
> 
> > I am thinking several possible ways on decreasing or removing the latency of
> DMA map/unmap for every single DMA transfer. Meanwhile, "non-strict" as an
> existing option with possible safety issues, I won't discuss it in this mail.
> 
> But passthrough and non-strict mode *specifically exist* for the cases where
> performance is the most important concern - streaming DMA with an IOMMU
> in the middle has an unavoidable tradeoff between performance and isolation,
> so dismissing that out of hand is not a good way to start making this
> argument.

I do understand there is a tradeoff between performance and isolation. However, users might ask for performance while supporting isolation. 
In passthrough mode, the whole memory might be accessible by DMA. In non-strict mode, a buffer could be still mapped in IOMMU when users have returned it to buddy and the buffer has even been allocated by another user. 

> 
> > 1. provide bounce coherent buffers for streaming buffers.
> > As the coherent buffers keep the status of mapping, we can remove the
> overhead of map and unmap for each single DMA operations. However, this
> solution requires memory copy between stream buffers and bounce buffers.
> Thus it will work only if copy is faster than map/unmap. Meanwhile, it will
> consume much more memory bandwidth.
> 
> I'm struggling to understand how that would work, can you explain it in more
> detail?

lower-layer drivers maintain some reusable coherent buffers.
For TX path, drivers copy streaming buffer to coherent buffer, then do DMA;
For RX path, drivers do DMA in coherent buffer, then copy to streaming buffer.

> 
> > 2.make upper-layer kernel components aware of the pain of iommu
> > map/unmap upper-layer fs, mm, networks can somehow let the lower-layer
> drivers know the end of the life cycle of sg buffers. In zswap case, I have seen
> zswap always use the same 2 pages as the destination buffers to save
> compressed page, but the compressor driver still has to constantly map and
> unmap those same two pages for every single compression since zswap and zip
> drivers are working in two completely different software layers.
> >
> > I am thinking some way as below, upper-layer kernel code can call:
> > sg_init_table(&sg...);
> > sg_mark_reusable(&sg....);
> > .... /* use the buffer many times */
> > ....
> > sg_mark_stop_reuse(&sg);
> >
> > After that, if low level drivers see "reusable" flag, it will realize the buffer can
> be used multiple times and will not do map/unmap every time. it means
> upper-layer components will further use the buffers and the same buffers will
> probably be given to lower-layer drivers for new DMA transfer later. When
> upper-layer code sets " stop_reuse", lower-layer driver will unmap the sg
> buffers, possibly by providing a unmap-callback to upper-layer components.
> For zswap case, I have seen the same buffers are always re-used and zip driver
> maps and unmaps it again and again. Shortly after the buffer is unmapped, it
> will be mapped in the next transmission, almost without any time gap
> between unmap and map. In case zswap can set the "reusable" flag, zip driver
> will save a lot of time.
> > Meanwhile, for the safety of buffers, lower-layer drivers need to make certain
> the buffers have already been unmapped in iommu before those buffers go
> back to buddy for other users.
> 
> That sounds like it would only have benefit in a very small set of specific
> circumstances, and would be very difficult to generalise to buffers that are
> mapped via dma_map_page() or dma_map_single().

Yes, indeed. Hopefully the small set of specific circumstances will encourage more upper-layer consumers to reuse buffers, then the "reusable" flag can extend to more common cases, such as page and single buffer.

> Furthermore, a high-level API that affects a low-level driver's interpretation of
> mid-layer API calls without the mid-layer's knowledge sounds like a hideous
> abomination of anti-design. If a mid-layer API lends itself to inefficiency at the
> lower level, it would seem a lot cleaner and more robust to extend *that* API
> for stateful buffer reuse.

Absolutely agree. I didn't say the method is elegant. For this moment, maybe "reuse" can get started from a small case like zswap. After some while, it is possible more users are encouraged to do some optimization for buffer reuse, understanding the suffering of lower-layer drivers. Then those performance problems might be solved case by case.
On the other hand, it is always the freedom of upper-layer code to indicate "reuse" or not. If they don't say anything about reuse, lower-layer drivers can simply do map and unmap.

> Failing that, it might possibly be appropriate to approach this at the driver
> level - many of the cleverer network drivers already implement buffer pools to
> recycle mapped SKBs internally, couldn't the "zip driver" simply try doing
> something like that for itself?

are the buffer pools for RX path? For TX path, buffers come from upper-layer so network drivers can't do anything for recycling SKBs?

> 
> Robin.

Thanks
Barry
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2020-05-15 22:46 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-15  8:19 Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem Song Bao Hua
2020-05-15  8:19 ` Song Bao Hua
2020-05-15 12:10 ` Robin Murphy
2020-05-15 12:10   ` Robin Murphy
2020-05-15 14:45   ` hch
2020-05-15 14:45     ` hch
2020-05-15 21:33     ` Song Bao Hua
2020-05-15 21:33       ` Song Bao Hua
2020-05-15 22:12       ` Robin Murphy
2020-05-15 22:12         ` Robin Murphy
2020-05-15 22:45   ` Song Bao Hua [this message]
2020-05-15 22:45     ` Song Bao Hua

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=B926444035E5E2439431908E3842AFD249FA4D@DGGEMI525-MBS.china.huawei.com \
    --to=song.bao.hua@hisilicon.com \
    --cc=Jay.Estabrook@compaq.com \
    --cc=andrea@suse.de \
    --cc=dagum@barrel.engr.sgi.com \
    --cc=davidm@hpl.hp.com \
    --cc=grundler@cup.hp.com \
    --cc=hch@lst.de \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jens.axboe@oracle.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxarm@huawei.com \
    --cc=m.szyprowski@samsung.com \
    --cc=ralf@oss.sgi.com \
    --cc=robin.murphy@arm.com \
    --cc=sailer@ife.ee.ethz.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.