From: Jonathan Cameron <jic23@kernel.org>
To: Paul Cercueil <paul@crapouillou.net>
Cc: "Alexandru Ardelean" <ardeleanalex@gmail.com>,
	"Lars-Peter Clausen" <lars@metafoo.de>,
	"Michael Hennerich" <Michael.Hennerich@analog.com>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Christian König" <christian.koenig@amd.com>,
	linux-iio@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linaro-mm-sig@lists.linaro.org
Subject: Re: [PATCH 11/15] iio: buffer-dma: Boost performance using write-combine cache setting
Date: Sat, 27 Nov 2021 16:05:33 +0000
Message-ID: <20211127160533.5259f486@jic23-huawei>
In-Reply-To: <YX153R.0PENWW3ING7F1@crapouillou.net>

On Thu, 25 Nov 2021 17:29:58 +0000
Paul Cercueil <paul@crapouillou.net> wrote:

> Hi Jonathan,
> 
> On Sun, Nov 21 2021 at 17:43:20 +0000, Paul Cercueil
> <paul@crapouillou.net> wrote:
> > Hi Jonathan,
> > 
> > On Sun, Nov 21 2021 at 15:00:37 +0000, Jonathan Cameron
> > <jic23@kernel.org> wrote:
> >> On Mon, 15 Nov 2021 14:19:21 +0000
> >> Paul Cercueil <paul@crapouillou.net> wrote:
> >> 
> >>> We can be certain that the input buffers will only be accessed by
> >>> userspace for reading, and output buffers will mostly be accessed
> >>> by userspace for writing.
> >> 
> >> Mostly? Perhaps a little more info on why that's not 'only'.
> > 
> > Just like with a framebuffer, it really depends on what the
> > application does. Most of the time it will just read an input buffer
> > sequentially, or write an output buffer sequentially. But then you
> > get the exotic application that will try to do something like alpha
> > blending, which means read+write. Hence "mostly".
> > 
> >>> Therefore, it makes more sense to use only fully cached input
> >>> buffers, and to use the write-combine cache coherency setting for
> >>> output buffers.
> >>> 
> >>> This boosts performance, as the data written to the output buffers
> >>> does not have to be sync'd for coherency. It will halve performance
> >>> if the userspace application tries to read from the output buffer,
> >>> but this should never happen.
> >>> 
> >>> Since we don't need to sync the cache when disabling CPU access,
> >>> either for input buffers or output buffers, the .end_cpu_access()
> >>> callback can be dropped completely.
> >> 
> >> We have an odd mix of coherent and non-coherent DMA in here as you
> >> noted, but are you sure this is safe on all platforms?
> > 
> > The mix isn't safe, but using only coherent or only non-coherent
> > should be safe, yes.
> > 
> >>> Signed-off-by: Paul Cercueil <paul@crapouillou.net>
> >> 
> >> Any numbers to support this patch? The mapping types are performance
> >> optimisations, so it is nice to know how much of a difference they
> >> make.
> > 
> > Output buffers are definitely faster in write-combine mode. On a
> > ZedBoard with an AD9361 transceiver set to 66 MSPS, and buffer/size
> > set to 8192, I would get about 185 MiB/s before, 197 MiB/s after.
> > 
> > Input buffers... early results are mixed. On ARM32 it does look like
> > it is slightly faster to read from *uncached* memory than from cached
> > memory. The cache sync does take a long time.
> > 
> > Other architectures might have a different result; for instance on
> > MIPS invalidating the cache is a very fast operation, so using cached
> > buffers would be a huge win in performance.
> > 
> > Setups where the DMA operations are coherent also wouldn't require
> > any cache sync, and this patch would give a huge win in performance.
> > 
> > I'll run some more tests next week to have some fresh numbers.
> 
> I think I mixed things up before, because I get different results now.
> 
> Here are some fresh benchmarks, triple-checked, using libiio's
> iio_readdev and iio_writedev tools, with 64K-sample buffers at 61.44
> MSPS (max. theoretical throughput: 234 MiB/s):
> 
> iio_readdev -b 65536 cf-ad9361-lpc voltage0 voltage1 | pv > /dev/null
> pv /dev/zero | iio_writedev -b 65536 cf-ad9361-dds-core-lpc voltage0 voltage1

There is a bit of a terminology confusion going on here. I think for
the mappings you mean cacheable vs non-cacheable, but maybe I'm
misunderstanding. That doesn't necessarily correspond to coherency:
non-cached memory is always coherent because all caches miss.

Non-cacheable can be related to coherency of course. Also beware that
given hardware might not implement non-cacheable if it knows all
possible accesses are IO-coherent. The effect is the same, and if
implemented correctly it will not hurt performance significantly.

Firmware should be letting the OS know if the device does coherent DMA
or not... dma-coherent in DT. It might be optional for a given piece of
DMA engine, but I've not seen that.

I'm not sure I see how you can do a mixture of cacheable for reads and
write-combine (which means uncacheable) for writes...
> 
> Coherent mapping:
> - fileio:
>     read:  125 MiB/s
>     write: 141 MiB/s
> - dmabuf:
>     read:  171 MiB/s
>     write: 210 MiB/s
> 
> Coherent reads + write-combine writes:
> - fileio:
>     read:  125 MiB/s
>     write: 141 MiB/s
> - dmabuf:
>     read:  171 MiB/s
>     write: 210 MiB/s
> 
> Non-coherent mapping:
> - fileio:
>     read:  119 MiB/s
>     write: 124 MiB/s
> - dmabuf:
>     read:  159 MiB/s
>     write: 124 MiB/s
> 
> Non-coherent reads + write-combine writes:
> - fileio:
>     read:  119 MiB/s
>     write: 140 MiB/s
> - dmabuf:
>     read:  159 MiB/s
>     write: 210 MiB/s
> 
> Non-coherent mapping with no cache sync:
> - fileio:
>     read:  156 MiB/s
>     write: 123 MiB/s
> - dmabuf:
>     read:  234 MiB/s (capped by sample rate)
>     write: 182 MiB/s
> 
> Non-coherent reads with no cache sync + write-combine writes:
> - fileio:
>     read:  156 MiB/s
>     write: 140 MiB/s
> - dmabuf:
>     read:  234 MiB/s (capped by sample rate)
>     write: 210 MiB/s
> 
> 
> A few things we can deduce from this:
> 
> * Write-combine is not available on Zynq/ARM? If it was working, it
> should give better performance than the coherent mapping, but it
> doesn't seem to do anything at all. At least it doesn't harm
> performance.

I'm not sure it's very relevant to this sort of streaming write. If
you write a sequence of addresses then nothing stops them getting
combined into a single write whether or not the mapping is
write-combining. You may be right that the particular path to memory
doesn't support it anyway. Also, some cache architectures will rapidly
detect streaming writes and elect not to cache them, coherent or not.

> * Non-coherent + cache invalidation is definitely a good deal slower
> than using a coherent mapping, at least on ARM32. However, when the
> cache sync is disabled (e.g. if the DMA operations are coherent) the
> reads are much faster.

If you are running with cache sync then it had better not be cached;
as such it's coherent in the sense of there being no entries in the
cache in either direction.
> 
> * The new dma-buf based API is a great deal faster than the fileio API.

:)

> 
> So in the future we could use coherent reads + write-combine writes,
> unless we know the DMA operations are coherent, and in that case use
> non-coherent reads + write-combine writes.

Not following this argument at all, but anyway we can revisit when it
matters.

> 
> Regarding this patch, unfortunately I cannot prove that write-combine
> is faster, so I'll just drop this patch for now.

Sure, thanks for checking. It's worth noting that WC usage in the
kernel is vanishingly rare, and I suspect that's mostly because it
doesn't do anything on many implementations.

Jonathan

> 
> Cheers,
> -Paul