From: Leo Yan <leo.yan@linaro.org> To: Mathieu Poirier <mathieu.poirier@linaro.org>, Suzuki K Poulose <suzuki.poulose@arm.com>, Mike Leach <mike.leach@linaro.org>, Robin Murphy <robin.murphy@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, coresight@lists.linaro.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Cc: Leo Yan <leo.yan@linaro.org> Subject: [PATCH v4] coresight: tmc-etr: Speed up for bounce buffer in flat mode Date: Sun, 5 Sep 2021 11:21:44 +0800 [thread overview] Message-ID: <20210905032144.966766-1-leo.yan@linaro.org> (raw) The AUX bounce buffer is allocated with API dma_alloc_coherent(), in the low level's architecture code, e.g. for Arm64, it maps the memory with the attribution "Normal non-cacheable"; this can be concluded from the definition for pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h. Later when access the AUX bounce buffer, since the memory mapping is non-cacheable, it's low efficiency due to every load instruction must reach out DRAM. This patch changes to allocate pages with dma_alloc_noncoherent(), the driver can access the memory via cacheable mapping; therefore, load instructions can fetch data from cache lines rather than always read data from DRAM, the driver can boost memory performance. After using the cacheable mapping, the driver uses dma_sync_single_for_cpu() to invalidate cacheline prior to read bounce buffer so can avoid read stale trace data. By measurement the duration for function tmc_update_etr_buffer() with ftrace function_graph tracer, it shows the performance significant improvement for copying 4MiB data from bounce buffer: # echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise # echo tmc_update_etr_buffer > set_graph_function # echo function_graph > current_tracer before: # CPU DURATION FUNCTION CALLS # | | | | | | | 2) | tmc_update_etr_buffer() { ... 2) # 8148.320 us | } after: # CPU DURATION FUNCTION CALLS # | | | | | | | 2) | tmc_update_etr_buffer() { ... 2) # 2525.420 us | } Signed-off-by: Leo Yan <leo.yan@linaro.org> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> --- Changes from v3: Refined change to use dma_alloc_noncoherent()/dma_free_noncoherent() (Robin Murphy); Retested functionality and performance on Juno-r2 board. Changes from v2: Sync the entire buffer in one go when the tracing is wrap around (Suzuki); Add Suzuki's review tage. .../hwtracing/coresight/coresight-tmc-etr.c | 26 ++++++++++++++++--- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c index acdb59e0e661..a049b525a274 100644 --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c @@ -609,8 +609,9 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata, if (!flat_buf) return -ENOMEM; - flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size, - &flat_buf->daddr, GFP_KERNEL); + flat_buf->vaddr = dma_alloc_noncoherent(real_dev, etr_buf->size, + &flat_buf->daddr, + DMA_FROM_DEVICE, GFP_KERNEL); if (!flat_buf->vaddr) { kfree(flat_buf); return -ENOMEM; @@ -631,14 +632,18 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf) if (flat_buf && flat_buf->daddr) { struct device *real_dev = flat_buf->dev->parent; - dma_free_coherent(real_dev, flat_buf->size, - flat_buf->vaddr, flat_buf->daddr); + dma_free_noncoherent(real_dev, etr_buf->size, + flat_buf->vaddr, flat_buf->daddr, + DMA_FROM_DEVICE); } kfree(flat_buf); } static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp) { + struct etr_flat_buf *flat_buf = etr_buf->private; + struct device *real_dev = flat_buf->dev->parent; + /* * Adjust the buffer to point to the beginning of the trace data * and update the available trace data. @@ -648,6 +653,19 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp) etr_buf->len = etr_buf->size; else etr_buf->len = rwp - rrp; + + /* + * The driver always starts tracing at the beginning of the buffer, + * the only reason why we would get a wrap around is when the buffer + * is full. Sync the entire buffer in one go for this case. + */ + if (etr_buf->offset + etr_buf->len > etr_buf->size) + dma_sync_single_for_cpu(real_dev, flat_buf->daddr, + etr_buf->size, DMA_FROM_DEVICE); + else + dma_sync_single_for_cpu(real_dev, + flat_buf->daddr + etr_buf->offset, + etr_buf->len, DMA_FROM_DEVICE); } static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf, -- 2.25.1
WARNING: multiple messages have this Message-ID (diff)
From: Leo Yan <leo.yan@linaro.org> To: Mathieu Poirier <mathieu.poirier@linaro.org>, Suzuki K Poulose <suzuki.poulose@arm.com>, Mike Leach <mike.leach@linaro.org>, Robin Murphy <robin.murphy@arm.com>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, coresight@lists.linaro.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org Cc: Leo Yan <leo.yan@linaro.org> Subject: [PATCH v4] coresight: tmc-etr: Speed up for bounce buffer in flat mode Date: Sun, 5 Sep 2021 11:21:44 +0800 [thread overview] Message-ID: <20210905032144.966766-1-leo.yan@linaro.org> (raw) The AUX bounce buffer is allocated with API dma_alloc_coherent(), in the low level's architecture code, e.g. for Arm64, it maps the memory with the attribution "Normal non-cacheable"; this can be concluded from the definition for pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h. Later when access the AUX bounce buffer, since the memory mapping is non-cacheable, it's low efficiency due to every load instruction must reach out DRAM. This patch changes to allocate pages with dma_alloc_noncoherent(), the driver can access the memory via cacheable mapping; therefore, load instructions can fetch data from cache lines rather than always read data from DRAM, the driver can boost memory performance. After using the cacheable mapping, the driver uses dma_sync_single_for_cpu() to invalidate cacheline prior to read bounce buffer so can avoid read stale trace data. By measurement the duration for function tmc_update_etr_buffer() with ftrace function_graph tracer, it shows the performance significant improvement for copying 4MiB data from bounce buffer: # echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise # echo tmc_update_etr_buffer > set_graph_function # echo function_graph > current_tracer before: # CPU DURATION FUNCTION CALLS # | | | | | | | 2) | tmc_update_etr_buffer() { ... 2) # 8148.320 us | } after: # CPU DURATION FUNCTION CALLS # | | | | | | | 2) | tmc_update_etr_buffer() { ... 2) # 2525.420 us | } Signed-off-by: Leo Yan <leo.yan@linaro.org> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> --- Changes from v3: Refined change to use dma_alloc_noncoherent()/dma_free_noncoherent() (Robin Murphy); Retested functionality and performance on Juno-r2 board. Changes from v2: Sync the entire buffer in one go when the tracing is wrap around (Suzuki); Add Suzuki's review tage. .../hwtracing/coresight/coresight-tmc-etr.c | 26 ++++++++++++++++--- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c index acdb59e0e661..a049b525a274 100644 --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c @@ -609,8 +609,9 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata, if (!flat_buf) return -ENOMEM; - flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size, - &flat_buf->daddr, GFP_KERNEL); + flat_buf->vaddr = dma_alloc_noncoherent(real_dev, etr_buf->size, + &flat_buf->daddr, + DMA_FROM_DEVICE, GFP_KERNEL); if (!flat_buf->vaddr) { kfree(flat_buf); return -ENOMEM; @@ -631,14 +632,18 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf) if (flat_buf && flat_buf->daddr) { struct device *real_dev = flat_buf->dev->parent; - dma_free_coherent(real_dev, flat_buf->size, - flat_buf->vaddr, flat_buf->daddr); + dma_free_noncoherent(real_dev, etr_buf->size, + flat_buf->vaddr, flat_buf->daddr, + DMA_FROM_DEVICE); } kfree(flat_buf); } static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp) { + struct etr_flat_buf *flat_buf = etr_buf->private; + struct device *real_dev = flat_buf->dev->parent; + /* * Adjust the buffer to point to the beginning of the trace data * and update the available trace data. @@ -648,6 +653,19 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp) etr_buf->len = etr_buf->size; else etr_buf->len = rwp - rrp; + + /* + * The driver always starts tracing at the beginning of the buffer, + * the only reason why we would get a wrap around is when the buffer + * is full. Sync the entire buffer in one go for this case. + */ + if (etr_buf->offset + etr_buf->len > etr_buf->size) + dma_sync_single_for_cpu(real_dev, flat_buf->daddr, + etr_buf->size, DMA_FROM_DEVICE); + else + dma_sync_single_for_cpu(real_dev, + flat_buf->daddr + etr_buf->offset, + etr_buf->len, DMA_FROM_DEVICE); } static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf, -- 2.25.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next reply other threads:[~2021-09-05 3:21 UTC|newest] Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-09-05 3:21 Leo Yan [this message] 2021-09-05 3:21 ` [PATCH v4] coresight: tmc-etr: Speed up for bounce buffer in flat mode Leo Yan 2021-09-13 17:56 ` Mathieu Poirier 2021-09-13 17:56 ` Mathieu Poirier 2021-09-14 6:00 ` Suzuki K Poulose 2021-09-14 6:00 ` Suzuki K Poulose 2021-09-14 15:22 ` Mathieu Poirier 2021-09-14 15:22 ` Mathieu Poirier
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20210905032144.966766-1-leo.yan@linaro.org \ --to=leo.yan@linaro.org \ --cc=alexander.shishkin@linux.intel.com \ --cc=coresight@lists.linaro.org \ --cc=linux-arm-kernel@lists.infradead.org \ --cc=linux-kernel@vger.kernel.org \ --cc=mathieu.poirier@linaro.org \ --cc=mike.leach@linaro.org \ --cc=robin.murphy@arm.com \ --cc=suzuki.poulose@arm.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.