* Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
@ 2020-05-15  8:19 ` Song Bao Hua
From: Song Bao Hua @ 2020-05-15  8:19 UTC (permalink / raw)
  To: linux, hch, m.szyprowski, robin.murphy, dagum, ralf, grundler,
	Jay.Estabrook, sailer, andrea, jens.axboe, davidm
  Cc: iommu, Linuxarm, linux-arm-kernel

Hi Russell & All,

In many streaming DMA map/unmap use cases, lower-layer device drivers have no idea how or when single/sg buffers are allocated and freed by the upper-layer filesystem, network stack, memory management subsystem, etc. So the only thing device drivers can do is map the buffer before each DMA transfer begins and unmap it when the DMA is done.

This dramatically increases the latency of dma_map_single/sg and dma_unmap_single/sg when these APIs are backed by an IOMMU: for each map, the IOMMU driver has to allocate an IOVA and program the mapping into the IOMMU, and for each unmap it has to free the IOVA and tear the mapping down in the IOMMU hardware. When the devices doing DMA are very fast, for example on 100GbE networks, the streaming map/unmap latency can become a critical system bottleneck.

Compared with the streaming DMA APIs, the coherent DMA APIs with an IOMMU backend can show much better performance, because the map is done when the buffer is allocated and the unmap is done when the buffer is freed. DMA can be done many times before the buffer is freed by dma_free_coherent(), so there is no per-transfer map/unmap overhead as with the streaming APIs. The typical workflow looks like:
dma_alloc_coherent-> 
doing DMA -> 
doing DMA ->
doing DMA ->
.... /* DMA many times */
dma_free_coherent
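
To make the flow concrete, here is a minimal sketch of the coherent pattern; "dev" is a generic device pointer and start_dma() is only a placeholder for the real transfer routine, with error handling trimmed:

#include <linux/dma-mapping.h>

/* Map once at allocation time, reuse the same DMA address for many
 * transfers, and unmap only when the buffer is finally freed. */
static int coherent_flow(struct device *dev)
{
        dma_addr_t dma_handle;
        void *cpu_addr;
        int i;

        cpu_addr = dma_alloc_coherent(dev, PAGE_SIZE, &dma_handle, GFP_KERNEL);
        if (!cpu_addr)
                return -ENOMEM;

        for (i = 0; i < 1000; i++)
                start_dma(dev, dma_handle, PAGE_SIZE);  /* no per-transfer map/unmap */

        dma_free_coherent(dev, PAGE_SIZE, cpu_addr, dma_handle);
        return 0;
}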

However, the typical workflow for streaming DMA looks like:
dma_map_sg -> doing DMA -> dma_unmap_sg -> 
dma_map_sg -> doing DMA -> dma_unmap_sg ->  
dma_map_sg -> doing DMA -> dma_unmap_sg ->  
.... /* map, DMA transfer, unmap many times */
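
Sketched in the same style (same placeholder assumptions as above, now with a caller-supplied buf/len), every iteration pays the full IOVA-allocation/map and unmap/TLB-invalidation cost:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* One map and one unmap for every single transfer. */
static int streaming_flow(struct device *dev, void *buf, size_t len)
{
        struct scatterlist sg;
        int nents;

        sg_init_one(&sg, buf, len);
        nents = dma_map_sg(dev, &sg, 1, DMA_TO_DEVICE);  /* iova alloc + iommu map */
        if (!nents)
                return -ENOMEM;

        start_dma(dev, &sg, nents);                      /* placeholder transfer */

        dma_unmap_sg(dev, &sg, 1, DMA_TO_DEVICE);        /* iommu unmap + iova free */
        return 0;
}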

Even when upper-layer software uses the same buffers multiple times, map and unmap still have to be done by the lower-level drivers for every single DMA transfer, because the lower-layer drivers have no way of knowing that the buffers are being reused.

A possible flow that improves the performance of the streaming APIs is:
dma_map_sg -> 
dma_sync_sg_for_device -> doing DMA -> 
dma_sync_sg_for_device -> doing DMA -> 
dma_sync_sg_for_device -> doing DMA -> 
... ->    /* sync between DMA and CPU many times */
dma_unmap_sg
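
A sketch of this map-once flow might look like the below (placeholders as before; I have added dma_sync_sg_for_cpu() so the CPU can safely touch the data between transfers):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Map once, keep the IOVA and IOMMU page table entries alive across
 * many transfers, and only pay for the lightweight ownership-transfer
 * syncs on each transfer. */
static int map_once_flow(struct device *dev, void *buf, size_t len)
{
        struct scatterlist sg;
        int i, nents;

        sg_init_one(&sg, buf, len);
        nents = dma_map_sg(dev, &sg, 1, DMA_BIDIRECTIONAL);
        if (!nents)
                return -ENOMEM;

        for (i = 0; i < 1000; i++) {
                dma_sync_sg_for_device(dev, &sg, 1, DMA_BIDIRECTIONAL);
                start_dma(dev, &sg, nents);              /* placeholder transfer */
                dma_sync_sg_for_cpu(dev, &sg, 1, DMA_BIDIRECTIONAL);
        }

        dma_unmap_sg(dev, &sg, 1, DMA_BIDIRECTIONAL);
        return 0;
}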

For every single DMA transfer, software then only needs the sync operations, which are much lighter than map and unmap. But this approach is often not applicable, as the buffers usually come from the upper-layer filesystem, network stack, memory management subsystem, etc., and device drivers have to assume the buffer will be freed as soon as the DMA is done. Only those device drivers which allocate and free the streaming DMA buffers themselves can benefit from reusing the same buffers for multiple DMA transfers without map/unmap overhead.

I collected some latency data for iommu_dma_map_sg and iommu_dma_unmap_sg. In the test case, zswap calls the acomp APIs to compress/decompress pages, and the compression/decompression is done by a lower-level hardware ZIP driver.
root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_map_sg
Tracing 1 functions for "iommu_dma_map_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2274570  |***********************                 |
      2048 -> 4095       : 3896310  |****************************************|
      4096 -> 8191       : 74499    |                                        |
      8192 -> 16383      : 4475     |                                        |
     16384 -> 32767      : 1519     |                                        |
     32768 -> 65535      : 480      |                                        |
     65536 -> 131071     : 286      |                                        |
    131072 -> 262143     : 18       |                                        |
    262144 -> 524287     : 2        |                                        |

root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_unmap_sg
Tracing 1 functions for "iommu_dma_unmap_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 56083    |                                        |
      4096 -> 8191       : 5232036  |****************************************|
      8192 -> 16383      : 7723     |                                        |
     16384 -> 32767      : 1277     |                                        |
     32768 -> 65535      : 32       |                                        |
     65536 -> 131071     : 12       |                                        |
    131072 -> 262143     : 41       |                                        |

In contrast, if we set IOMMU passthrough, the latency is much better:

root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_map_sg
Tracing 1 functions for "dma_direct_map_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 10798    |                                        |
      1024 -> 2047       : 1435035  |****************************************|
      2048 -> 4095       : 13879    |                                        |
      4096 -> 8191       : 485      |                                        |
      8192 -> 16383      : 791      |                                        |
     16384 -> 32767      : 418      |                                        |
     32768 -> 65535      : 55       |                                        |
     65536 -> 131071     : 67       |                                        |
    131072 -> 262143     : 8        |                                        |

root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_unmap_sg
Tracing 1 functions for "dma_direct_unmap_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 216      |                                        |
      1024 -> 2047       : 250849   |****************************************|
      2048 -> 4095       : 54341    |********                                |
      4096 -> 8191       : 80       |                                        |
      8192 -> 16383      : 191      |                                        |
     16384 -> 32767      : 65       |                                        |

In summary, the comparison is as below:
(1)map
iommu passthrough         mainly 1-2us
iommu non-passthrough     mainly 2-4us

(2)unmap
iommu passthrough         mainly 1-2us
iommu non-passthrough     mainly 4-8us

Below is a function graph trace of a single dma_map_sg/dma_unmap_sg while the IOMMU is enabled:

  507.520069 |   53)               |  iommu_dma_map_sg() {
  507.520070 |   53)   0.670 us    |    iommu_get_dma_domain();
  507.520071 |   53)   0.610 us    |    iommu_dma_deferred_attach();
  507.520072 |   53)               |    iommu_dma_alloc_iova.isra.26() {
  507.520073 |   53)               |      alloc_iova_fast() {
  507.520074 |   53)               |        _raw_spin_lock_irqsave() {
  507.520074 |   53)   0.570 us    |          preempt_count_add();
  507.520076 |   53)   2.060 us    |        }
  507.520077 |   53)               |        _raw_spin_unlock_irqrestore() {
  507.520077 |   53)   0.790 us    |          preempt_count_sub();
  507.520079 |   53)   2.090 us    |        }
  507.520079 |   53)   6.260 us    |      }
  507.520080 |   53)   7.470 us    |    }
  507.520081 |   53)               |    iommu_map_sg_atomic() {
  507.520081 |   53)               |      __iommu_map_sg() {
  507.520082 |   53)               |        __iommu_map() {
  507.520082 |   53)   0.630 us    |          iommu_pgsize.isra.14();
  507.520084 |   53)               |          arm_smmu_map() {
  507.520084 |   53)               |            arm_lpae_map() {
  507.520085 |   53)               |              __arm_lpae_map() {
  507.520086 |   53)               |                __arm_lpae_map() {
  507.520086 |   53)               |                  __arm_lpae_map() {
  507.520087 |   53)   0.930 us    |                    __arm_lpae_map();
  507.520089 |   53)   2.170 us    |                  }
  507.520089 |   53)   3.490 us    |                }
  507.520090 |   53)   4.730 us    |              }
  507.520090 |   53)   5.980 us    |            }
  507.520091 |   53)   7.250 us    |          }
  507.520092 |   53)   0.650 us    |          iommu_pgsize.isra.14();
  507.520093 |   53)               |          arm_smmu_map() {
  507.520093 |   53)               |            arm_lpae_map() {
  507.520094 |   53)               |              __arm_lpae_map() {
  507.520095 |   53)               |                __arm_lpae_map() {
  507.520096 |   53)               |                  __arm_lpae_map() {
  507.520096 |   53)   0.630 us    |                    __arm_lpae_map();
  507.520098 |   53)   1.860 us    |                  }
  507.520098 |   53)   3.210 us    |                }
  507.520099 |   53)   4.610 us    |              }
  507.520099 |   53)   5.860 us    |            }
  507.520100 |   53)   7.110 us    |          }
  507.520101 |   53) + 18.740 us   |        }
  507.520101 |   53) + 20.080 us   |      }
  507.520102 |   53) + 21.320 us   |    }
  507.520102 |   53) + 33.200 us   |  }

  783.039976 |   48)               |  iommu_dma_unmap_sg() {
  783.039977 |   48)               |    __iommu_dma_unmap() {
  783.039978 |   48)   0.720 us    |      iommu_get_dma_domain();
  783.039979 |   48)               |      iommu_unmap_fast() {
  783.039980 |   48)               |        __iommu_unmap() {
  783.039981 |   48)   0.740 us    |          iommu_pgsize.isra.14();
  783.039982 |   48)               |          arm_smmu_unmap() {
  783.039983 |   48)               |            arm_lpae_unmap() {
  783.039984 |   48)               |              __arm_lpae_unmap() {
  783.039985 |   48)               |                __arm_lpae_unmap() {
  783.039985 |   48)               |                  __arm_lpae_unmap() {
  783.039986 |   48)               |                    __arm_lpae_unmap() {
  783.039988 |   48)   0.730 us    |                      arm_smmu_tlb_inv_page_nosync();
  783.039989 |   48)   3.010 us    |                    }
  783.039990 |   48)   4.490 us    |                  }
  783.039991 |   48)   5.950 us    |                }
  783.039991 |   48)   7.460 us    |              }
  783.039992 |   48)   8.920 us    |            }
  783.039993 |   48) + 10.380 us   |          }
  783.039993 |   48) + 13.350 us   |        }
  783.039994 |   48) + 14.820 us   |      }
  783.039995 |   48)               |      arm_smmu_iotlb_sync() {
  783.039996 |   48)               |        arm_smmu_tlb_inv_range() {
  783.039996 |   48)               |          arm_smmu_cmdq_batch_add() {
  783.039997 |   48)   0.760 us    |            arm_smmu_cmdq_build_cmd();
  783.039999 |   48)   2.220 us    |          }
  783.039999 |   48)               |          arm_smmu_cmdq_issue_cmdlist() {
  783.040000 |   48)   0.530 us    |            arm_smmu_cmdq_build_cmd();
  783.040001 |   48)   0.530 us    |            __arm_smmu_cmdq_poll_set_valid_map.isra.40();
  783.040002 |   48)   0.540 us    |            __arm_smmu_cmdq_poll_set_valid_map.isra.40();
  783.040004 |   48)               |            ktime_get() {
  783.040004 |   48)   0.540 us    |              arch_counter_read();
  783.040005 |   48)   1.570 us    |            }
  783.040006 |   48)   6.880 us    |          }
  783.040007 |   48)   0.830 us    |          arm_smmu_atc_inv_domain.constprop.48();
  783.040008 |   48) + 12.910 us   |        }
  783.040009 |   48) + 14.370 us   |      }
  783.040010 |   48)               |      iommu_dma_free_iova() {
  783.040011 |   48)               |        free_iova_fast() {
  783.040011 |   48)               |          _raw_spin_lock_irqsave() {
  783.040012 |   48)   0.600 us    |            preempt_count_add();
  783.040013 |   48)   2.000 us    |          }
  783.040014 |   48)               |          _raw_spin_unlock_irqrestore() {
  783.040015 |   48)   0.820 us    |            preempt_count_sub();
  783.040016 |   48)   2.220 us    |          }
  783.040018 |   48)   6.200 us    |        }
  783.040019 |   48)   8.880 us    |      }
  783.040020 |   48) + 42.540 us   |    }
  783.040020 |   48) + 44.030 us   |  }

I am thinking of several possible ways to decrease or remove the map/unmap latency for every single DMA transfer. Since "non-strict" mode is an existing option with possible safety issues, I won't discuss it in this mail.

1. Provide bounce coherent buffers for streaming buffers.
As the coherent buffers keep their mappings, we can remove the map and unmap overhead for each single DMA operation. However, this solution requires a memory copy between the streaming buffers and the bounce buffers, so it only works if the copy is faster than map/unmap. Meanwhile, it consumes much more memory bandwidth. A rough sketch of the idea follows.
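
Just to illustrate the idea, a hypothetical sketch (the bounce_buf structure and drv_start_dma() are made-up names; whether this wins clearly depends on the copy cost):

#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Hypothetical per-device bounce buffer, mapped once (e.g. at probe time)
 * with dma_alloc_coherent(); a memcpy() replaces the per-transfer map/unmap. */
struct bounce_buf {
        void            *cpu_addr;
        dma_addr_t      dma_addr;
        size_t          size;
};

static int dma_via_bounce(struct device *dev, struct bounce_buf *bb,
                          const void *src, size_t len)
{
        if (len > bb->size)
                return -EINVAL;

        memcpy(bb->cpu_addr, src, len);         /* copy instead of map */
        drv_start_dma(dev, bb->dma_addr, len);  /* placeholder transfer */
        return 0;
}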

2. Make upper-layer kernel components aware of the pain of IOMMU map/unmap.
Upper-layer fs, mm and networking code could somehow let the lower-layer drivers know the end of the life cycle of sg buffers. In the zswap case, I have seen zswap always use the same two pages as the destination buffers for compressed pages, but the compressor driver still has to map and unmap those same two pages for every single compression, since zswap and the zip driver work in two completely different software layers.

I am thinking of something like the below; upper-layer kernel code could call:
sg_init_table(&sg...);
sg_mark_reusable(&sg....);
.... /* use the buffer many times */
....
sg_mark_stop_reuse(&sg);

After that, if a low-level driver sees the "reusable" flag, it knows the buffer can be used multiple times and does not have to map/unmap it every time. The flag means the upper-layer component will keep using the buffer, and the same buffer will probably be handed to the lower-layer driver again for new DMA transfers later. When the upper-layer code sets "stop_reuse", the lower-layer driver will unmap the sg buffers, possibly via an unmap callback provided to the upper-layer component. In the zswap case, I have seen that the same buffers are always reused while the zip driver maps and unmaps them again and again; shortly after a buffer is unmapped it is mapped again for the next transfer, with almost no time gap in between. If zswap could set the "reusable" flag, the zip driver would save a lot of time.
Meanwhile, for the safety of buffers, lower-layer drivers need to make certain the buffers have already been unmapped in iommu before those buffers go back to buddy for other users.
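
As a very rough, hypothetical sketch of the driver side, where sg_is_reusable() stands in for however the proposed flag would be queried (it does not exist today):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical per-driver state: the last reusable sg table we mapped.
 * When the upper layer signals "stop_reuse", the driver would
 * dma_unmap_sg() and clear this cache. */
struct drv_reuse_cache {
        struct scatterlist *sgl;
        int nents;
};

static int drv_map_maybe_reuse(struct device *dev, struct drv_reuse_cache *rc,
                               struct scatterlist *sgl, int nents,
                               enum dma_data_direction dir)
{
        int mapped;

        /* sg_is_reusable() is the hypothetical flag check from the proposal */
        if (sg_is_reusable(sgl) && rc->sgl == sgl && rc->nents) {
                /* mapping kept alive: just hand ownership to the device */
                dma_sync_sg_for_device(dev, sgl, nents, dir);
                return rc->nents;
        }

        mapped = dma_map_sg(dev, sgl, nents, dir);
        if (mapped && sg_is_reusable(sgl)) {
                rc->sgl = sgl;          /* remember it for the next transfer */
                rc->nents = mapped;
        }
        return mapped;
}

The tricky part, as said above, is guaranteeing that the cached mapping is torn down before the pages can go back to buddy.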

I don't think making upper-layer components aware of the overhead of map and unmap is elegant, but it might be something worth doing for performance reasons. Upper-layer software which wants to be friendly to the lower-layer driver could call sg_mark_reusable(&sg....), but it would not be enforced: if upper-layer components don't call the API, the current lower-level drivers are not affected.

Please give your comments on this proposal, and any suggestions on other possible ways to improve the performance of the streaming DMA APIs with an IOMMU backend. I would be glad to send a draft patch for "reusable" buffers if you think the idea is not bad.

Best Regards
Barry

* Re: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
  2020-05-15  8:19 ` Song Bao Hua
@ 2020-05-15 12:10   ` Robin Murphy
From: Robin Murphy @ 2020-05-15 12:10 UTC (permalink / raw)
  To: Song Bao Hua, linux, hch, m.szyprowski, dagum, ralf, grundler,
	Jay.Estabrook, sailer, andrea, jens.axboe, davidm
  Cc: iommu, Linuxarm, linux-arm-kernel

On 2020-05-15 09:19, Song Bao Hua wrote:
[ snip... nice analysis, but ultimately it's still "doing stuff has more 
overhead than not doing stuff" ]

> I am thinking of several possible ways to decrease or remove the map/unmap latency for every single DMA transfer. Since "non-strict" mode is an existing option with possible safety issues, I won't discuss it in this mail.

But passthrough and non-strict mode *specifically exist* for the cases 
where performance is the most important concern - streaming DMA with an 
IOMMU in the middle has an unavoidable tradeoff between performance and 
isolation, so dismissing that out of hand is not a good way to start 
making this argument.

> 1. Provide bounce coherent buffers for streaming buffers.
> As the coherent buffers keep their mappings, we can remove the map and unmap overhead for each single DMA operation. However, this solution requires a memory copy between the streaming buffers and the bounce buffers, so it only works if the copy is faster than map/unmap. Meanwhile, it consumes much more memory bandwidth.

I'm struggling to understand how that would work, can you explain it in 
more detail?

> 2. Make upper-layer kernel components aware of the pain of IOMMU map/unmap.
> Upper-layer fs, mm and networking code could somehow let the lower-layer drivers know the end of the life cycle of sg buffers. In the zswap case, I have seen zswap always use the same two pages as the destination buffers for compressed pages, but the compressor driver still has to map and unmap those same two pages for every single compression, since zswap and the zip driver work in two completely different software layers.
> 
> I am thinking of something like the below; upper-layer kernel code could call:
> sg_init_table(&sg...);
> sg_mark_reusable(&sg....);
> .... /* use the buffer many times */
> ....
> sg_mark_stop_reuse(&sg);
> 
> After that, if a low-level driver sees the "reusable" flag, it knows the buffer can be used multiple times and does not have to map/unmap it every time. The flag means the upper-layer component will keep using the buffer, and the same buffer will probably be handed to the lower-layer driver again for new DMA transfers later. When the upper-layer code sets "stop_reuse", the lower-layer driver will unmap the sg buffers, possibly via an unmap callback provided to the upper-layer component. In the zswap case, I have seen that the same buffers are always reused while the zip driver maps and unmaps them again and again; shortly after a buffer is unmapped it is mapped again for the next transfer, with almost no time gap in between. If zswap could set the "reusable" flag, the zip driver would save a lot of time.
> Meanwhile, for the safety of buffers, lower-layer drivers need to make certain the buffers have already been unmapped in iommu before those buffers go back to buddy for other users.

That sounds like it would only have benefit in a very small set of 
specific circumstances, and would be very difficult to generalise to 
buffers that are mapped via dma_map_page() or dma_map_single(). 
Furthermore, a high-level API that affects a low-level driver's 
interpretation of mid-layer API calls without the mid-layer's knowledge 
sounds like a hideous abomination of anti-design. If a mid-layer API 
lends itself to inefficiency at the lower level, it would seem a lot 
cleaner and more robust to extend *that* API for stateful buffer reuse. 
Failing that, it might possibly be appropriate to approach this at the 
driver level - many of the cleverer network drivers already implement 
buffer pools to recycle mapped SKBs internally, couldn't the "zip 
driver" simply try doing something like that for itself?

Robin.

* Re: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
  2020-05-15 12:10   ` Robin Murphy
@ 2020-05-15 14:45     ` hch
From: hch @ 2020-05-15 14:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: davidm, ralf, Linuxarm, linux, iommu, sailer, Jay.Estabrook,
	dagum, andrea, grundler, jens.axboe, hch, linux-arm-kernel

On Fri, May 15, 2020 at 01:10:21PM +0100, Robin Murphy wrote:
>> Meanwhile, for the safety of buffers, lower-layer drivers need to make certain the buffers have already been unmapped in iommu before those buffers go back to buddy for other users.
>
> That sounds like it would only have benefit in a very small set of specific 
> circumstances, and would be very difficult to generalise to buffers that 
> are mapped via dma_map_page() or dma_map_single(). Furthermore, a 
> high-level API that affects a low-level driver's interpretation of 
> mid-layer API calls without the mid-layer's knowledge sounds like a hideous 
> abomination of anti-design. If a mid-layer API lends itself to inefficiency 
> at the lower level, it would seem a lot cleaner and more robust to extend 
> *that* API for stateful buffer reuse. Failing that, it might possibly be 
> appropriate to approach this at the driver level - many of the cleverer 
> network drivers already implement buffer pools to recycle mapped SKBs 
> internally, couldn't the "zip driver" simply try doing something like that 
> for itself?

Exactly.  If your upper consumer of the DMA API keeps reusing the same pages, just map them once and use dma_sync_* to transfer ownership as needed.

* RE: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
  2020-05-15 14:45     ` hch
@ 2020-05-15 21:33       ` Song Bao Hua
From: Song Bao Hua @ 2020-05-15 21:33 UTC (permalink / raw)
  To: hch, Robin Murphy
  Cc: davidm, ralf, Linuxarm, linux, iommu, sailer, Jay.Estabrook,
	dagum, andrea, grundler, jens.axboe, linux-arm-kernel

> Subject: Re: Constantly map and unmap of streaming DMA buffers with
> IOMMU backend might cause serious performance problem
> 
> On Fri, May 15, 2020 at 01:10:21PM +0100, Robin Murphy wrote:
> >> Meanwhile, for the safety of buffers, lower-layer drivers need to make
> certain the buffers have already been unmapped in iommu before those
> buffers go back to buddy for other users.
> >
> > That sounds like it would only have benefit in a very small set of specific
> > circumstances, and would be very difficult to generalise to buffers that
> > are mapped via dma_map_page() or dma_map_single(). Furthermore, a
> > high-level API that affects a low-level driver's interpretation of
> > mid-layer API calls without the mid-layer's knowledge sounds like a hideous
> > abomination of anti-design. If a mid-layer API lends itself to inefficiency
> > at the lower level, it would seem a lot cleaner and more robust to extend
> > *that* API for stateful buffer reuse. Failing that, it might possibly be
> > appropriate to approach this at the driver level - many of the cleverer
> > network drivers already implement buffer pools to recycle mapped SKBs
> > internally, couldn't the "zip driver" simply try doing something like that
> > for itself?
> 
> Exactly.  If your upper consumer of the DMA API keeps reusing the same
> pages, just map them once and use dma_sync_* to transfer ownership as
> needed.

The problem is that the lower-layer drivers don't know whether the upper consumer keeps reusing the same pages. They are running in different software layers.
For example, the consumer here is in mm/zswap.c:
static int zswap_frontswap_store(unsigned type, pgoff_t offset,
				struct page *page)
{
	...
	/* compress */
	dst = get_cpu_var(zswap_dstmem);
	...
	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
	...
}

But the lower-layer driver is in drivers/crypto/...

Meanwhile, the lower-layer driver can't cache the buffer addresses coming from consumers to detect whether the upper layer is reusing the same page, because the same page might come from different users, or from different stages of the same user with different permissions.
 
For example, consumer A uses the buffer as destination, then returns it to buddy, but consumer B gets the same buffer and uses it as source.

Another possibility is that consumer A uses the buffer, returns it to buddy, and after some time allocates a buffer again and gets the same buffer back from buddy, just like before.

For the safety of the buffer, lower-layer driver must guarantee the buffer is unmapped when the buffer returns to buddy.

I think only the upper-layer consumer knows if it is reusing the buffer. 

Thanks
Barry


* Re: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
  2020-05-15 21:33       ` Song Bao Hua
@ 2020-05-15 22:12         ` Robin Murphy
  -1 siblings, 0 replies; 12+ messages in thread
From: Robin Murphy @ 2020-05-15 22:12 UTC (permalink / raw)
  To: Song Bao Hua, hch
  Cc: davidm, ralf, Linuxarm, linux, iommu, sailer, Jay.Estabrook,
	dagum, andrea, grundler, jens.axboe, linux-arm-kernel

On 2020-05-15 22:33, Song Bao Hua wrote:
>> Subject: Re: Constantly map and unmap of streaming DMA buffers with
>> IOMMU backend might cause serious performance problem
>>
>> On Fri, May 15, 2020 at 01:10:21PM +0100, Robin Murphy wrote:
>>>> Meanwhile, for the safety of buffers, lower-layer drivers need to make
>> certain the buffers have already been unmapped in iommu before those
>> buffers go back to buddy for other users.
>>>
>>> That sounds like it would only have benefit in a very small set of specific
>>> circumstances, and would be very difficult to generalise to buffers that
>>> are mapped via dma_map_page() or dma_map_single(). Furthermore, a
>>> high-level API that affects a low-level driver's interpretation of
>>> mid-layer API calls without the mid-layer's knowledge sounds like a hideous
>>> abomination of anti-design. If a mid-layer API lends itself to inefficiency
>>> at the lower level, it would seem a lot cleaner and more robust to extend
>>> *that* API for stateful buffer reuse. Failing that, it might possibly be
>>> appropriate to approach this at the driver level - many of the cleverer
>>> network drivers already implement buffer pools to recycle mapped SKBs
>>> internally, couldn't the "zip driver" simply try doing something like that
>>> for itself?
>>
>> Exactly.  If your upper consumer of the DMA API keeps reusing the same
>> pages just map them once and use dma_sync_* to transfer ownership as
>> needed.
> 
> The problem is that the lower-layer drivers don't know whether the upper-layer consumer keeps reusing the same pages; the two are running in completely different software layers.
> For example, the consumer is here in mm/zswap.c:
> static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> 				struct page *page)
> {
> 	...
> 	/* compress */
> 	dst = get_cpu_var(zswap_dstmem);
> 	...
> 	ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> 	...
> }
> 
> But the lower-layer driver is in drivers/crypto/...
> 
> Meanwhile, the lower-layer driver cannot cache the buffer addresses coming from consumers to detect whether the upper layer is reusing the same page, because the same page might come from different users, or from different stages of the same user with different permissions.

Indeed the driver can't cache arbitrary pointers, but if typical buffers 
are small enough it can copy the data into its own already-mapped page, 
dma_sync it, and perform the DMA operation from there. That might even 
be more or less what your first suggestion was, but I'm still not quite 
sure.
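
Something like the sketch below is the kind of thing I mean - struct zip_dev and its bounce_* fields are my assumptions (a buffer allocated and dma_map_single()'d once at probe time), and the actual DMA submission is elided:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/string.h>

struct zip_dev {                        /* hypothetical driver state */
        struct device *dev;
        void *bounce_cpu;               /* buffer allocated at probe time */
        dma_addr_t bounce_dma;          /* dma_map_single()'d at probe time */
};

static int zip_dev_compress_bounced(struct zip_dev *zip, const void *src,
                                    size_t len)
{
        if (len > PAGE_SIZE)
                return -EINVAL;         /* fall back to the map/unmap path */

        /* take the buffer back for the CPU, fill it, hand it to the device */
        dma_sync_single_for_cpu(zip->dev, zip->bounce_dma, len, DMA_TO_DEVICE);
        memcpy(zip->bounce_cpu, src, len);
        dma_sync_single_for_device(zip->dev, zip->bounce_dma, len,
                                   DMA_TO_DEVICE);

        /* ... submit the DMA reading from zip->bounce_dma and wait ... */
        return 0;
}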

> For example, consumer A uses the buffer as a destination and then returns it to buddy, but consumer B gets the same buffer and uses it as a source.
> 
> Another possibility is that consumer A uses the buffer, returns it to buddy, and after some time allocates again and happens to get the same buffer back from buddy.
> 
> For the safety of the buffer, the lower-layer driver must guarantee that the buffer is unmapped before it goes back to buddy.
> 
> I think only the upper-layer consumer knows whether it is reusing the buffer.

Right, and if reusing buffers is common in crypto callers, then there's 
an argument for "set up reusable buffer", "process updated buffer" and 
"clean up buffer" operations to be added to the crypto API itself, such 
that the underlying drivers can then optimise for DMA usage in a robust 
and obvious way if they want to (or just implement the setup and 
teardown as no-ops and still do a full map/unmap in each "process" call 
if they don't).
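
As a very rough sketch of the shape I mean - none of these ops or names exist in the crypto API today, they are purely illustrative:

#include <linux/scatterlist.h>
#include <crypto/acompress.h>

struct acomp_reusable_buf;      /* opaque handle, owned by the driver */

struct acomp_reuse_ops {
        /* prepare (e.g. DMA-map) a buffer the caller promises to reuse */
        struct acomp_reusable_buf *(*setup)(struct crypto_acomp *tfm,
                                            struct scatterlist *sgl,
                                            int nents);
        /* run one (de)compression over an already-set-up buffer */
        int (*process)(struct crypto_acomp *tfm,
                       struct acomp_reusable_buf *buf);
        /* release (e.g. DMA-unmap); only after this may the pages be freed */
        void (*teardown)(struct crypto_acomp *tfm,
                         struct acomp_reusable_buf *buf);
};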

Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem
  2020-05-15 12:10   ` Robin Murphy
@ 2020-05-15 22:45     ` Song Bao Hua
  -1 siblings, 0 replies; 12+ messages in thread
From: Song Bao Hua @ 2020-05-15 22:45 UTC (permalink / raw)
  To: Robin Murphy, linux, hch, m.szyprowski, dagum, ralf, grundler,
	Jay.Estabrook, sailer, andrea, jens.axboe, davidm
  Cc: iommu, Linuxarm, linux-arm-kernel

> Subject: Re: Constantly map and unmap of streaming DMA buffers with
> IOMMU backend might cause serious performance problem
> 
> On 2020-05-15 09:19, Song Bao Hua wrote:
> [ snip... nice analysis, but ultimately it's still "doing stuff has more overhead
> than not doing stuff" ]
> 
> > I am thinking several possible ways on decreasing or removing the latency of
> DMA map/unmap for every single DMA transfer. Meanwhile, "non-strict" as an
> existing option with possible safety issues, I won't discuss it in this mail.
> 
> But passthrough and non-strict mode *specifically exist* for the cases where
> performance is the most important concern - streaming DMA with an IOMMU
> in the middle has an unavoidable tradeoff between performance and isolation,
> so dismissing that out of hand is not a good way to start making this
> argument.

I do understand there is a tradeoff between performance and isolation. However, users might want performance while still keeping isolation.
In passthrough mode, the whole of memory may be accessible by DMA. In non-strict mode, a buffer can still be mapped in the IOMMU after its user has returned it to buddy, and even after the page has been allocated to another user.

> 
> > 1. provide bounce coherent buffers for streaming buffers.
> > As the coherent buffers keep the status of mapping, we can remove the
> overhead of map and unmap for each single DMA operations. However, this
> solution requires memory copy between stream buffers and bounce buffers.
> Thus it will work only if copy is faster than map/unmap. Meanwhile, it will
> consume much more memory bandwidth.
> 
> I'm struggling to understand how that would work, can you explain it in more
> detail?

Lower-layer drivers maintain some reusable coherent buffers.
On the TX path, the driver copies the streaming buffer into a coherent buffer and then does DMA;
on the RX path, the driver does DMA into a coherent buffer and then copies the data out to the streaming buffer.
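
A rough sketch of what I mean is below; struct chan_bounce and its fields are made up, and the coherent buffer is assumed to be allocated once at probe time with dma_alloc_coherent() and freed at remove time:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/string.h>

struct chan_bounce {                    /* hypothetical per-queue state */
        struct device *dev;
        void *cpu;                      /* from dma_alloc_coherent() at probe */
        dma_addr_t dma;
        size_t size;
};

/* TX: copy the streaming buffer into the coherent buffer, then DMA */
static int chan_tx(struct chan_bounce *bc, const void *buf, size_t len)
{
        if (len > bc->size)
                return -EINVAL;
        memcpy(bc->cpu, buf, len);
        /* ... start DMA reading from bc->dma, wait for completion ... */
        return 0;
}

/* RX: DMA into the coherent buffer, then copy out to the streaming buffer */
static int chan_rx(struct chan_bounce *bc, void *buf, size_t len)
{
        if (len > bc->size)
                return -EINVAL;
        /* ... start DMA writing to bc->dma, wait for completion ... */
        memcpy(buf, bc->cpu, len);
        return 0;
}

No map/unmap and no dma_sync is needed on the hot path; the cost is the extra memcpy and the memory bandwidth it consumes, as said above.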

> 
> > 2.make upper-layer kernel components aware of the pain of iommu
> > map/unmap upper-layer fs, mm, networks can somehow let the lower-layer
> drivers know the end of the life cycle of sg buffers. In zswap case, I have seen
> zswap always use the same 2 pages as the destination buffers to save
> compressed page, but the compressor driver still has to constantly map and
> unmap those same two pages for every single compression since zswap and zip
> drivers are working in two completely different software layers.
> >
> > I am thinking some way as below, upper-layer kernel code can call:
> > sg_init_table(&sg...);
> > sg_mark_reusable(&sg....);
> > .... /* use the buffer many times */
> > ....
> > sg_mark_stop_reuse(&sg);
> >
> > After that, if low level drivers see "reusable" flag, it will realize the buffer can
> be used multiple times and will not do map/unmap every time. it means
> upper-layer components will further use the buffers and the same buffers will
> probably be given to lower-layer drivers for new DMA transfer later. When
> upper-layer code sets " stop_reuse", lower-layer driver will unmap the sg
> buffers, possibly by providing a unmap-callback to upper-layer components.
> For zswap case, I have seen the same buffers are always re-used and zip driver
> maps and unmaps it again and again. Shortly after the buffer is unmapped, it
> will be mapped in the next transmission, almost without any time gap
> between unmap and map. In case zswap can set the "reusable" flag, zip driver
> will save a lot of time.
> > Meanwhile, for the safety of buffers, lower-layer drivers need to make certain
> the buffers have already been unmapped in iommu before those buffers go
> back to buddy for other users.
> 
> That sounds like it would only have benefit in a very small set of specific
> circumstances, and would be very difficult to generalise to buffers that are
> mapped via dma_map_page() or dma_map_single().

Yes, indeed. Hopefully the small set of specific circumstances will encourage more upper-layer consumers to reuse buffers, and the "reusable" flag can then be extended to more common cases such as dma_map_page() and dma_map_single() buffers.
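
For the driver side, I imagine something roughly like the sketch below. sg_is_reusable(), drv_mapping_cached() and drv_map_and_cache() are entirely hypothetical helpers, nothing like them exists today; only dma_map_sg() and dma_sync_sg_for_device() are real:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* the three helpers below are hypothetical and only sketched here */
bool sg_is_reusable(struct scatterlist *sgl);
bool drv_mapping_cached(struct scatterlist *sgl);
int drv_map_and_cache(struct device *dev, struct scatterlist *sgl,
                      int nents, enum dma_data_direction dir);

static int drv_map_for_dma(struct device *dev, struct scatterlist *sgl,
                           int nents, enum dma_data_direction dir)
{
        if (sg_is_reusable(sgl)) {
                /* map only once; later calls just transfer ownership */
                if (!drv_mapping_cached(sgl))
                        return drv_map_and_cache(dev, sgl, nents, dir);
                dma_sync_sg_for_device(dev, sgl, nents, dir);
                return nents;
        }

        /* no reuse promise from the upper layer: map/unmap as usual */
        return dma_map_sg(dev, sgl, nents, dir);
}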

> Furthermore, a high-level API that affects a low-level driver's interpretation of
> mid-layer API calls without the mid-layer's knowledge sounds like a hideous
> abomination of anti-design. If a mid-layer API lends itself to inefficiency at the
> lower level, it would seem a lot cleaner and more robust to extend *that* API
> for stateful buffer reuse.

Absolutely agree, and I am not claiming the method is elegant. For the moment, maybe "reuse" can start from a small case like zswap. After a while, more users might be encouraged to optimise for buffer reuse once they understand the cost imposed on lower-layer drivers, and those performance problems could then be solved case by case.
On the other hand, it is always up to the upper-layer code whether to indicate "reuse" or not. If it says nothing about reuse, lower-layer drivers simply keep doing map and unmap.

> Failing that, it might possibly be appropriate to approach this at the driver
> level - many of the cleverer network drivers already implement buffer pools to
> recycle mapped SKBs internally, couldn't the "zip driver" simply try doing
> something like that for itself?

Are those buffer pools for the RX path? On the TX path the buffers come from the upper layer, so network drivers can't do anything to recycle SKBs, can they?

> 
> Robin.

Thanks
Barry
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-05-15 22:46 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-15  8:19 Constantly map and unmap of streaming DMA buffers with IOMMU backend might cause serious performance problem Song Bao Hua
2020-05-15  8:19 ` Song Bao Hua
2020-05-15 12:10 ` Robin Murphy
2020-05-15 12:10   ` Robin Murphy
2020-05-15 14:45   ` hch
2020-05-15 14:45     ` hch
2020-05-15 21:33     ` Song Bao Hua
2020-05-15 21:33       ` Song Bao Hua
2020-05-15 22:12       ` Robin Murphy
2020-05-15 22:12         ` Robin Murphy
2020-05-15 22:45   ` Song Bao Hua
2020-05-15 22:45     ` Song Bao Hua
