linux-rdma.vger.kernel.org archive mirror
From: Chaitanya Kulkarni <chaitanyak@nvidia.com>
To: "lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	linux-rdma <linux-rdma@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Bart Van Assche <bvanassche@acm.org>,
	"kbusch@kernel.org" <kbusch@kernel.org>,
	Damien Le Moal <damien.lemoal@opensource.wdc.com>,
	Amir Goldstein <amir73il@gmail.com>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>,
	"jack@suse.com" <jack@suse.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>
Subject: [LSF/MM/BPF TOPIC] [LSF/MM/BPF ATTEND] : Two stage IOMMU DMA mapping operations
Date: Tue, 27 Feb 2024 08:17:27 +0000	[thread overview]
Message-ID: <97f385db-42c9-4c04-8fba-9b1ba8ffc525@nvidia.com> (raw)

Hi,

* Problem Statement :-
-------------------------------------------------------------------------
The existing IOMMU DMA mapping operation performs two steps in a single
(one-shot) call:
1. Allocate IOVA space.
2. Map the DMA pages into that space.
For example, mapping a scatter-gather list:
dma_map_sg_attrs()
   __dma_map_sg_attrs
     ops->map_sg()
       iommu_dma_map_sg()
         Calculate the length of IOVA space that is needed

         /* ####### step one allocate IOVA space ####### */
         iommu_dma_alloc_iova()

         /* ####### step two actually map DMA Pages ####### */
         iommu_map_sg()
           for each entry in sg list()
             __iommu_map()
               iommu_domain_ops->map_pages()

This one-shot operation works well for simple scenarios where callers
use the existing DMA API in the control path while they set up
hardware.

However, in more complex scenarios, where DMA mapping is needed in the
data path, and especially when a specific intermediary datatype (such
as an sg list) is involved, this one-shot approach:

1. Forces developers to introduce new DMA APIs for each specific
    datatype, e.g., the existing scatter-gather mapping functions
    spread across subsystems :-

    dma_map_sgtable()
      __dma_map_sg_attrs()
    dma_unmap_sg_attrs()
    blk_rq_map_sg()
      __blk_rq_map_sg()
      __blk_bvec_map_sg()
      __blk_bios_map_sg()
    blk_bvec_map_sg()

    OR

    Chuck Lever's latest RFC series [1], which aims to incorporate
    biovec-based DMA mapping (expanding struct bio_vec with DMA
    addresses). struct folio will probably require the same treatment.

2. Creates a dependency on the data type, forcing allocation and
    de-allocation of the intermediary data type, plus page-to-data-type
    mapping and unmapping, in the fast path (submission or completion).

* Proposed approach and discussion points :-
-------------------------------------------------------------------------

Instead of teaching the DMA API about specific datatypes and creating a
dependency on them, which may add performance overhead from mapping and
allocation, we propose splitting the existing DMA mapping routine into
two steps:

Step 1 : Provide an option for API users (subsystems) to perform all
          calculations internally, in advance.
Step 2 : Map pages when they are needed.

These advanced DMA mapping APIs need to calculate the size of the IOVA
range to allocate as one chunk, plus the offset calculations that
determine which part of the IOVA range maps to which page.

The new API will also allow us to remove the dependency on the sg list as
discussed previously in [2].

The main advantages of this approach, as seen in the upcoming RFC
series, are:

1. Simplified & increased performance in page fault handling for
    On-Demand-Paging (ODP) mode for RDMA.
2. Reduced memory footprint for VFIO PCI live migration code.
3. Reduced overhead of intermediary sg table manipulation in the fast
    path for storage drivers, where block layer requests are first
    mapped onto an sg table and the sg table is then mapped for DMA :-
    xxx_queue_rq()
      allocate sg table
      blk_rq_map_sg()
        merge and map bvecs to sg entries
      dma_map_sgtable()
        map pages in sg table for DMA

To create a good platform for a concrete and meaningful discussion at
LSF/MM/BPF 2024, we plan to post an RFC within the next two weeks.

Required Attendees list :-

Christoph Hellwig
Jason Gunthorpe
Jens Axboe
Chuck Lever
David Howells
Keith Busch
Bart Van Assche
Damien Le Moal
Martin Petersen

-ck

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
[2] https://lore.kernel.org/linux-iommu/20200708065014.GA5694@lst.de/




Thread overview: 5+ messages
2024-02-27  8:17 Chaitanya Kulkarni [this message]
2024-02-27 11:30 ` [LSF/MM/BPF TOPIC] [LSF/MM/BPF ATTEND] : Two stage IOMMU DMA mapping operations Leon Romanovsky
2024-03-03 16:43   ` Zhu Yanjun
2024-03-04  2:27     ` Zhu Yanjun
2024-03-05 13:03 ` Jason Gunthorpe
