From: Ferruh Yigit
Subject: Re: [PATCH v2] kni: add IOVA va support for kni
Date: Wed, 3 Apr 2019 17:29:25 +0100
To: Kiran Kumar Kokkilagadda
Cc: "dev@dpdk.org", Jerin Jacob
In-Reply-To: <20190401095118.4176-1-kirankumark@marvell.com>

On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
> From: Kiran Kumar K
>
> With the current KNI implementation the kernel module works only in
> IOVA=PA mode. This patch adds support for the kernel module to work
> in IOVA=VA mode.

Thanks Kiran for removing the limitation. I have a few questions, can you
please help me understand?

And when this patch is ready, the restriction in 'linux/eal/eal.c', in
'rte_eal_init', should be removed, perhaps with this patch. I assume you
are already doing that to be able to test this patch.

>
> The idea is to maintain a mapping in the KNI module between user pages and
> kernel pages, and in the fast path perform a lookup in this table to get
> the kernel virtual address for the corresponding user virtual address.
>
> In IOVA=VA mode, the memory allocated to the pool is physically and
> virtually contiguous. We take advantage of this and create a mapping in
> the kernel. In the kernel we need a mapping for the queues
> (tx_q, rx_q, ... slow path) and for the mbuf memory (fast path).

Is it?
As far as I know a mempool can have multiple chunks, and they can be both
virtually and physically separated.

And even for a single chunk, which will be virtually contiguous, will it
be physically contiguous?

>
> At KNI init time, in the slow path, we create a mapping for the queues
> and mbufs using get_user_pages, similar to af_xdp. Using the pool memory
> base address, we create a page map table for the mbufs, which we use in
> the fast path for kernel page translation.
>
> At KNI init time, we pass the base address of the pool and the size of
> the pool to the kernel. In the kernel, using the get_user_pages API, we
> get the pages with size PAGE_SIZE and store the mapping and the start
> address of the user space region in a table.
>
> In the fast path, for any user address, perform PAGE_SHIFT
> (user_addr >> PAGE_SHIFT) and subtract the start address from this value;
> this gives the index of the kernel page within the page map table.
> Adding the offset to this kernel page address gives the kernel address
> for this user virtual address.
>
> For example, the user pool base address is X, and the size is S, which we
> passed to the kernel. In the kernel we create a mapping for this using
> get_user_pages. Our page map table will look like
> [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2), ...] and the user start page will be U
> (we get it from X >> PAGE_SHIFT).
>
> For any user address Z we get the index into the page map table using
> ((Z >> PAGE_SHIFT) - U). Adding the offset (Z & (PAGE_SIZE - 1)) to that
> page's kernel address gives the kernel virtual address.
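
To check my understanding of the translation described above, here is a
minimal userspace-style sketch of the lookup. The names ('ex_page_map',
'pages', 'start_page') are illustrative only, not the structures the patch
actually uses:

/* Minimal sketch of the VA -> KVA lookup described above.
 * pages[] would hold the kernel VA of each pinned page,
 * i.e. [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2), ...].
 */
#include <stdint.h>

#define EX_PAGE_SHIFT 12                      /* assume 4K pages */
#define EX_PAGE_SIZE  (1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (EX_PAGE_SIZE - 1)

struct ex_page_map {
	uint8_t **pages;      /* per-page kernel virtual addresses */
	uint64_t start_page;  /* U = pool base X >> PAGE_SHIFT */
};

/* Translate user virtual address Z to the matching kernel virtual address. */
static inline void *
ex_usr2kva(const struct ex_page_map *map, uint64_t z)
{
	uint64_t index = (z >> EX_PAGE_SHIFT) - map->start_page;

	return map->pages[index] + (z & EX_PAGE_MASK);
}

This is the same arithmetic as ((Z >> PAGE_SHIFT) - U) for the index plus
(Z & (PAGE_SIZE - 1)) for the offset, assuming the page map was built from
get_user_pages over [X, X+S).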
>
> Signed-off-by: Kiran Kumar K

<...>

> +int
> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	/* Get at least one page */
> +	if (size < PAGE_SIZE)
> +		size = PAGE_SIZE;
> +
> +	/* Compute number of user pages based on page size */
> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +	/* Allocate memory for the pages */
> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
> +			   GFP_KERNEL | __GFP_NOWARN);
> +	if (!mem->pgs) {
> +		pr_err("%s: -ENOMEM\n", __func__);
> +		return -ENOMEM;
> +	}
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	/* Get the user pages from the user address */
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 9, 0)
> +	npgs = get_user_pages((u64)address, mem->npgs,
> +			      gup_flags, &mem->pgs[0], NULL);
> +#else
> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
> +			      gup_flags, 0, &mem->pgs[0], NULL);
> +#endif
> +	up_write(&current->mm->mmap_sem);

This should work even if the memory is not physically contiguous, right?
Where exactly does the physically contiguous requirement come from?

<...>

> +
> +/* Get the kernel address from the user address using
> + * the page map table. Will be used only in IOVA=VA mode.
> + */
> +static inline void *
> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
> +{
> +	uint32_t index;
> +
> +	/* User page - start user page gives the index
> +	 * within the page map table
> +	 */
> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
> +
> +	/* Add the offset to the page address */
> +	return (kni->va_info.page_map[index].addr +
> +		(usr_addr & kni->va_info.page_mask));
> +}
> +
>  /* physical address to kernel virtual address */
>  static void *
>  pa2kva(void *pa)
> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>  		return;
>
>  	for (i = 0; i < num_rx; i++) {
> -		kva = pa2kva(kni->pa[i]);
> +		if (likely(kni->iova_mode == 1))
> +			kva = get_kva((u64)(kni->pa[i]), kni);

kni->pa[] now holds IOVA addresses; for 'get_kva()' to work, shouldn't
'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
'mempool_memhdr->addr'?

If this is working, I must be missing something, but I am not able to find
what it is.

<...>

> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>  	kni->group_id = conf->group_id;
>  	kni->mbuf_size = conf->mbuf_size;
>
> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
> +	if (dev_info.iova_mode) {
> +		struct rte_mempool_memhdr *hdr;
> +		uint64_t pool_size = 0;
> +
> +		/* In each pool header chunk, we will maintain the
> +		 * base address of the pool. This chunk is physically and
> +		 * virtually contiguous.
> +		 * This approach will work only if the allocated pool
> +		 * memory is contiguous, else it won't work.
> +		 */
> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
> +		dev_info.mbuf_va = (void *)(hdr->addr);
> +
> +		/* Traverse the list and get the total size of the pool */
> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
> +			pool_size += hdr->len;
> +		}

This code is aware that there may be multiple chunks, but assumes they are
all contiguous; I don't know if this assumption is correct.

Also, I guess there is another assumption that there will be a single
pktmbuf_pool in the application, which is the one passed into KNI?
What if there are multiple pktmbuf_pools, like one for each PMD; will this
work? Then some mbufs in the KNI Rx fifo will come from a different
pktmbuf_pool whose pages we don't know, so we won't be able to get their
kernel virtual addresses.
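
For reference, this is roughly the kind of check I would expect before
relying on a single (base address, size) mapping. The helper itself is
hypothetical and not part of the patch; only the 'rte_mempool_memhdr'
fields ('addr', 'len', 'next') are taken from the code quoted above:

/* Hypothetical helper: returns 1 only when every chunk of the mempool
 * starts exactly where the previous one ends, i.e. the pool memory is
 * virtually contiguous and one (base, size) mapping can cover it.
 */
#include <sys/queue.h>
#include <rte_mempool.h>

static int
kni_mempool_is_va_contig(struct rte_mempool *mp)
{
	struct rte_mempool_memhdr *hdr;
	char *expected = NULL;

	STAILQ_FOREACH(hdr, &mp->mem_list, next) {
		if (expected != NULL && (char *)hdr->addr != expected)
			return 0; /* gap between chunks */
		expected = (char *)hdr->addr + hdr->len;
	}
	return 1;
}

If any chunk breaks this property, a page map built only from the first
chunk's base address and the summed pool size cannot cover all mbufs.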