From: Jason Wang
Date: Tue, 26 Oct 2021 12:29:30 +0800
Subject: Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
To: Eugenio Perez Martin
Cc: qemu-devel
References: <20211001070603.307037-1-eperezma@redhat.com>
 <20211001070603.307037-19-eperezma@redhat.com>
 <79905c11-e313-ad60-17dc-1a47d35f12cc@redhat.com>

On Thu, Oct 21, 2021 at 10:34 PM Eugenio Perez Martin wrote:
>
> On Thu, Oct 21, 2021 at 10:12 AM Jason Wang wrote:
> >
> > On Thu, Oct 21, 2021 at 3:03 PM Eugenio Perez Martin wrote:
> > >
> > > On Thu, Oct 21, 2021 at 4:34 AM Jason Wang wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin wrote:
> > > > >
> > > > > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang wrote:
> > > > > >
> > > > > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin wrote:
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang wrote:
> > > > > > > > >
> > > > > > > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > > > > > >
> > > > > > > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > > > > > > devices with limited IOVA space needs more capabilities, like allocating
> > > > > > > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > > > > > >
> > > > > > > > > I don't see any reverse translation used in the shadow code. Or
> > > > > > > > > anything I missed?
> > > > > > > >
> > > > > > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > > > > > it's better to decouple it into an independent allocator instead of
> > > > > > > > the vhost iova tree.
> > > > > > >
> > > > > > > Reverse translation is used every time a buffer is made available,
> > > > > > > since the buffers' contents are not copied, only the descriptors to
> > > > > > > the SVQ vring.
> > > > > >
> > > > > > I may miss something but I didn't see the code? Qemu knows the VA of
> > > > > > the virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> > > > >
> > > > > It's used in patch 20/20, could that be the misunderstanding? The
> > > > > function calling it is vhost_svq_translate_addr.
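To make the reverse translation being discussed concrete, here is a minimal,
self-contained C sketch of what a qemu-VA -> IOVA lookup over a set of
IOVA/VA mappings has to compute before descriptors are written to the SVQ
vring. This is not the code from the series (vhost_svq_translate_addr is only
referenced above); every name below is hypothetical.

#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;   /* address the device will use (SVQ IOVA) */
    uint64_t vaddr;  /* qemu virtual address backing it */
    uint64_t size;
} MapEntry;

/* Reverse translation: qemu VA -> IOVA. Linear scan kept for clarity. */
static bool va_to_iova(const MapEntry *maps, size_t n,
                       uint64_t vaddr, uint64_t *iova_out)
{
    for (size_t i = 0; i < n; i++) {
        if (vaddr >= maps[i].vaddr && vaddr < maps[i].vaddr + maps[i].size) {
            *iova_out = maps[i].iova + (vaddr - maps[i].vaddr);
            return true;
        }
    }
    return false; /* no mapping: the assert(map) case discussed below */
}

int main(void)
{
    MapEntry maps[] = {
        { .iova = 0x1000,  .vaddr = 0x7f0000000000ULL, .size = 0x10000 },
        { .iova = 0x20000, .vaddr = 0x7f0000100000ULL, .size = 0x4000  },
    };
    uint64_t iova;

    if (va_to_iova(maps, 2, 0x7f0000000200ULL, &iova)) {
        printf("iova = 0x%" PRIx64 "\n", iova);  /* prints iova = 0x1200 */
    }
    return 0;
}

The linear scan is only for illustration; as the thread notes, the series
keeps these mappings in a tree so the lookup is O(log(N)).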
> > > > >
> > > > > Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> > > > > iova to the device. That is the translation I mean.
> > > >
> > > > Ok, I get you. So if I understand correctly, what you did is:
> > > >
> > > > 1) allocate IOVA during region_add
> > > > 2) perform VA->IOVA reverse lookup in handle_kick
> > > >
> > > > This should be fine, but here are some suggestions:
> > > >
> > > > 1) remove the assert(map) in vhost_svq_translate_addr() since the guest
> > > > can add e.g. a BAR address
> > >
> > > Wouldn't VirtQueue block them in the virtqueue_pop / address_space_read_*
> > > functions? I'm fine with removing it, but I would say it adds value
> > > against coding errors.
> >
> > I think not. Though these addresses were excluded in
> > vhost_vdpa_listener_skipped_section(), for Qemu's memory core they are
> > valid addresses. Qemu emulates how hardware works (e.g. pci p2p), so dma
> > to a bar is allowed.
>
> Ok, I will treat them as errors.
>
> > > > 2) we probably need a better name than vhost_iova_tree_alloc(), maybe
> > > > "vhost_iova_tree_map_alloc()"
> > >
> > > Ok, I will change it for the next version.
> > >
> > > > There's actually another method.
> > > >
> > > > 1) don't do IOVA/map allocation in region_add()
> > > > 2) do the allocation in handle_kick(), then we know the IOVA so no
> > > > reverse lookup
> > > >
> > > > The advantage is that this can work for the case of vIOMMU. And they
> > > > should perform the same:
> > > >
> > > > 1) your method avoids the iova allocation per sg
> > > > 2) my method avoids the reverse lookup per sg
> > >
> > > It's somehow doable, but we are replacing a tree search with a linear
> > > insertion at this moment.
> > >
> > > I would say that the guest's IOVA -> qemu vaddr part works with no change
> > > for vIOMMU, since VirtQueue's virtqueue_pop already gives us the vaddr
> > > even in the case of vIOMMU.
> >
> > So in this case:
> >
> > 1) the listener gives us GPA->host IOVA (host IOVA is allocated per GPA)
>
> Right, that was a miss from my side, I think I get your point way better now.
>
> So now vhost-iova-tree translates GPA -> host IOVA in the vIOMMU case, and
> it is updated at the same frequency as guest physical memory hotplug
> / unplug (little during migration, I guess). There are special entries
> for the SVQ vrings, which the tree does not map with a GPA for obvious
> reasons, so you cannot locate them when looking up by GPA.

Yes.

> Let's assume too that only the SVQ vrings have been sent as IOMMU / IOTLB
> maps, with the relation host iova -> qemu's VA.
>
> > 2) virtqueue_pop gives us guest IOVA -> VA
> >
> > We still need extra logic to look up the vIOMMU to get the guest IOVA ->
> > GPA, then we can know the host IOVA.
>
> That's somehow right, but I think this does not need to be *another*
> search, insertion, etc. Please see below.
>
> > If we allocate after virtqueue_pop(), we can follow the same logic as
> > without vIOMMU. Just allocate a host IOVA and all is done.
> >
> > > The only change I would add for that case
> > > is the SVQ -> device map/unmapping part, so the device cannot access
> > > random addresses but only the exposed ones. I'm assuming that part is
> > > O(1).
> > >
> > > This way, we already have a tree with all the possible guest
> > > addresses, and we only need to look for its SVQ iova -> vaddr
> > > translation. This is an O(log(N)) operation,
> >
> > Yes, but it requires traversing the vIOMMU page table, which should be
> > slower than our own iova tree?
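As a worked illustration of the two-step lookup mentioned above for the
vIOMMU case (guest IOVA -> GPA through the vIOMMU, then GPA -> host IOVA
through the tree the memory listener fills), here is a small self-contained
C sketch. It is not QEMU code; the ranges, types and names are invented for
the example.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One mapping: [base, base + size) translates to target + offset. */
typedef struct {
    uint64_t base;
    uint64_t target;
    uint64_t size;
} Range;

static bool range_xlat(const Range *r, size_t n, uint64_t addr, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= r[i].base && addr < r[i].base + r[i].size) {
            *out = r[i].target + (addr - r[i].base);
            return true;
        }
    }
    return false;
}

int main(void)
{
    /* Step 1: guest IOVA -> GPA (what the vIOMMU page tables encode). */
    Range viommu[] = { { .base = 0xfe000000, .target = 0x40000000, .size = 0x100000 } };
    /* Step 2: GPA -> host IOVA (what the listener put in the tree). */
    Range tree[]   = { { .base = 0x40000000, .target = 0x2000, .size = 0x100000 } };

    uint64_t gpa, host_iova;
    if (range_xlat(viommu, 1, 0xfe000200, &gpa) &&
        range_xlat(tree, 1, gpa, &host_iova)) {
        printf("guest iova 0xfe000200 -> gpa 0x%llx -> host iova 0x%llx\n",
               (unsigned long long)gpa, (unsigned long long)host_iova);
    }
    return 0;
}

In QEMU itself the first step is what the vIOMMU/IOTLB lookup discussed just
below provides; the second step is the vhost-iova-tree lookup.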
> >
>
> The lookup over vIOMMU is not needed (to be performed twice), since
> virtqueue_pop already does it. We already have that data here, we just
> need to extract it.

For 'extract' do you mean fetching it from the IOMMU's IOTLB via
address_space_get_iotlb_entry()? Yes, it would be faster and probably O(1).

> Not saying that it is complicated, just saying that I
> didn't dedicate a lot of time to figure out how. The calltrace of it
> is:
>
> #0 address_space_translate_iommu
>    (iommu_mr, xlat, plen_out, page_mask_out, is_write, is_mmio,
>    target_as, attrs) at ../softmmu/physmem.c:418
> #1 flatview_do_translate
>    (fv, addr, xlat, plen_out, page_mask_out, is_write, is_mmio,
>    target_as, attrs) at ../softmmu/physmem.c:505
> #2 flatview_translate
>    (fv, addr, xlat, plen, is_write, attrs) at ../softmmu/physmem.c:565
> #3 address_space_map (as, addr, plen, is_write, attrs)
>    at ../softmmu/physmem.c:3183
> #4 dma_memory_map (as, addr, len, dir)
>    at /home/qemu/svq/include/sysemu/dma.h:202
> #5 virtqueue_map_desc
>    (vdev, p_num_sg, addr, iov, max_num_sg, is_write, pa, sz) at
>    ../hw/virtio/virtio.c:1314
> #6 virtqueue_split_pop (vq, sz) at ../hw/virtio/virtio.c:1488
>
> So with that GPA we can locate its corresponding entry in the
> vhost-iova-tree, in a read-only operation, O(log(N)). And the element's
> address in qemu's VA is not going to change until we mark it as used.
>
> This process (the whole call trace above) needs to be serialized somehow
> in qemu's memory system internals. I'm just assuming that it will be
> faster than the one we can do in SVQ with little effort, and it will
> help to reduce duplication. If that is not the case, I think it is even
> more beneficial to improve it than to reinvent it in SVQ.

I think so.

Thanks

> After that, an iommu map needs to be sent to the device, as (qemu's
> iova obtained from the tree, qemu's VA, length, ...). We may even
> batch them. Another option is to wait for the miss(), but I think that
> would be a waste of resources.
>
> The reverse is also true with the unmapping: when we see a used
> descriptor, IOTLB unmap(s) will be sent before sending the descriptor to
> the guest as used.
>
> > > and read only, so it's
> > > easily parallelizable when we make each SVQ in its own thread (if
> > > needed).
> >
> > Yes, this is because the host IOVA was allocated before by the memory
> > listener.
>
> Right.
>
> > > The only thing left is to expose that with an iommu miss
> > > (O(1)) and unmap it on used buffer processing (also O(1)). The
> > > dominant operation keeps being VirtQueue's own code lookup for
> > > guest's IOVA -> GPA, which I'm assuming is already well optimized and
> > > will benefit from future optimizations since qemu's memory system is
> > > frequently used.
> > >
> > > To optimize your use case we would need to add a custom (and smarter
> > > than the currently used) allocator to SVQ. I've been looking for ways
> > > to reuse glibc or similar in our own arenas but with no luck. It will
> > > be code that SVQ needs to maintain by and for itself anyway.
> >
> > The benefit is to have iova allocation separate from the tree.
> >
> > > In either case it should not be hard to switch to your method, just a
> > > few call changes in the future, if we achieve a faster allocator.
> > >
> > > Would that make sense?
> >
> > Yes, feel free to choose any method you wish or feel is simpler in the
> > next series.
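A rough sketch of the map/unmap lifecycle described above, under my own
assumption (not stated in the series) that maps for newly available buffers
can be queued and flushed as one batch before the device is kicked, and that
the matching unmap is sent when the descriptor comes back used. The printf()
calls only stand in for the real device IOTLB update messages; every name is
hypothetical.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t vaddr;
    uint64_t len;
} PendingMap;

#define MAX_BATCH 32
static PendingMap batch[MAX_BATCH];
static size_t batch_len;

/* Queue a map for a buffer that was just made available. */
static void queue_map(uint64_t iova, uint64_t vaddr, uint64_t len)
{
    if (batch_len < MAX_BATCH) {
        batch[batch_len++] = (PendingMap){ iova, vaddr, len };
    }
}

/* Flush all queued maps in one go, before the device is notified. */
static void flush_maps(void)
{
    for (size_t i = 0; i < batch_len; i++) {
        printf("map: iova=0x%llx vaddr=0x%llx len=0x%llx\n",
               (unsigned long long)batch[i].iova,
               (unsigned long long)batch[i].vaddr,
               (unsigned long long)batch[i].len);
    }
    batch_len = 0;
}

/* Called on a used descriptor, before returning it to the guest. */
static void send_unmap(uint64_t iova, uint64_t len)
{
    printf("unmap: iova=0x%llx len=0x%llx\n",
           (unsigned long long)iova, (unsigned long long)len);
}

int main(void)
{
    queue_map(0x1000, 0x7f0000000000ULL, 0x1000); /* buffers made available */
    queue_map(0x2000, 0x7f0000001000ULL, 0x1000);
    flush_maps();                                 /* then kick the device */
    send_unmap(0x1000, 0x1000);                   /* descriptor came back used */
    return 0;
}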
> >
> > > > > > >
> > > > > > > At this point all the limits are copied to the vhost iova tree in the
> > > > > > > next revision I will send, defined at its creation at
> > > > > > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > > > > > to the latter at allocation time.
> > > > > > >
> > > > > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), which wraps
> > > > > > > the iova_tree_alloc() [1], the limits could be kept in vhost-vdpa and
> > > > > > > made an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > > > > > what you are proposing or I'm missing something.
> > > > > >
> > > > > > If the reverse translation is only used in iova allocation, I meant to
> > > > > > split the logic of IOVA allocation itself.
> > > > >
> > > > > Still don't understand it, sorry :). In SVQ setup we allocate an iova
> > > > > address for every GPA address the guest's driver can use. After that
> > > > > there should be no allocation unless memory is hotplugged.
> > > > >
> > > > > So the limits are only needed precisely at allocation time. Not sure
> > > > > if that is what you mean here, but to first allocate and then check if
> > > > > it is within the range could lead to false negatives, since there
> > > > > could be a valid range *in* the address space but the iova allocator
> > > > > returned us another range that fell outside the limits. How could we
> > > > > know the cause if it is not using the range itself?
> > > >
> > > > See my above reply. And we can teach the iova allocator to return the
> > > > IOVA in the range that vhost-vDPA supports.
> > >
> > > Ok,
> > >
> > > For the next series it will be that way. I'm pretty sure we are
> > > aligned in this part, but the lack of code in this series makes it
> > > very hard to discuss it :).
> >
> > Fine. Let's see.
> >
> > Thanks
> >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > > > > Either way, I think it is harder to talk about this specific case
> > > > > > > without code, since this one still does not address the limits. Would
> > > > > > > you prefer me to send another RFC in WIP quality, with *not* all
> > > > > > > comments addressed? I would say that there is not a lot of pending
> > > > > > > work to send the next one, but it might be easier for all of us.
> > > > > >
> > > > > > I'd prefer to try to address them all, otherwise it's not easy to see
> > > > > > what is missing.
> > > > >
> > > > > Got it, I will do it that way then!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1] This util/iova-tree method will be proposed in the next series,
> > > > > > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > > > > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > > > > > translate available buffers.
> > > > > > >
> > > > > > > > Thanks
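To illustrate "teach the iova allocator to return the IOVA in the range that
vhost-vDPA supports", here is a self-contained C sketch of a first-fit
allocator that only hands out ranges inside [iova_first, iova_last], so there
is no separate allocate-then-check step that could produce the false
negatives discussed above. It is not the series' allocator; the names and the
flat array are just for illustration (a real implementation would keep the
allocated ranges in a tree).

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t size;
} Alloc;

#define MAX_ALLOCS 64

typedef struct {
    uint64_t iova_first, iova_last;  /* limits reported by the vdpa device */
    Alloc allocs[MAX_ALLOCS];        /* kept sorted by iova */
    size_t n;
} IovaAllocator;

/* First-fit search for a hole of 'size' bytes inside the supported range.
 * Assumes iova_last < UINT64_MAX so iova_last + 1 does not wrap. */
static bool iova_alloc(IovaAllocator *a, uint64_t size, uint64_t *iova_out)
{
    uint64_t hole_start = a->iova_first;

    if (a->n == MAX_ALLOCS) {
        return false;
    }
    for (size_t i = 0; i <= a->n; i++) {
        uint64_t hole_end = (i < a->n) ? a->allocs[i].iova : a->iova_last + 1;

        if (hole_end - hole_start >= size) {
            /* Insert the new range, keeping the array sorted by iova. */
            for (size_t j = a->n; j > i; j--) {
                a->allocs[j] = a->allocs[j - 1];
            }
            a->allocs[i] = (Alloc){ hole_start, size };
            a->n++;
            *iova_out = hole_start;
            return true;
        }
        if (i < a->n) {
            hole_start = a->allocs[i].iova + a->allocs[i].size;
        }
    }
    return false; /* no hole inside [iova_first, iova_last] */
}

int main(void)
{
    IovaAllocator a = { .iova_first = 0x1000, .iova_last = 0xFFFFF };
    uint64_t iova;

    for (int i = 0; i < 3 && iova_alloc(&a, 0x4000, &iova); i++) {
        printf("allocated iova 0x%llx\n", (unsigned long long)iova);
    }
    return 0;
}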