From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Duyck
Subject: Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
Date: Fri, 30 Oct 2015 11:04:24 -0700
Message-ID: <5633B128.8010708@gmail.com>
References: <1445445464-5056-1-git-send-email-tianyu.lan@intel.com>
 <562A7E33.4080800@gmail.com> <562DBBC9.4000104@intel.com>
 <562E40BB.6040404@gmail.com> <5631B8CA.9040805@intel.com>
 <5631C3A7.2070900@gmail.com> <5631D9C2.2040206@intel.com>
 <56324682.5060507@gmail.com> <5632D8C8.3010101@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
To: Lan Tianyu, bhelgaas@google.com, carolyn.wyborny@intel.com,
 donald.c.skidmore@intel.com, eddie.dong@intel.com, nrupal.jani@intel.com,
 yang.z.zhang@intel.com, agraf@suse.de, kvm@vger.kernel.org,
 pbonzini@redhat.com, qemu-devel@nongnu.org, emil.s.tantilov@intel.com,
 intel-wired-lan@lists.osuosl.org, jeffrey.t.kirsher@intel.com,
 jesse.brandeburg@intel.com, john.ronciak@intel.com,
 linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
 matthew.vick@intel.com, mitch.a.williams@intel.com, netdev@vger.kernel.org,
 shannon.nelson@intel.com
Return-path:
In-Reply-To: <5632D8C8.3010101@intel.com>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

On 10/29/2015 07:41 PM, Lan Tianyu wrote:
> On October 30, 2015 at 00:17, Alexander Duyck wrote:
>> On 10/29/2015 01:33 AM, Lan Tianyu wrote:
>>> On October 29, 2015 at 14:58, Alexander Duyck wrote:
>>>> Your code was having to do a bunch of shuffling in order to get
>>>> things set up so that you could bring the interface back up. I
>>>> would argue that it may actually be faster, at least on the
>>>> bring-up, to just drop the old rings and start over, since it
>>>> greatly reduces the complexity and the amount of device-related
>>>> data that has to be moved.
>>> If we give up the old ring after migration and keep DMA running
>>> before stopping the VCPU, it seems we don't need to track the Tx/Rx
>>> descriptor rings, and we just need to make sure that all Rx buffers
>>> delivered to the stack have been migrated.
>>>
>>> 1) Dummy write the Rx buffer before checking the Rx descriptor to
>>> ensure the packet is migrated first.

>> Don't dummy write the Rx descriptor. You should only really need to
>> dummy write the Rx buffer, and you would do so after checking the
>> descriptor, not before. Otherwise you risk corrupting the Rx buffer,
>> because it is possible for you to read the Rx buffer, have DMA occur,
>> and then write back the Rx buffer, and now you have corrupted the
>> memory.
>>
>>> 2) Make a copy of the Rx descriptor and then use the copied data to
>>> check the buffer status. Don't use the original descriptor, because
>>> it won't be migrated and migration may happen between two accesses
>>> of the Rx descriptor.

>> Do not just blindly copy the Rx descriptor ring. That is a recipe
>> for disaster. The problem is that DMA has to happen in a very
>> specific order for things to function correctly. The Rx buffer has
>> to be written, and then the Rx descriptor. The problem is you will
>> end up getting a read-ahead on the Rx descriptor ring regardless of
>> which order you dirty things in.
>
> Sorry, I didn't say it clearly.
> I meant to copy one Rx descriptor when receiving an Rx irq and
> handling the Rx ring.

No, I understood what you are saying. My explanation was that it will
not work.

> The current code in ixgbevf_clean_rx_irq() checks the status of the
> Rx descriptor to see whether its Rx buffer has been populated with
> data, and then reads the packet length from the Rx descriptor to
> handle the Rx buffer.

That part you have correct. However, there are very explicit rules
about the ordering of the reads.

> My idea is to do the following three steps when receiving an Rx
> buffer in ixgbevf_clean_rx_irq().
>
> (1) dummy write the Rx buffer first,

You cannot dummy write the Rx buffer without first being given
ownership of it. In the driver this is handled in two phases. First
we have to read the DD bit to see if it is set. If it is, we can take
ownership of the buffer. Second, we have to do either a
dma_sync_range_for_cpu or a dma_unmap_page call so that we can
guarantee the data has been moved to the buffer by the DMA API and
that it knows it should no longer be accessing it.

> (2) make a copy of its Rx descriptor

This is not advisable. Unless you can guarantee you are going to read
the descriptor only after the DD bit is set, you cannot guarantee that
you won't race with device DMA. The problem is you could have the
migration occur right in the middle of (2). If that occurs, then you
will have valid status bits, but the rest of the descriptor would be
invalid data.

> (3) Check the buffer status and get the length from the copy.

I believe this is the assumption that is leading you down the wrong
path. You would have to read the status before you could do the copy.
You cannot do it after.

> Migration may happen at any time.
> It may happen between (1) and (2). If the Rx buffer has been
> populated with data, the VF driver will not know that on the new
> machine because the Rx descriptor isn't migrated. But it's still
> safe.

The part I think you are not getting is that DMA can occur between (1)
and (2). So if, for example, you were doing your dummy write while DMA
was occurring: you pull in your value, DMA occurs, you write your
value, and now you have corrupted an Rx frame by writing stale data
back into it.

> It may happen between (2) and (3). The copy will be migrated to the
> new machine and the Rx buffer is migrated first. If there is data in
> the Rx buffer, the VF driver can still handle the buffer without
> migrating the Rx descriptor.
>
> The next buffers will be ignored since we don't migrate the Rx
> descriptors for them.
> Their status will not be completed on the new machine.

You have kind of lost me on this part. Why do you believe their
statuses will not be completed? How are you going to prevent the Rx
descriptor ring from being migrated? It will be a dirty page by virtue
of the fact that it is a bidirectional DMA mapping, where the Rx path
provides new buffers and writes those addresses in while the device is
writing the status bits and length back. This is kind of what I was
getting at. The Rx descriptor ring will show up as one of the dirtiest
spots in the driver, since it is constantly being overwritten by the
CPU in ixgbevf_alloc_rx_buffers.

Anyway, we are kind of getting sidetracked, and I really think the
solution you have proposed is a dead end.

What we have to do is come up with a solution that can deal with the
fact that you are racing against two different entities. You have to
avoid racing with the device, while at the same time you have to avoid
racing with the dirty page migration code. There are essentially two
problems you have to solve.

1. Rx pages handed off to the stack must be marked as dirty.
For now your code seemed to address this via this snippet below from
patch 12/12:

> @@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
>  {
>  	struct ixgbevf_rx_buffer *rx_buffer;
>  	struct page *page;
> +	u8 *page_addr;
>  
>  	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
>  	page = rx_buffer->page;
>  	prefetchw(page);
>  
> -	if (likely(!skb)) {
> -		void *page_addr = page_address(page) +
> -				  rx_buffer->page_offset;
> +	/* Mark page dirty */
> +	page_addr = page_address(page) + rx_buffer->page_offset;
> +	*page_addr = *page_addr;
>  
> +	if (likely(!skb)) {
>  		/* prefetch first cache line of first page */
>  		prefetch(page_addr);
>  #if L1_CACHE_BYTES < 128

It will work for now as a proof of concept, but I really would prefer
to see a solution that is driver agnostic. Maybe something that could
take care of it in the DMA API. For example, if you were to use
"swiotlb=force" in the guest, this code wouldn't even be necessary,
since that forces bounce buffers, which would mean your DMA mappings
are dirty pages anyway.

2. How to deal with a device that might be in the middle of an
interrupt routine when you decide to migrate. This is the bit I think
you might be focusing on a bit too much, and the current solutions you
have proposed will result in Rx data corruption in the generic case,
even without migration. There are essentially two possible solutions
that you could explore.

2a. Have a VF device that is aware something is taking place and have
it yield via something like a PCI hot-plug pause request. I don't know
if the Linux kernel supports something like that now, since pause
support in the OS is optional in the PCI hot-plug specification, but
essentially it would be a request to do a PM suspend. You would issue
a hot-plug pause and know it is completed by the fact that the PCI Bus
Master bit is cleared in the VF.
Then you complete the migration, and in the new guest you could issue
a hot-plug event to restart operation.

2b. Come up with some sort of pseudo-IOMMU interface the VF has to use
to map DMA, and provide an interface to quiesce the devices attached
to the VM so that DMA can no longer occur. Once you have disabled bus
mastering on the VF, you could then go through and migrate all
DMA-mapped pages. As far as resuming on the other side, you would
somehow need to poke the VF to get it to realize the rings are no
longer initialized and the mailbox is out of sync. Once that happens,
the VF could reset and resume operation.