From mboxrd@z Thu Jan 1 00:00:00 1970
From: narayan.ayalasomayajula@kazan-networks.com (Narayan Ayalasomayajula)
Date: Thu, 15 Sep 2016 13:36:21 +0000
Subject: Failure with 8K Write operations
In-Reply-To: <1473871944.2781.90.camel@linux.intel.com>
References: <1473810683.2781.72.camel@linux.intel.com>
 <1473871944.2781.90.camel@linux.intel.com>
Message-ID: 

Hi Jay,

Thanks for pointing out that I was not running the latest version of the
kernel. I updated to 4.8-rc6, and my FIO test that had previously failed
with the Linux NVMeF target (using a null_blk device as the target) is now
completing successfully. I am still seeing the same NAK (Remote Access
Error) failure when I use our target instead. I will debug this further,
but updating to 4.8-rc6 did improve things. (Sagi, thanks for the print
statement that displays what the driver is passing down to the RNIC. I
will use that change in 4.8-rc6 to debug further.)

Thanks,
Narayan

-----Original Message-----
From: J Freyensee [mailto:james_p_freyensee@linux.intel.com]
Sent: Wednesday, September 14, 2016 9:52 AM
To: Narayan Ayalasomayajula <narayan.ayalasomayajula at kazan-networks.com>; Sagi Grimberg <sagi at grimberg.me>; linux-nvme at lists.infradead.org
Subject: Re: Failure with 8K Write operations

On Wed, 2016-09-14 at 00:03 +0000, Narayan Ayalasomayajula wrote:
> Hi Jay,
> 
> Thanks for taking the effort to emulate the behavior.
> 
> I did not mention this in my last email, but I had indicated it in the
> earlier posting: I am using null_blk as the target (so the IOs are not
> being serviced by a real NVMe target). I am not sure if that could
> somehow be the catalyst for the failure. Is it possible for you to
> re-run your test with null_blk as the target?

As we talked off-line, try the latest mainline kernel from kernel.org and
see if you see anything different.

> 
> Thanks,
> Narayan
> 
> -----Original Message-----
> From: J Freyensee [mailto:james_p_freyensee at linux.intel.com]
> Sent: Tuesday, September 13, 2016 4:51 PM
> To: Narayan Ayalasomayajula <narayan.ayalasomayajula at kazan-networks.com>;
> Sagi Grimberg <sagi at grimberg.me>; linux-nvme at lists.infradead.org
> Subject: Re: Failure with 8K Write operations
> 
> On Tue, 2016-09-13 at 20:04 +0000, Narayan Ayalasomayajula wrote:
> > 
> > Hi Sagi,
> > 
> > Thanks for the print statement to verify that the SGLs in the
> > command capsule match what the Host programmed. I added this print
> > statement, compared the Virtual Address and R_Key information in
> > /var/log against the NVMe commands in the trace file, and found the
> > two to match. I have the trace and Host log files from this failure
> > (the trace is ~6 MB); will they be useful for someone who may be
> > looking into this issue?
> > 
> > Regarding the host-side log information you mentioned, I had
> > attached that in my prior email (attached again). Is this what you
> > are requesting? That was collected prior to adding the print
> > statement that you suggested.
> > 
> > Just to summarize, the failure is seen in the following
> > configuration:
> > 
> > 1. Host is an 8-core Ubuntu server running the 4.8.0 kernel. It has
> >    a ConnectX-4 RNIC (1x100G) and is connected to a Mellanox switch.
> > 2. Target is an 8-core Ubuntu server running the 4.8.0 kernel. It
> >    has a ConnectX-3 RNIC (1x10G) and is connected to a Mellanox
> >    switch.
> > 3. The switch has normal Pause and Jumbo frame support enabled on
> >    all ports.
> > 4. The test fails with the Host sending a NAK (Remote Access Error)
> >    for the following FIO workload:
> > 
> > [global]
> > ioengine=libaio
> > direct=1
> > runtime=10m
> > size=800g
> > time_based
> > norandommap
> > group_reporting
> > bs=8k
> > numjobs=8
> > iodepth=16
> > 
> > [rand_write]
> > filename=/dev/nvme0n1
> > rw=randwrite
> 
> Hi Narayan:
> 
> I have a data network of 2 Hosts and 2 1RU Target servers connected
> through a 32x Arista switch, and using your FIO setup above I am not
> seeing any errors. I tried running your script on each Host at the
> same time targeting the same NVMe Target (but with different SSDs
> targeted by each Host), as well as running the script on only one
> Host, and didn't see any errors. I also tried 'numjobs=1' and didn't
> reproduce what you see.
> 
> Both the Hosts and Targets for me are using the 4.8-rc4 kernel. Both
> the Hosts and Targets use dual-port Mellanox ConnectX-3 Pro EN 40Gb
> adapters (so I'm using a RoCE setup). My Hosts are 32-processor
> machines and my Targets are 28-processor machines, all filled with
> various Intel SSDs.
> 
> There must be something unique about your setup.
> 
> Jay
> 
> > 
> > I have found that the failure happens with numjobs set to 1 as well.
> > 
> > Thanks again for your response,
> > Narayan
> > 
> > -----Original Message-----
> > From: Sagi Grimberg [mailto:sagi at grimberg.me]
> > Sent: Tuesday, September 13, 2016 2:16 AM
> > To: Narayan Ayalasomayajula <narayan.ayalasomayajula at kazan-networks.com>;
> > linux-nvme at lists.infradead.org
> > Subject: Re: Failure with 8K Write operations
> > 
> > > Hello All,
> > 
> > Hi Narayan,
> > 
> > > I am running into a failure with the 4.8.0 branch and wanted to
> > > see if this is a known issue or whether there is something I am
> > > not doing right in my setup/configuration. The issue that I am
> > > running into is that the Host indicates a NAK (Remote Access
> > > Error) condition when executing an FIO script that performs 100%
> > > 8K Write operations. Trace analysis shows that the target has the
> > > expected Virtual Address and R_KEY values in the READ REQUEST,
> > > but for some reason the Host flags the request as an access
> > > violation. I ran a similar test with iWARP Host and Target
> > > systems and did see a Terminate followed by a FIN from the Host.
> > > The cause of both failures appears to be the same.
> > 
> > I cannot reproduce what you are seeing on my setup (Steve, can
> > you?) I'm running 2 VMs connected over SR-IOV on the same PC,
> > though...
> > 
> > Can you share the log on the host side?
> > 
> > Can you also add this print to verify that the host driver
> > programmed the same SGL as it sent to the target:
> > --
> > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> > index c2c2c28e6eb5..248fa2e5cabf 100644
> > --- a/drivers/nvme/host/rdma.c
> > +++ b/drivers/nvme/host/rdma.c
> > @@ -955,6 +955,9 @@ static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
> >          sg->type = (NVME_KEY_SGL_FMT_DATA_DESC << 4) |
> >                          NVME_SGL_FMT_INVALIDATE;
> >  
> > +        pr_err("%s: rkey=%#x iova=%#llx length=%#x\n",
> > +                __func__, req->mr->rkey, req->mr->iova, req->mr->length);
> > +
> >          return 0;
> >  }
> > --
> > _______________________________________________
> > Linux-nvme mailing list
> > Linux-nvme at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme
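
For readers unfamiliar with the failing primitive: a NAK (Remote Access
Error) is the response generated by the responder's HCA when an incoming
RDMA READ carries an rkey, virtual address, or length that does not match
a memory region registered with remote-read permission. Below is a minimal
userspace-verbs sketch of the two halves of the exchange being debugged in
this thread; it is illustrative only. The kernel host driver actually
registers memory with FRWR inside nvme_rdma_map_sg_fr (the function
patched above), not with ibv_reg_mr, and the function names
expose_for_remote_read() and pull_write_data() are hypothetical.

--
/* Hypothetical sketch (not from the thread): the verbs-level analogue of
 * the NVMe-oF Write data path.  The Host exposes its data buffer for
 * remote read and sends rkey/iova/length in the command capsule's SGL;
 * the Target then posts an RDMA READ with those values.  If they do not
 * match a valid registration, the Host's HCA answers the READ request
 * with a NAK (Remote Access Error).
 */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Host side: register the Write data buffer so the Target may READ it.
 * mr->rkey, mr->addr, and mr->length are what travel in the capsule
 * (the values printed by the debug patch above). */
static struct ibv_mr *expose_for_remote_read(struct ibv_pd *pd,
                                             void *buf, size_t len)
{
        /* Omitting IBV_ACCESS_REMOTE_READ here is one classic way to
         * provoke exactly this NAK. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

/* Target side: pull the Write data using the rkey/iova from the capsule.
 * A stale rkey, or an [iova, iova + length) window falling outside the
 * registered region, is NAKed by the responder (the Host). */
static int pull_write_data(struct ibv_qp *qp, struct ibv_mr *local_mr,
                           uint64_t iova, uint32_t rkey, uint32_t length)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)local_mr->addr,
                .length = length,
                .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
                .wr_id      = 1,
                .sg_list    = &sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_RDMA_READ,
                .send_flags = IBV_SEND_SIGNALED,
                .wr.rdma    = { .remote_addr = iova, .rkey = rkey },
        };
        struct ibv_send_wr *bad_wr;

        return ibv_post_send(qp, &wr, &bad_wr);
}
--

Comparing the rkey/iova/length printed on the Host against the fields of
the READ REQUEST in the wire trace, as done earlier in the thread, narrows
the fault to whichever side disagrees with the registration.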