From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexandre Depoutovitch
Subject: Re: [PATCH RFC v2] Performing direct I/O on sector-aligned requests
Date: Tue, 15 May 2012 11:50:05 -0700 (PDT)
Message-ID: <1862886070.5701097.1337107805033.JavaMail.root@zimbra-prod-mbox-3.vmware.com>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
To: Jeff Moyer
Return-path:
Received: from smtp-outbound-2.vmware.com ([208.91.2.13]:38581 "EHLO
	smtp-outbound-2.vmware.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932419Ab2EOSuF (ORCPT ); Tue, 15 May 2012 14:50:05 -0400
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

----- Original Message -----
> From: "Jeff Moyer"
> To: "Alexandre Depoutovitch"
> Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
> Sent: Monday, April 30, 2012 2:22:32 PM
> Subject: Re: [PATCH RFC v2] Performing direct I/O on sector-aligned requests
>
> "Alexandre Depoutovitch" writes:
>
> > NFS daemons always perform buffered IO on files. As a result, write
> > requests that are not aligned on a file system block boundary take
> > about 15 times longer to complete than the same writes that are file
> > system block aligned. This patch fixes the problem by analyzing the
> > alignment of the IO request that comes to the NFS daemon and using the
> > Direct I/O mechanism when all of the following are true:
> > 1. Request is not aligned on a file system block boundary
> > 2. Request is aligned on an underlying block device's sector boundary.
> > 3. Request size is a multiple of the sector size.
>
> Why not add:
>
> 4. Request is a write
>
> ?  There is no read-modify-write cycle for a read.

Yes, this makes sense. Thanks.

> > In order to test the patch, the following has been done: I deployed two
> > Linux machines with a 3.0 kernel and my modifications. One acted as an
> > NFS server, the other acted as an NFS client. The NFS volume was
> > mounted in sync mode.
> > The number of NFS daemons was increased to 64 in order to have a higher
> > chance of catching concurrency issues. The volume was formatted with the
> > ext4 file system and was located on a hardware RAID10 array of eight 10K
> > 450GB SAS drives. The RAID adapter was an HP P410i.
> >
> > 1. During the first set of experiments, the client machine created a
> > 200 GB file by writing to it. Then it performed the following access
> > patterns:
> > Read, random (4K)
> > Write, random (4K)
> > Read, sequential (4K)
> > Write, sequential (4K)
> > Read, sequential (4K, first access at 512 offset)
> > Write, sequential (4K, first access at 512 offset)
> > Read, sequential (32K)
> > Write, sequential (32K)
> > Read, sequential (32K, first access at 512 offset)
> > Write, sequential (32K, first access at 512 offset)
> > Read, sequential (256K)
> > Write, sequential (256K)
> > All accesses were done while keeping 64 outstanding IO requests on the
> > client. I compared the performance of the above patterns on vanilla
> > Linux and on Linux with my changes. All numbers (IOPS, latency) were
> > the same in all cases except for random writes, where the IOPS increase
> > was 14 times.
>
> So you only tested the case where the file already exists on the file
> system, is that right?  It would be a good idea to also test workloads
> that create files.  In much the same vein, it would be a good idea to
> run specsfs or another industry standard benchmark.

I will do this.
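Coming back to the write-only point: with that restriction folded in, the
decision the patch makes boils down to the sketch below. This is an
illustration only, not the code from the patch; the function and parameter
names are made up, and sector_size/fs_block_size are assumed to be powers
of two (the logical sector size of the underlying device and the file
system block size).

	/*
	 * Illustration of the decision only -- not the patch itself.
	 * Returns 1 when the unaligned-write direct path should be used:
	 * the request is a write, its offset and length are both whole
	 * multiples of the device sector size, and it is NOT already
	 * aligned to the file system block size (such writes stay
	 * buffered, since they need no read-modify-write cycle anyway).
	 */
	static int wants_unaligned_direct_io(int is_write, loff_t offset,
					     unsigned long size,
					     unsigned int sector_size,
					     unsigned int fs_block_size)
	{
		if (!is_write)
			return 0;	/* no RMW cycle for reads */
		if ((offset | size) & (sector_size - 1))
			return 0;	/* not sector aligned, or partial sector */
		if (((offset | size) & (fs_block_size - 1)) == 0)
			return 0;	/* fs-block aligned, buffered is fine */
		return 1;
	}

For example, a 4K write at offset 512 on a 4K-block ext4 passes all of the
checks, while the same write at offset 0 stays on the buffered path.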
> > In addition, I have done several correctness tests.
> >
> > 2. Allocated three 200GB files using (a) explicit writes to a file,
> > (b) the fallocate() system call, (c) seeking to the end of the file and
> > writing one sector there.
> > Then I did random and sequential writes to the files. After that, I
> > verified that the files were indeed modified and contained the latest
> > data. The test for each file ran for 2 hours.
> >
> > 3. Allocated a 200GB file and started sequential reads to trigger the
> > read-ahead mechanism. Every 100 read operations, one file system
> > unaligned write immediately after the current read position was
> > requested in order to trigger a direct write. After that, reading
> > continued. All writes contained a predefined value, so that the reads
> > could check for it. I did this in order to be sure that a direct write
> > correctly invalidates data that is already in the in-memory cache.
> >
> >
> > The current implementation performs synchronous direct I/O and may
> > trigger higher latencies when the NFS volume is mounted in asynchronous
> > mode. In
>
> It may also cause higher latencies for synchronously mounted file
> systems.  It's never really a good idea to mix buffered and direct I/O.
> In addition to exposing odd race conditions (which we think don't exist
> anymore), each time you decide to perform direct I/O, you're going to
> invalidate the page cache for the range of the file under I/O.

If there is an unknown bug in the Direct I/O code, then sure, there will
be a problem. But I specifically tested for many hours trying to trigger
unknown race conditions (tests 2 and 3). As for cache invalidation, this
can cause some performance degradation only when the workload writes and
then reads data that is cached in memory, which is unlikely for large,
randomly accessed data sets. For small data sets, I suggest direct I/O
only as an optimization mode that must be explicitly turned on by the
user.

> > order to avoid it, as per Trond Myklebust's suggestion, the iov_iter
> > interface with asynchronous reads and writes can be used. This is why,
> > currently, Direct I/O can be enabled or disabled at boot or run time,
> > without an NFS server restart, through the /proc/fs/nfsd/direct_io
> > node.
>
> Not sure I understand that last part, but I really think you want to
> layer this work on top of Dave Kleikamp's patch set:
>   Subject: loop: Issue O_DIRECT aio using bio_vec

Unless I got it wrong, Dave Kleikamp's work allows asynchronous I/O from
the kernel. In the proposed NFS direct I/O patch, all I/O is done
synchronously (as it is done now). What benefit would we get from Dave's
patch?

> Also, instead of just telling us this is better, it would be good to
> provide the benchmarks you ran and the raw results.
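The raw numbers are below. Since the cache invalidation point keeps coming
up, here is roughly the shape of the client-side loop in test 3 above. It
is a simplified illustration, not the actual harness; the path, sizes and
pattern are made up:

	/*
	 * Simplified sketch of test 3. Sequential 4K reads keep read-ahead
	 * busy; every 100 reads, a sector-aligned but fs-block-unaligned
	 * 512-byte write with a known pattern is issued just ahead of the
	 * reader, and the next read over that range must return the
	 * pattern rather than stale cached data.
	 */
	#include <sys/types.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define RD_SZ 4096
	#define WR_SZ 512

	int main(void)
	{
		int fd = open("/mnt/nfs/bigfile", O_RDWR);	/* path is made up */
		char rbuf[RD_SZ], wbuf[WR_SZ];
		off_t pos = 0, expect_at = -1;
		unsigned long reads = 0;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		memset(wbuf, 0xAB, sizeof(wbuf));	/* the predefined value */

		while (pread(fd, rbuf, RD_SZ, pos) == RD_SZ) {
			if (expect_at >= pos && expect_at + WR_SZ <= pos + RD_SZ &&
			    memcmp(rbuf + (expect_at - pos), wbuf, WR_SZ) != 0) {
				fprintf(stderr, "stale data at offset %lld\n",
					(long long)expect_at);
				return 1;
			}
			pos += RD_SZ;
			if (++reads % 100 == 0) {
				/* 512-aligned, 4K-unaligned write just ahead
				   of the current read position */
				expect_at = pos + WR_SZ;
				if (pwrite(fd, wbuf, WR_SZ, expect_at) != WR_SZ) {
					perror("pwrite");
					return 1;
				}
			}
		}
		close(fd);
		return 0;
	}

The write offset is always 512-byte aligned but never 4K aligned, so on the
server it takes the direct path, and the read that follows must see the new
data.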
                                 Buffered                    NFS Direct I/O for unaligned   Difference
                                 IOPS   MB/s  Latency (ms)   IOPS   MB/s  Latency (ms)
Read, random (4K OIO=64)          1670   6.5    38            1700   6.8    37                  4.62%
Write, random (4K OIO=64)          150   0.6   500            2200   8.5    29               1300.00%
Read, sequential (4K OIO=64)     22000    87     2.8         22000    86     2.9               -1.15%
Write, sequential (4K OIO=64)    10000    40     6           11000    42     6                  5.00%
Read, sequential (32K OIO=1)      1900    59     0.5          2000    62     0.5                5.08%
Write, sequential (32K OIO=1)     1100    35     0.9          1100    35     0.9                0.00%
Read, sequential (32K OIO=64)     5000   156    13            5100   160    12                  2.56%
Write, sequential (32K OIO=64)    5700   180    11            5600   175    11                 -2.78%
Read, sequential (256K OIO=64)     560   140   110             550   140   120                  0.00%
Write, sequential (256K OIO=64)    580   150   110             600   150   100                  0.00%

> I've commented on random sections of the code below (not at all a full
> review, just stuff that jumped out at me).

Thanks!

> > diff -uNr linux-orig/fs/direct-io.c
> > linux-3.0.7-0.7.2.8796.vmw/fs/direct-io.c
> > --- linux-orig/fs/direct-io.c	2011-10-24 14:06:32.000000000 -0400
> > +++ linux-3.0.7-0.7.2.8796.vmw/fs/direct-io.c	2012-04-25 16:34:30.000000000 -0400
> > @@ -152,11 +152,30 @@
>
> Please use the '-p' switch to diff.

Will do.

> >  	int nr_pages;
> >
> >  	nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES);
> > -	ret = get_user_pages_fast(
> > -		dio->curr_user_address,		/* Where from? */
> > -		nr_pages,			/* How many pages? */
> > -		dio->rw == READ,		/* Write to memory? */
> > -		&dio->pages[0]);		/* Put results here */
> > +
> > +	if (current->mm) {
> > +		ret = get_user_pages_fast(
> > +			dio->curr_user_address,	/* Where from? */
> > +			nr_pages,		/* How many pages? */
> > +			dio->rw == READ,	/* Write to memory? */
> > +			&dio->pages[0]);	/* Put results here */
> > +	} else {
> > +		/* For kernel threads mm is NULL, so all we need is to increment
> > +		   page's reference count and add page to dio->pages array */
> > +		int i;
> > +		struct page* page;
> > +		unsigned long start_pfn = virt_to_phys((void *)dio->curr_user_address)
>
> Your mailer is line-wrapping (and maybe munging white space in other
> ways).  Also, is this a patch against 3.0?  Please update your sources
> next time.

I will.

> > +/*
> > +  Copies data between two iovec arrays. Individual array elements might have
> > +  different sizes, but total size of data described by the two arrays must
> > +  be the same
> > +*/
> > +static int nfsd_copy_iovec(const struct iovec* svec, const unsigned int scount,
> > +		struct iovec* dvec, const unsigned int dcount, size_t size)
> > +{
>
> Another data copy?  Ouch.

I could not find any easy way to avoid it. The copy is done only when
direct I/O is involved.

> > +static int can_use_direct_io(const struct file *file,
> > +		const struct super_block* sb,
> > +		const loff_t offset, const unsigned long size) {
> > +	unsigned int blkbits = 0;
> > +	struct inode *inode;
> > +	unsigned int fsblkbits = 0;
> > +	unsigned int alignment = io_alignment(offset, size);
> > +
> > +	if (alignment == 0)
> > +		return 0;
> > +
> > +	if (file == NULL && sb == NULL)
> > +		return 0;
> > +
> > +	if (nfsd_directio_mode == DIO_NEVER)
> > +		return 0;
>
> This check should be first, so we don't have to do alignment checks when
> this is disabled.

I will change this.
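That is, the beginning of the function would become roughly the following
(just reordering the checks that are already there, and moving the
io_alignment() call down so it is not evaluated at all when the mode is
DIO_NEVER):

	static int can_use_direct_io(const struct file *file,
			const struct super_block* sb,
			const loff_t offset, const unsigned long size) {
		unsigned int blkbits = 0;
		struct inode *inode;
		unsigned int fsblkbits = 0;
		unsigned int alignment;

		/* cheap policy and sanity checks first */
		if (nfsd_directio_mode == DIO_NEVER)
			return 0;

		if (file == NULL && sb == NULL)
			return 0;

		/* only then do the alignment computation */
		alignment = io_alignment(offset, size);
		if (alignment == 0)
			return 0;
		/* ... rest unchanged ... */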