From: Richard Wareing
Subject: Re: [PATCH 1/3] xfs: Add rtdefault mount option
Date: Sun, 3 Sep 2017 03:31:55 +0000
Message-ID: <48FAC1A6-43D6-4A14-AC68-42CE80F2AE0A@fb.com>
References: <25856B28-A65C-4C5B-890D-159F8822393D@fb.com> <20170901043151.GZ10621@dastard> <20170901193237.GF29225@bfoster.bfoster> <20170901225539.GC10621@dastard> <67F62657-D116-4B85-9452-5BAB52EC7041@fb.com> <20170902115545.GA36492@bfoster.bfoster>
To: Brian Foster
Cc: Dave Chinner, "linux-xfs@vger.kernel.org"
List-Id: xfs

Quick correction: it should be >100k hours on the RT code, not >1M (maths is hard); we'll get to 1M soon, but not there yet ;).

Richard

On 9/2/17, 5:44 PM, "linux-xfs-owner@vger.kernel.org on behalf of Richard Wareing" wrote:

    On 9/2/17, 4:55 AM, "Brian Foster" wrote:

    On Fri, Sep 01, 2017 at 11:37:37PM +0000, Richard Wareing wrote:
    >
    > > On Sep 1, 2017, at 3:55 PM, Dave Chinner wrote:
    > >
    > > [Saturday morning here, so just a quick comment]
    > >
    > > On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
    > >>> On Sep 1, 2017, at 12:32 PM, Brian Foster wrote:
    > >>>
    > >>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
    > >>>> Thanks for the quick feedback Dave!  My comments are in-line below.
    > >>>>
    > >>>>
    > >>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner wrote:
    > >>>>>
    > >>>>> Hi Richard,
    > >>>>>
    > >>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
    > >>> ...
    > >>>>>> add
    > >>>>>> support for the more sophisticated AG based block allocator to RT
    > >>>>>> (bitmapped version works well for us, but multi-threaded use-cases
    > >>>>>> might not do as well).
    > >>>>>
    > >>>>> That's a great big can of worms - not sure we want to open it. The
    > >>>>> simplicity of the rt allocator is one of its major benefits to
    > >>>>> workloads that require deterministic allocation behaviour...
    > >>>>
    > >>>> Agreed, I took a quick look at what it might take and came to a
    > >>>> similar conclusion, but I can dream :).
    > >>>>
    > >>>
    > >>> Just a side point based on the discussion so far... I kind of get the
    > >>> impression that the primary reason for using realtime support here is
    > >>> for the simple fact that it's a separate physical device. That provides
    > >>> a basic mechanism to split files across fast and slow physical storage
    > >>> based on some up-front heuristic. The fact that the realtime feature
    > >>> uses a separate allocation algorithm is actually irrelevant (and
    > >>> possibly a problem in the future).
    > >>>
    > >>> Is that an accurate assessment? If so, it makes me wonder whether it's
    > >>> worth thinking about if there are ways to get the same behavior using
    > >>> traditional functionality.
    > >>> This ignores Dave's question about how much of the performance
    > >>> actually comes from simply separating out the log, but for example
    > >>> suppose we had a JBOD block device made up of a combination of
    > >>> spinning and solid state disks via device-mapper, with the
    > >>> requirement that a boundary from fast -> slow and vice versa was
    > >>> always at something like a 100GB alignment. Then if you formatted
    > >>> that device with XFS using 100GB AGs (or whatever to make them line
    > >>> up), and could somehow tag each AG as "fast" or "slow" based on the
    > >>> known underlying device mapping,
    > >
    > > Not a new idea. :)
    > >

    Yeah (whatever is? :P)... I know we've discussed having more controls or
    attributes of AGs for various things in the past. I'm not trying to
    propose a particular design here, but rather trying to step back from
    the focus on RT and understand what the general requirements are
    (multi-device, tiering, etc.). I've not seen the pluggable allocation
    stuff before, but it sounds like that could suit this use case
    perfectly.

    > > I've got old xfs_spaceman patches sitting around somewhere for
    > > ioctls to add such information to individual AGs. I think I called
    > > them "concat groups" to allow multiple AGs to sit inside a single
    > > concatenation, and they added a policy layer over the top of AGs
    > > to control things like metadata placement....
    > >

    Yeah, the alignment thing is just the first thing that popped into my
    head for a thought experiment. Programmatic knobs on AGs via ioctl() or
    sysfs are certainly a more legitimate solution.

    > >>> could you potentially get the same results by using the
    > >>> same heuristics to direct files to particular sets of AGs rather than
    > >>> between two physical devices?
    > >
    > > That's pretty much what I was working on back at SGI in 2007. i.e.
    > > providing a method for configuring AGs with different
    > > characteristics and a userspace policy interface to configure and
    > > make use of it....
    > >
    > > http://oss.sgi.com/archives/xfs/2009-02/msg00250.html
    > >
    > >
    > >>> Obviously there are some differences like
    > >>> metadata being spread across the fast/slow devices (though I think we
    > >>> had such a thing as metadata only AGs), etc.
    > >
    > > We have "metadata preferred" AGs, and that is what the inode32
    > > policy uses to place all the inodes and directory/attribute metadata
    > > in the 32bit inode address space. It doesn't get used for data
    > > unless the rest of the filesystem is ENOSPC.
    > >

    Ah, right. Thanks.

    > >>> I'm just handwaving here to
    > >>> try and better understand the goal.
    > >
    > > We've been down these paths many times - the problem has always been
    > > that the people who want complex, configurable allocation policies
    > > for their workload have never provided the resources needed to
    > > implement past "here's a mount option hack that works for us".....
    > >

    Yep. To be fair, I think what Richard is doing is an interesting and
    useful experiment.
    If one wants to determine whether there's value in directing files
    across separate devices via file size in a constrained workload, it
    makes sense to hack up things like RT and fallocate() because they
    provide the basic mechanisms you'd want to take advantage of without
    having to reimplement that stuff just to prove a concept.

    The challenge of course is then realizing when you're done that this
    is not a generic solution. It abuses features/interfaces in ways they
    were not designed for, disrupts traditional functionality, makes
    assumptions that may not be valid for all users (i.e., file size based
    filtering, number of devices, device to device ratios), etc. So we
    have to step back and try to piece together a more generic,
    upstream-worthy approach. To your point, it would be nice if those
    exploring these kinds of hacks would contribute more to that upstream
    process rather than settle on running the "custom fit" hack until
    upstream comes around with something better on its own. ;) (Though
    sending it out is still better than not, so thanks for that. :)

    > >> Sorry, I forgot to clarify the origins of the performance wins
    > >> here. This is obviously very workload dependent (e.g.
    > >> write/flush/inode updatey workloads benefit the most), but for our
    > >> use case about ~65% of the IOP savings come from the metadata
    > >> (~1/3 journal + slightly less than 1/3 sync of metadata from the
    > >> journal, slightly less as some journal entries get canceled); the
    > >> remaining ~1/3 of the win comes from reading small files from the
    > >> SSD vs. HDDs (about 25-30% of our file population is <=256k,
    > >> depending on the cluster). To be clear, we don't split files: we
    > >> store the data blocks of a file either entirely on the SSD (e.g.
    > >> small files <=256k) or entirely on the real-time HDD device. The
    > >> basic principle here being that larger files MIGHT have small IOPs
    > >> to them (in our use-case this happens to be rare, but not
    > >> impossible), but small files always do, and when 25-30% of your
    > >> population is small... that's a big chunk of your IOPs.
    > >
    > > So here's a test for you. Make a device with a SSD as the first 1TB,
    > > and your HDD as the rest (use dm to do this). Then use the inode32
    > > allocator (mount option) to split metadata from data. The filesystem
    > > will keep inodes/directories on the SSD and file data on the HDD
    > > automatically.
    > >
    > > Better yet: have data allocations smaller than stripe units target
    > > metadata preferred AGs (i.e. the SSD region) and allocations larger
    > > than stripe unit target the data-preferred AGs. Set the stripe unit
    > > to match your SSD/HDD threshold....
    > >
    > > [snip]
    > >
    > >> The AG based allocator could work, though it's going to be a very
    > >> hard sell to use device-mapper, as this isn't code we have ever
    > >> used in our storage stack. At our scale, there are important
    > >> operational reasons we need to keep the storage stack simple (fewer
    > >> bugs to hit), so keeping the solution contained within XFS is a
    > >> necessary requirement for us.
    > >

    I am obviously not at all familiar with your storage stack and the
    requirements of your environment and whatnot. It's certainly possible
    that there's some technical reason you can't use dm, but I find it very
    hard to believe that reason is "there might be bugs" if you're instead
    willing to hack up and deploy a barely tested feature such as XFS RT.
    Using dm for basic linear mapping (i.e., partitioning) seems pretty
    much ubiquitous in the Linux world these days.

    Bugs aren't the only reason of course, but we've been working on this
    for a number of months, and we also have thousands of production hours
    (* >10 FSes per system == >1M hours on the real-time code) on this
    setup. I'm also doing more testing with dm-flakey + dm-log w/ xfstests
    along with this. In any event, large deviations (or starting over from
    scratch) on our setup aren't something we'd like to do. At this point
    I trust the RT allocator a good amount, and its sheer simplicity is
    something of an asset for us.

    To be honest, if an AG allocator solution were available, I'd have to
    think carefully about whether it would make sense for us (though I'd
    be willing to help test/create it). Once you have the small files
    filtered out to an SSD, you can dramatically increase the extent sizes
    on the RT FS (you don't waste space on small allocations), yielding
    very dependable/contiguous read/write IOs (we want multi-MB average
    IOs), and the dependable latencies mesh well with the needs of a
    distributed FS. I'd need to make sure these characteristics were
    achievable with the AG allocator (yes, there is the "allocsize" option,
    but it's more of a suggestion than the hard guarantee of the RT
    extents); its complexity also makes developers prone to treating it as
    a "black box" and ending up with less than stellar IO efficiencies.

    > > Modifying the filesystem on-disk format is far more complex than
    > > adding dm to your stack. Filesystem modifications are difficult and
    > > time consuming because if we screw up, users lose all their data.
    > >
    > > If you can solve the problem with DM and a little bit of additional
    > > in-memory kernel code to categorise and select which AG to use for
    > > what (i.e. policy stuff that can be held in userspace), then that is
    > > pretty much the only answer that makes sense from a filesystem
    > > developer's point of view....
    > >

    Yep, agreed.

    > > Start by thinking about exposing AG behaviour controls through sysfs
    > > objects and configuring them at mount time through udev event
    > > notifications.
    > >
    >
    > Very cool idea. A detail which I left out which might complicate
    > this is that we only use 17GB of SSD for each ~8-10TB HDD (we share
    > just a small 256G SSD across about 15 drives), and even then we don't
    > even use 50% of the SSD for these partitions. We also want to be very
    > selective about what data we let touch the SSD: we don't want folks
    > who write large files via small IOs to touch the SSD, only IO to
    > small files (which are immutable in our use-case).
    >

    I think Dave's more after the data point of how much basic
    metadata/data separation helps your workload. This is an experiment you
    can run to get that behavior without having to write any code (maybe a
    little for the stripe unit thing ;). If there's a physical device size
    limitation, perhaps you can do something crazy like create a sparse 1TB
    file on the SSD, map that to a block device over loop or something and
    proceed from there.

    We have a very good idea on this already; we also have data for a 7 day
    period when we simply did MD offload to SSD alone. Prior to even doing
    this setup, we used blktrace and examined all the metadata IO requests
    (e.g. per the RWBS field). It's about 60-65% of the IO savings, and the
    remaining ~35% is from the small file IO. For us, it's worth saving.
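
    (For reference, Dave's dm + inode32 experiment might look roughly like
    the sketch below. The device names (/dev/sdb for a 1TB SSD, /dev/sdc
    for the HDD), the "hybrid" target name and the mount point are
    illustrative assumptions, not details from this thread:

        # Concatenate the SSD (first 1TiB) and the HDD into one linear dm
        # device; dm table units are 512-byte sectors, 1TiB = 2147483648.
        HDD_SECTORS=$(blockdev --getsz /dev/sdc)
        printf '%s\n' \
            "0 2147483648 linear /dev/sdb 0" \
            "2147483648 ${HDD_SECTORS} linear /dev/sdc 0" \
            | dmsetup create hybrid

        # Per Dave's description above, the inode32 policy keeps inodes and
        # directory/attribute metadata in the low, metadata-preferred AGs,
        # i.e. the SSD portion of the concatenation.
        mkfs.xfs /dev/mapper/hybrid
        mount -o inode32 /dev/mapper/hybrid /mnt/test

    Note this only gets the metadata/data split; the "better yet" idea of
    steering sub-stripe-unit data allocations to the SSD AGs would still
    need kernel changes.)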

    Wrt performance, we observe average 50%+ drops in latency for nearly
    all IO requests. The smaller IO requests should improve quite a bit
    more, but we need to change our threading model a bit to take
    advantage of the fact that the small files are on the SSDs (and
    therefore don't need to wait behind other requests coming from the
    HDDs).

    Though I guess that since this is a performance experiment, a better
    idea may be to find a bigger SSD or concat 4 of the 256GB devices into
    1TB and use that, assuming you're able to procure enough devices to
    run an informative test.

    Brian

    > On an unrelated note, after talking to Omar Sandoval & Chris Mason
    > over here, I'm reworking rtdefault to change it to "rtdisable", which
    > gives the same operational outcome as rtdefault w/o setting
    > inheritance bits (see prior e-mail). This way folks have a kill
    > switch of sorts, yet it otherwise maintains the existing "persistent"
    > behavior.
    >
    >
    > > Cheers,
    > >
    > > Dave.
    > > --
    > > Dave Chinner
    > > david@fromorbit.com
    >
    > --
    > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
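
    (Similarly, Brian's fallback of backing the fast region with a sparse
    file on the existing small SSD could be sketched as follows; the path
    and the 1TB size are illustrative assumptions:

        # Create a sparse 1TiB file on the SSD and expose it as a block
        # device via loop; it only consumes SSD space as it gets written.
        truncate -s 1T /ssd/fast-region.img
        LOOPDEV=$(losetup -f --show /ssd/fast-region.img)

        # ${LOOPDEV} can then stand in for the SSD leg of the dm linear
        # table in the earlier sketch.

    The same dm linear table could equally concatenate four 256GB SSDs
    into the 1TB fast region, as Brian suggests above.)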