Subject: Re: [RFC PATCH 2/2] ceph: test basic ceph.quota.max_bytes quota
From: "Yan, Zheng"
Date: Mon, 15 Apr 2019 10:16:18 +0800
Message-ID: <0cbc6885-93ae-ca79-184e-cdc56681202c@redhat.com>
In-Reply-To: <20190414221535.GF1695@dread.disaster.area>
References: <20190402103428.21435-1-lhenriques@suse.com>
 <20190402103428.21435-3-lhenriques@suse.com>
 <20190402210931.GV23020@dastard>
 <87d0m3e81f.fsf@suse.com>
 <874l7fdy5s.fsf@suse.com>
 <20190403214708.GA26298@dastard>
 <87tvfecbv5.fsf@suse.com>
 <20190412011559.GE1695@dread.disaster.area>
 <740207e9-b4ef-e4b4-4097-9ece2ac189a7@redhat.com>
 <20190414221535.GF1695@dread.disaster.area>
To: Dave Chinner
Cc: Luis Henriques, Nikolay Borisov, fstests@vger.kernel.org, ceph-devel@vger.kernel.org

On 4/15/19 6:15 AM, Dave Chinner wrote:
> On Fri, Apr 12, 2019 at 11:37:55AM +0800, Yan, Zheng wrote:
>> On 4/12/19 9:15 AM, Dave Chinner wrote:
>>> On Thu, Apr 04, 2019 at 11:18:22AM +0100, Luis Henriques wrote:
>>>> Dave Chinner writes:
>>>>
>>>>> On Wed, Apr 03, 2019 at 02:19:11PM +0100, Luis Henriques wrote:
>>>>>> Nikolay Borisov writes:
>>>>>>> On 3.04.19 г. 12:45 ч., Luis Henriques wrote:
>>>>>>>> Dave Chinner writes:
>>>>>>>>> Makes no sense to me. xfs_io does a write() loop internally with
>>>>>>>>> this pwrite command of 4kB writes - the default buffer size. If you
>>>>>>>>> want xfs_io to loop doing 1MB sized pwrite() calls, then all you
>>>>>>>>> need is this:
>>>>>>>>>
>>>>>>>>>   $XFS_IO_PROG -f -c "pwrite -w -B 1m 0 ${size}m" $file | _filter_xfs_io
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you for your review, Dave. I'll make sure the next revision of
>>>>>>>> these tests will include all your comments implemented... except for
>>>>>>>> this one.
>>>>>>>>
>>>>>>>> The reason I'm using a loop for writing a file is due to the nature of
>>>>>>>> the (very!) loose definition of quotas in CephFS. Basically, clients
>>>>>>>> will likely write some amount of data over the configured limit because
>>>>>>>> the servers they are communicating with to write the data (the OSDs)
>>>>>>>> have no idea about the concept of quotas (or files even); the filesystem
>>>>>>>> view in the cluster is managed at a different level, with the help of
>>>>>>>> the MDS and the client itself.
>>>>>>>>
>>>>>>>> So, the loop in this function is simply to allow the metadata associated
>>>>>>>> with the file to be updated while we're writing the file. If I use a
>>>>>>>
>>>>>>> But the metadata will be modified while writing the file even with a
>>>>>>> single invocation of xfs_io.
>>>>>>
>>>>>> No, that's not true. It would be too expensive to keep the metadata
>>>>>> server updated while writing to a file. So, making sure there's
>>>>>> actually an open/close to the file (plus the fsync in pwrite) helps
>>>>>> making sure the metadata is flushed into the MDS.
>>>>>
>>>>> /me sighs.
>>>>>
>>>>> So you want:
>>>>>
>>>>> 	loop until ${size}MB written:
>>>>> 		write 1MB
>>>>> 		fsync
>>>>> 		  -> flush data to server
>>>>> 		  -> flush metadata to server
>>>>>
>>>>> i.e. this one liner:
>>>>>
>>>>> 	xfs_io -f -c "pwrite -D -B 1m 0 ${size}m" /path/to/file
>>>>
>>>> Unfortunately, that doesn't do what I want either :-/
>>>> (and I guess you meant '-b 1m', not '-B 1m', right?)
>>>
>>> Yes. But I definitely did mean "-D" so that RWF_DSYNC was used with
>>> each 1MB write.
>>>
>>>> [ Zheng: please feel free to correct me if I'm saying something really
>>>>   stupid below. ]
>>>>
>>>> So, one of the key things in my loop is the open/close operations. When
>>>> a file is closed in cephfs the capabilities (that's ceph jargon for what
>>>> sort of operations a client is allowed to perform on an inode) will
>>>> likely be released and that's when the metadata server will get the
>>>> updated file size. Before that, the client is allowed to modify the
>>>> file size if it has acquired the capabilities for doing so.
>>>
>>> So you are saying that O_DSYNC writes on ceph do not force file
>>> size metadata changes to the metadata server to be made stable?
>>>
>>>> OTOH, a pwrite operation will eventually get the -EDQUOT even with the
>>>> one-liner above because the client itself will realize it has exceeded a
>>>> certain threshold set by the MDS and will eventually update the server
>>>> with the new file size.
>>>
>>> Sure, but if the client crashes without having sent the updated file
>>> size to the server as part of an extending O_DSYNC write, then how
>>> is it recovered when the client reconnects to the server and
>>> accesses the file again?
>>
>> For a DSYNC write, the client has already written the data to the object
>> store. If the client crashes, the MDS sets the file to 'recovering' state
>> and probes the file size by checking the object store. Access to the file
>> is blocked during recovery.
>
> IOWs, ceph allows data integrity writes to the object store even
> though those writes breach limits on that object store? i.e.
> ceph quota essentially ignores O_SYNC/O_DSYNC metadata requirements?
>

The current cephfs quota implementation checks the quota (it compares
i_size against the quota setting) at the very beginning of
ceph_write_iter(). It has nothing to do with O_SYNC or O_DSYNC.

Regards
Yan, Zheng

> FWIW, quotas normally have soft and hard limits - soft limits can be
> breached with a warning and a time limit to return under the soft
> limit, but the quota hard limit should /never/ be breached by users.
>
> I guess that's the way of the world these days - fast and loose
> because everyone demands fast before correct....
>
> Cheers,
>
> Dave.
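
For anyone following the thread, here is a minimal sketch, in the usual
fstests shell style, of the chunked-write loop being debated above. The
helper name and the chunk arithmetic are illustrative only and are not
taken from Luis's patch:

	# Hypothetical helper: write $2 MB to file $1 in 1MB chunks, one
	# xfs_io invocation per chunk, so that every chunk gets its own
	# open/close (plus an fdatasync via -w) and the client's file size
	# update has a chance to reach the MDS before the next write.
	write_file_in_chunks()
	{
		local file=$1
		local size_mb=$2
		local i

		for ((i = 0; i < size_mb; i++)); do
			$XFS_IO_PROG -f -c \
				"pwrite -w -b 1m $((i * 1048576)) 1m" \
				$file | _filter_xfs_io
		done
	}

Whether the per-chunk open/close is what actually pushes the size update
to the MDS, or whether the fdatasync from -w alone would be enough, is
exactly the question being discussed above.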