From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751412AbdFBKLB (ORCPT ); Fri, 2 Jun 2017 06:11:01 -0400 Received: from mail-vk0-f68.google.com ([209.85.213.68]:35118 "EHLO mail-vk0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751285AbdFBKKv (ORCPT ); Fri, 2 Jun 2017 06:10:51 -0400 MIME-Version: 1.0 In-Reply-To: References: <1491613030-11599-1-git-send-email-deepa.kernel@gmail.com> <1491613030-11599-5-git-send-email-deepa.kernel@gmail.com> From: "Yan, Zheng" Date: Fri, 2 Jun 2017 18:10:49 +0800 Message-ID: Subject: Re: [PATCH 04/12] fs: ceph: CURRENT_TIME with ktime_get_real_ts() To: Arnd Bergmann Cc: Deepa Dinamani , John Stultz , Linux Kernel Mailing List , Andrew Morton , Thomas Gleixner , Al Viro , gregkh , "Dilger, Andreas" , "J. Bruce Fields" , Chris Mason , David Miller , David Sterba , Evgeniy Dushistov , Eric Paris , Jaegeuk Kim , Josef Bacik , Jeff Layton , James Simmons , Ingo Molnar , "Drokin, Oleg" , Paul Moore , Steven Rostedt , yuchao0@huawei.com, ceph-devel , devel@driverdev.osuosl.org, linux-audit@redhat.com, linux-btrfs , linux-cifs@vger.kernel.org, "Linux F2FS DEV, Mailing List" , Linux FS-devel Mailing List , linux-mtd , LSM List , Lustre Development List , Networking , samba-technical@lists.samba.org, y2038 Mailman List Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jun 2, 2017 at 5:45 PM, Arnd Bergmann wrote: > On Fri, Jun 2, 2017 at 4:09 AM, Yan, Zheng wrote: >> On Fri, Jun 2, 2017 at 8:57 AM, Deepa Dinamani wrote: >>> On Thu, Jun 1, 2017 at 5:36 PM, John Stultz wrote: >>>> On Thu, Jun 1, 2017 at 5:26 PM, Yan, Zheng wrote: >>>>> On Thu, Jun 1, 2017 at 6:22 PM, Arnd Bergmann wrote: >>>>>> On Thu, Jun 1, 2017 at 11:56 AM, Yan, Zheng wrote: >>>>>>> On Sat, Apr 8, 2017 at 8:57 AM, Deepa Dinamani wrote: >>>>>> >>>>>>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c >>>>>>>> index 517838b..77204da 100644 >>>>>>>> --- a/drivers/block/rbd.c >>>>>>>> +++ b/drivers/block/rbd.c >>>>>>>> @@ -1922,7 +1922,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request) >>>>>>>> { >>>>>>>> struct ceph_osd_request *osd_req = obj_request->osd_req; >>>>>>>> >>>>>>>> - osd_req->r_mtime = CURRENT_TIME; >>>>>>>> + ktime_get_real_ts(&osd_req->r_mtime); >>>>>>>> osd_req->r_data_offset = obj_request->offset; >>>>>>>> } >>>>>>>> >>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c >>>>>>>> index c681762..1d3fa90 100644 >>>>>>>> --- a/fs/ceph/mds_client.c >>>>>>>> +++ b/fs/ceph/mds_client.c >>>>>>>> @@ -1666,6 +1666,7 @@ struct ceph_mds_request * >>>>>>>> ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode) >>>>>>>> { >>>>>>>> struct ceph_mds_request *req = kzalloc(sizeof(*req), GFP_NOFS); >>>>>>>> + struct timespec ts; >>>>>>>> >>>>>>>> if (!req) >>>>>>>> return ERR_PTR(-ENOMEM); >>>>>>>> @@ -1684,7 +1685,8 @@ ceph_mdsc_create_request(struct ceph_mds_client *mdsc, int op, int mode) >>>>>>>> init_completion(&req->r_safe_completion); >>>>>>>> INIT_LIST_HEAD(&req->r_unsafe_item); >>>>>>>> >>>>>>>> - req->r_stamp = current_fs_time(mdsc->fsc->sb); >>>>>>>> + ktime_get_real_ts(&ts); >>>>>>>> + req->r_stamp = timespec_trunc(ts, mdsc->fsc->sb->s_time_gran); >>>>>>> >>>>>>> This change causes our kernel_untar_tar test case to fail (inode's >>>>>>> ctime goes back). The reason is that there is time drift between the >>>>>>> time stamps got by ktime_get_real_ts() and current_time(). We need to >>>>>>> revert this change until current_time() uses ktime_get_real_ts() >>>>>>> internally. >>>>>> >>>>>> Hmm, the change was not supposed to have a user-visible effect, so >>>>>> something has gone wrong, but I don't immediately see how it >>>>>> relates to what you observe. >>>>>> >>>>>> ktime_get_real_ts() and current_time() use the same time base, there >>>>>> is no drift, but there is a difference in resolution, as the latter uses >>>>>> the time stamp of the last jiffies update, which may be up to one jiffy >>>>>> (10ms) behind the exact time we put in the request stamps here. >>>>>> >>>>>> Do you still see problems if you use current_kernel_time() instead of >>>>>> ktime_get_real_ts()? >>>>> >>>>> The problem disappears after using current_kernel_time(). >>>>> >>>>> https://github.com/ceph/ceph-client/commit/2e0f648da23167034a3cf1500bc90ec60aef2417 >>>> >>>> From the commit above: >>>> "It seems there is time drift between ktime_get_real_ts() and >>>> current_kernel_time()" >>>> >>>> Its more of a granularity difference. current_kernel_time() returns >>>> the cached time at the last tick, where as ktime_get_real_ts() reads >>>> the clocksource hardware and returns the immediate time. >>>> >>>> Filesystems usually use the cached time (similar to >>>> CLOCK_REALTIME_COARSE), for performance reasons, as touching the >>>> clocksource takes time. >>> >>> Alternatively, it would be best for this code also to use current_time(). >>> I had suggested this in one of the previous versions of the patch. >>> The implementation of current_time() will change when we switch vfs to >>> use 64 bit time. This will prevent such errors from happening again. >>> But, this also means there is more code reordering for these modules >>> to get a reference to inode. >>> >> >> I took a look. it's quite inconvenience to use current_time(). I >> prefer to temporarily use current_kernel_time(). > > I've looked at the code some more and I think there is another angle to it: > In your test case, 'tar' calls into the utimes syscall (or a member of its > family), which sets the i_ctime field in the inode to the curren time > (using current_time()), then calls __ceph_setattr(), which creates > a mds client request, and ceph_mdsc_create_request() takes another > time stamp and stores it in r_stamp. > > We then store the first timestamp (only) in the in-memory inode, and the > second time stamp in the request. Depending on the state of the inode, > we may also set the ctime to a third timestamp we again take using > current_time(). > > The mtime and atime from user space get passed correctly through > union ceph_mds_request_args->setattr and are kept in sync between > the in-memory inode and the persistent inode data, but the ctime > in the inode never makes it to the lower protocol levels and instead > we use the r_stamp field that got set a little earlier or a little later. > > I believe the bug you see is the result of the two timestamps > currently being almost guaranteed to be different in the latest > kernels. > Changing r_stamp to use current_kernel_time() will make it the > same value most of the time (as it was before Deepa's patch), > but when the timer interrupt happens between the timestamps, > the two are still different, it's just much harder to hit. > > I think the proper solution should be to change __ceph_setattr() > in a way that has req->r_stamp always synchronized with i_ctime. > If we copy i_ctime to r_stamp, that will also take care of the > future issues with the planned changes to current_time(). > I already have a patch https://github.com/ceph/ceph-client/commit/24f54cd18e195a002ee3d2ab50dbc952fd9f82af > The part I don't understand is what else r_stamp (i.e. the time > stamp in ceph_msg_data with type== > CEPH_MSG_CLIENT_REQUEST) is used for, other than setting > ctime in CEPH_MDS_OP_SETATTR. > > Will this be used to update the stored i_ctime for other operations > too? If so, we would need to synchronize it with the in-memory > i_ctime for all operations that do this. > yes, mds uses it to update ctime of modified inodes. For example, when handling mkdir, mds set ctime of both parent inode and new inode to r_stamp. Regards Yan, Zheng > Semi-related side note: I see that the granularity for ceph_timespec > is 1000 nanoseconds, so the value is always a multiple of 1000. > If the full 32-bit data gets stored, we could use this it to store the > epoch number in the future: > > static inline void ceph_decode_timespec(struct timespec64 *ts, > const struct ceph_timespec *tv) > { > u32 ns_epoch = le32_to_cpu(tv->tv_nsec); > u32 epoch = ns_epoch % 1000; > > /* tv_sec is traditionally interpreted as unsigned > * with time ranges 1970-2106, we extend > * it to 1970-138069 */ > ts->tv_sec = (u64)le32_to_cpu(tv->tv_sec) + > (u64)epoch << 32; > ts->tv_nsec = ns_epoch - epoch; > } > > Arnd