From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F44FC35247 for ; Tue, 4 Feb 2020 18:06:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3761D2166E for ; Tue, 4 Feb 2020 18:06:24 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="az91qY8K" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727466AbgBDSGX (ORCPT ); Tue, 4 Feb 2020 13:06:23 -0500 Received: from mail-il1-f193.google.com ([209.85.166.193]:35026 "EHLO mail-il1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727355AbgBDSGX (ORCPT ); Tue, 4 Feb 2020 13:06:23 -0500 Received: by mail-il1-f193.google.com with SMTP id g12so16723842ild.2; Tue, 04 Feb 2020 10:06:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=mc62ZSJ2FyEbetH4nweBkjrAs6Ck5WUQW5hhZvQRO50=; b=az91qY8K3YfCW0QGFi1FZ/pBdRYotv9Le+rXXqWUZQNr/OGWLGYifvJwzO97uDlxNF EHB6INZm/VLtEkppIfPqkV+vJxyUYszv1d5Aevlx+zUCD2vY3eGVxArvUTN19rdJUaEc CHL2j/uXpkPq183F50rozETD9JOHr27NNlbnmdb6+ZFnRb+RVDVr2ssZewHcaO8CKJGf lOPCjfrKyUrmT2cb/+lOG5eREU2aILHuKjjM+4VQVeh15UXRJg8xX7r4QmKPYtCE1E/t OhpQOzWbOXYq/fGb+BPpr6lViSL5aKf1Vwoy9NvsEiVXV3YuNlaHyU78vvOIvfnSYjgT iAkA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=mc62ZSJ2FyEbetH4nweBkjrAs6Ck5WUQW5hhZvQRO50=; b=KjDVjws+P3bwn3w+dn9Y75Gk8n+D4RVeTTrheLwEXXh4d3VZzadnl+B0BK+I/IlLzO wMWfPnz7PCiivh4oZXPMb2zAMeom4/mkOjXVEZ3goTeff+UcPfO6M8jLIEvd2GuqIdkl YD1aMPUkFcm+m308GS6kF7qXKrOseGNkNnRbvrpNNAdNyfR4C079lO+OnaTV4Oawufd6 JhkDJHLjKi73yy6MPStjql9y2tVSpzMv5/unhrbv5809yXUxw+7AIHq30oj6v7HJb3p1 Kea/IG2VMfXxBTK81kSeG2a0DFzKSHHOlywsAhLOjjNvwzStwFtTFA8wVBEruAg73ARp edSA== X-Gm-Message-State: APjAAAXDSvcj5Oa4Du6Z3zU15PH5VThd+mfCnnQhNMvEpN94bZR5T4Sd 5CyR27CE9q5SEmVzBmHR2hIF8laU4iC2sFO7JXA= X-Google-Smtp-Source: APXvYqxKyl971a5Xqlv2EwzvWwPfaoP5Bparz3CeO839snzSqpxb29ZVRbO7MW2JMyg+DfmWnqO+CoAF5YT0aGoBJvw= X-Received: by 2002:a92:b749:: with SMTP id c9mr21417293ilm.143.1580839582327; Tue, 04 Feb 2020 10:06:22 -0800 (PST) MIME-Version: 1.0 References: <20200203165117.5701-1-lhenriques@suse.com> <20200203165117.5701-2-lhenriques@suse.com> <20200204151158.GA15992@suse.com> In-Reply-To: <20200204151158.GA15992@suse.com> From: Ilya Dryomov Date: Tue, 4 Feb 2020 19:06:36 +0100 Message-ID: Subject: Re: [PATCH v3 1/1] ceph: parallelize all copy-from requests in copy_file_range To: Luis Henriques Cc: Jeff Layton , Sage Weil , "Yan, Zheng" , Gregory Farnum , Ceph Development , LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 4, 2020 at 4:11 PM Luis Henriques wrote: > > On Tue, Feb 04, 2020 at 11:56:57AM +0100, Ilya Dryomov wrote: > ... > > > @@ -2108,21 +2118,40 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off, > > > CEPH_OSD_OP_FLAG_FADVISE_DONTNEED, > > > dst_ci->i_truncate_seq, dst_ci->i_truncate_size, > > > CEPH_OSD_COPY_FROM_FLAG_TRUNCATE_SEQ); > > > - if (err) { > > > - if (err == -EOPNOTSUPP) { > > > - src_fsc->have_copy_from2 = false; > > > - pr_notice("OSDs don't support 'copy-from2'; " > > > - "disabling copy_file_range\n"); > > > - } > > > + if (IS_ERR(req)) { > > > + err = PTR_ERR(req); > > > dout("ceph_osdc_copy_from returned %d\n", err); > > > + > > > + /* wait for all queued requests */ > > > + ceph_osdc_wait_requests(&osd_reqs, &reqs_complete); > > > + ret += reqs_complete * object_size; /* Update copied bytes */ > > > > Hi Luis, > > > > Looks like ret is still incremented unconditionally? What happens > > if there are three OSD requests on the list and the first fails but > > the second and the third succeed? As is, ceph_osdc_wait_requests() > > will return an error with reqs_complete set to 2... > > > > > if (!ret) > > > ret = err; > > > > ... and we will return 8M instead of an error. > > Right, my assumption was that if a request fails, all subsequent requests > would also fail. This would allow ret to be updated with the number of > successful requests (x object size), even if the OSDs replies were being > delivered in a different order. But from your comment I see that my > assumption is incorrect. > > In that case, when shall ret be updated with the number of bytes already > written? Only after a successful call to ceph_osdc_wait_requests()? I mentioned this in the previous email: you probably want to change ceph_osdc_wait_requests() so that the counter isn't incremented after an error is encountered. > I.e. only after each throttling cycle, when we don't have any requests > pending completion? In this case, I can simply drop the extra > reqs_complete parameter to the ceph_osdc_wait_requests. > > In your example the right thing to do would be to simply return an error, > I guess. But then we're assuming that we're loosing space in the storage, > as we may have created objects that won't be reachable anymore. Well, that is what I'm getting at -- this needs a lot more consideration. How errors are dealt with, how file metadata is updated, when do we fall back to plain copy, etc. Generating stray objects is bad but way better than reporting that e.g. 0..12M were copied when only 0..4M and 8M..12M were actually copied, leaving the user one step away from data loss. One option is to revert to issuing copy-from requests serially when an error is encountered. Another option is to fall back to plain copy on any error. Or perhaps we just don't bother with the complexity of parallel copy-from requests at all... Of course, no matter what we do for parallel copy-from requests, the existing short copy bug needs to be fixed separately. > > > > > > goto out_caps; > > > } > > > + list_add(&req->r_private_item, &osd_reqs); > > > len -= object_size; > > > src_off += object_size; > > > dst_off += object_size; > > > - ret += object_size; > > > + /* > > > + * Wait requests if we've reached the maximum requests allowed > > > + * or if this was the last copy > > > + */ > > > + if ((--copy_count == 0) || (len < object_size)) { > > > + err = ceph_osdc_wait_requests(&osd_reqs, &reqs_complete); > > > + ret += reqs_complete * object_size; /* Update copied bytes */ > > > > Same here. > > > > > + if (err) { > > > + if (err == -EOPNOTSUPP) { > > > + src_fsc->have_copy_from2 = false; > > > > Since EOPNOTSUPP is special in that it triggers the fallback, it > > should be returned even if something was copied. Think about a > > mixed cluster, where some OSDs support copy-from2 and some don't. > > If the range is split between such OSDs, copy_file_range() will > > always return short if the range happens to start on a new OSD. > > IMO, if we managed to copy some objects, we still need to return the > number of bytes copied. Because, since this return value will be less > then the expected amount of bytes, the application will retry the > operation. And at that point, since we've set have_copy_from2 to 'false', > the default VFS implementation will be used. Ah, yeah, given have_copy_from2 guard, this particular corner case is fine. Thanks, Ilya