From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-11.5 required=3.0 tests=AWL,BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI,USER_IN_DEF_DKIM_WL shortcircuit=no autolearn=ham autolearn_force=no version=3.4.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id 8F67F1F404 for ; Tue, 11 Sep 2018 18:32:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728010AbeIKXcl (ORCPT ); Tue, 11 Sep 2018 19:32:41 -0400 Received: from mail-yw1-f67.google.com ([209.85.161.67]:36336 "EHLO mail-yw1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727007AbeIKXcl (ORCPT ); Tue, 11 Sep 2018 19:32:41 -0400 Received: by mail-yw1-f67.google.com with SMTP id i144-v6so3845969ywc.3 for ; Tue, 11 Sep 2018 11:32:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=oT5HFqpQXrGZxQSfUNc4JPy1nb8ktHk0HJoyTxg8Q2Y=; b=UsbD55WauuCNLf18i8TgsnMOCHDOUw21U6AmJwf+w951dgDrAKx3DQmxLzexpeiLAp TVqiGvXjlGNu9jHzH3N92tOO/NKbMswfXcViovw/4Wxs4QCltTvtffVejXFxCDlu9cPx e68/AQkWb0cF6usURTAgFrb0hgwS17DgX4C/SOKJ8/wrsQ085rbwGABiFwtp2esX1UVD dqpNAChiJFpPRbMLqUDBBex6vuZcfP6h/dLL/3Wn1Xj/2/rLxxLxQFvT6da82vnzysm8 0KADJyGYb4gRmcwAXNj4mCb3PLKTPQtglCdA+Hv2LVuj+Mq3UENyALfURkLAxKkGTGcz QcPQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=oT5HFqpQXrGZxQSfUNc4JPy1nb8ktHk0HJoyTxg8Q2Y=; b=a5K9ynnBnDJoNKBVsEaNO4CQ9eOZ7KP7vxcGbjY+H+/UqJ+VqN0SNzyayC6Iu7mV9C gZOV2T6A7OKbW68yl4pxSiMdiSG5xAz4NMToMytUutQymtVQ0/48MgijCftDljjGnlFS ve41/dfrduHht5R4Yfdm3H8kgdbChKqGLOOGM599jPFCxQ/TLlhVMEB9Xi4/LeBU6s1o HRTTpLNkH5DqNbTOTiMmWKuH3YnWIJnfxsvzWSxiZkEyTP1Dy/zQotiB2P7GGMgFzUps u9zwE+VQ7s9It6EUnjyOwYi1RH6kcdUPBHVUtH2hKrbnavxFHmIaUKk+N0Zp+yqTgbl3 7cLg== X-Gm-Message-State: APzg51AzfELRjsN4zV35er9Le6z+/UBmuWexUGIfhh1JsHkRa4DEpbRa g9LYxqISSy57sEsWe8A31WL/6IPo2LJch6KXVB6IRWupeXI= X-Google-Smtp-Source: ANB0VdZpwC6LoLaFgKnJ1pIvwGo4K3PILmYv2cJ1E3CF3cw3A2Qj/wj9/jVXDJqrinitB7e3d+hSiWGZVCRpr0qJU6E= X-Received: by 2002:a81:5855:: with SMTP id m82-v6mr12793958ywb.300.1536690726528; Tue, 11 Sep 2018 11:32:06 -0700 (PDT) MIME-Version: 1.0 References: <20180904230149.180332-1-sbeller@google.com> <20180904230149.180332-7-sbeller@google.com> In-Reply-To: From: Stefan Beller Date: Tue, 11 Sep 2018 11:31:55 -0700 Message-ID: Subject: Re: [PATCH 06/11] submodule.c: sort changed_submodule_names before searching it To: Junio C Hamano Cc: git Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On Thu, Sep 6, 2018 at 11:03 AM Junio C Hamano wrote: > > Stefan Beller writes: > > > Instead of sorting it after we created an unsorted list, we could insert > > correctly into the list. > > It is unclear what problem you are solving, especially with > subjunctive "could" there. We are creating an unsorted list and > then sorting it and you see it as a problem because it is just as > easy and efficient to do the insertion sort while building up the > list? (don't react and answer without reading all the way to the > end; I think I know what is going on). > > > As the unsorted append is in order of cache entry > > names, this is already sorted if names were equal to paths for submodules. > > That may be a statement of a fact, but it is unclear how that fact > relates to what the code is doing before this patch, or what the > code updated by this patch is doing. > > > As submodule names are often the same as their path, the input is sorted > > pretty well already, so let's just do the sort afterwards. > > It is unclear what (performance?) trade-off this senttence is trying > to make. It sounds as if it is claiming this: > > We can string_list_insert() to maintain sorted-ness of the > list as we find new items, or we can string_list_append() to > build an unsorted list and sort it at the end just once. > > To pick which one is more appropriate, we notice the fact > that we discover new items more or less in the already > sorted order. That makes "append then sort" more > appropriate. > > But is that reasoning sensible? > > I'd imagine that append-then-sort would always be more efficient > than insert-at-the-right-place-as-we-go, and the reason why we > sometimes would need to do the latter is when we need to look up > elements in the middle of building the list (e.g. we may want to > de-dup, which requires us to look up a name from the ones we > collected so far). If we come across a (mostly) sorted list, then the insert-at-the-right-place we'd come across the best case of insertion sort which is O(n), which sounds better than append-then-sort as our sorting is a merge sort, which has O(n log n) even in its best case (and needs to copy stuff into a temp buffer and back). By having the submodules named after its path, I strongly suspect we have a mostly sorted list in nearly all cases except some really interesting corner cases out there. > And in this application, calculate_changed_submodule_paths() > discover paths by calling collect_changed_submodules() which finds a > mapping into a list sorted by > submodule name, and then goes through that list and builds a list of > submodules paths (which could be different from submodule names) by > appending. Only after this list is fully built, get_next_submodule() > gets called, so making the latter use string_list_lookup() that assumes > a sorted list is safe if we built the list by append-then-sort (iow, > sortedness while building the list does not matter). > > Having analysed all that, I find it somewhat iffy that _append() is > used there in the calculate_changed_submodule_paths() function. Note that this is fixed in the later patch "submodule.c: do not copy around submodule list" > It > would cause the resulting changed_submodule_names list to hold the > same name twice (or more), This would be possible if there is a submodule at path A and another submodule (at a different path) named "A", as we'll try hard to collect names, but are also okay with path as we want to keep supporting the historical use case of submodules. > but I do not know if that would pose a > problem to the consumer of the list. Using "accumulate then sort > before calling look-up" would not change it as string_list_sort() > would not dedup, so I do not think this patch would introduce a new > problem, though. Yes, that is true, so we'd want to extend the message above to mention the potential duplicates. Thanks, Stefan