From: Derrick Stolee
Date: Fri, 29 Oct 2021 14:46:19 -0400
Subject: Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
To: Ævar Arnfjörð Bjarmason, git@vger.kernel.org
Cc: Junio C Hamano, Jeff King, Patrick Steinhardt, Christian Couder,
    Albert Cui, Jonathan Tan, Jonathan Nieder, "brian m . carlson",
    "Robin H . Johnson"
X-Mailing-List: git@vger.kernel.org

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> This implements a new "bundle-uri" protocol v2 extension, which allows
> servers to advertise *.bundle files which clients can pre-seed their
> full "clone"'s or incremental "fetch"'s from.
>
> This is both an alternative to, and complimentary to the existing
> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
> both, but would generally pick one over the other.
>
> This "bundle-uri" mechanism has the advantage of being dumber, and
> offloads more complexity from the server side to the client
> side.

Generally, I like that using bundles presents an easier way to serve
static content from an alternative source and then let Git's fetch
negotiation catch up with the remainder.

However, after inspecting your design and talking to some GitHub
engineers who know more about CDNs and general internet things than
I do, I want to propose an alternative design. I think this new design
is both more flexible and further decouples the origin Git server from
the bundle contents.

Your proposed design extends protocol v2 to let the client request
a list of bundle URIs from the origin server. However, this still
requires the origin server to know about this list. Further, your
implementation focuses on the server side without integrating with
the client.

I propose that we flip this around. The "bundle server" should know
which bundles are available at which URIs, and the client should
contact the bundle server directly for a "table of contents" that
lists these URIs, along with metadata related to each URI. The origin
Git server then would only need to store the list of bundle servers
and the URIs to their tables of contents. The client could then pick
from among those bundle servers (probably by ping time, or randomly)
to start the bundle downloads.

To summarize, there are two pieces here that can be implemented at
different times:

1. Create a specification for a "bundle server" that doesn't need to
   speak the Git protocol at all. This could be a REST API
   specification using well-established standards such as JSON for
   the table of contents.

2. Create a way for the origin Git server to advertise known bundle
   servers to clients so they can automatically benefit from faster
   downloads without needing to know about bundle servers.

There are a few key benefits to this approach:

* Further decoupling. The origin Git server doesn't need to know how
  the bundle server organizes its bundles. This allows maximum
  flexibility depending on whether the bundles are stored in something
  like a CDN (where bundles can't be too big) or some kind of blob
  storage (where they can have arbitrarily large size).

* The bundle servers could be run completely independently from the
  origin Git server. Organizations could run their own bundle servers
  to host data in the same building as their build farms. As long as
  they can configure the bundle location at clone/fetch time, the
  origin Git server doesn't need to be involved.

While I didn't go so far as to create a clear standard or implement
a prototype in the Git codebase, I created a very simple prototype [1]
using a Python script that parses a JSON table of contents and
downloads bundles into the Git repository. Then, I made a 'clone.sh'
script that initializes a repository using the bundle fetcher and
fetches the remainder from the origin Git server.

I even computed static bundles for the git.git repository based on
where 'master' has been over several days in the past month, to give
an example of incremental bundles. You can test the approach up to and
including the fetch from github.com (note how the GitHub servers were
not modified in any way for this).

[1] https://github.com/derrickstolee/bundles

There are a lot of limitations to the prototype, but it hopefully
demonstrates the possibility of using something other than the Git
protocol to solve these problems.

Let me know if you are interested in switching your approach to
something more like what I propose here.
There are many more questions about what information could/should be
located in the table of contents and how it can be extended in the
future. I'm interested to explore that space with you.

Thanks,
-Stolee
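
For illustration only, the "table of contents" idea described above
could be sketched roughly as follows. This is not the format used by
the prototype in [1]. The JSON layout, the field names (`bundles`,
`uri`, `timestamp`), the example.com URLs, and the helper
`bundles_to_download` are all assumptions invented for this sketch.

```python
import json

# Hypothetical table of contents as a bundle server might publish it.
# Every field name and URL here is an assumption, not a specification.
TOC_JSON = """
{
  "bundles": [
    {"uri": "https://bundles.example.com/git/2021-10-15.bundle",
     "timestamp": 1634256000},
    {"uri": "https://bundles.example.com/git/base.bundle",
     "timestamp": 1633046400},
    {"uri": "https://bundles.example.com/git/2021-10-29.bundle",
     "timestamp": 1635465600}
  ]
}
"""


def bundles_to_download(toc, have_timestamp=None):
    """Return bundle URIs ordered oldest first.

    If have_timestamp is given, bundles that are not newer than it are
    skipped, which models an incremental fetch by a client that has
    already applied the older bundles.
    """
    entries = sorted(toc["bundles"], key=lambda b: b["timestamp"])
    return [
        b["uri"]
        for b in entries
        if have_timestamp is None or b["timestamp"] > have_timestamp
    ]


if __name__ == "__main__":
    toc = json.loads(TOC_JSON)
    # Fresh clone: every bundle, oldest first.
    for uri in bundles_to_download(toc):
        print("clone needs:", uri)
    # Incremental fetch: only bundles newer than what we already have.
    for uri in bundles_to_download(toc, have_timestamp=1634256000):
        print("fetch needs:", uri)
```

In this sketch the client would download each returned file in order,
fetch from it into the local repository, and then let a normal fetch
from the origin Git server pick up whatever the bundles did not cover.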