git.vger.kernel.org archive mirror
From: Derrick Stolee <stolee@gmail.com>
To: Junio C Hamano <gitster@pobox.com>,
	Phillip Wood <phillip.wood123@gmail.com>
Cc: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org, peff@peff.net, jrnieder@google.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 01/15] run-job: create barebones builtin
Date: Mon, 6 Apr 2020 10:42:23 -0400
Message-ID: <208bdbc7-9c8e-5105-0627-7db86135db7b@gmail.com>
In-Reply-To: <xmqqimidybzu.fsf@gitster.c.googlers.com>

On 4/5/2020 3:21 PM, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> Hi Stolee
>>
>> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The 'git run-job' command will be used to execute a short-lived set
>>> of maintenance activities by a background job manager. The intention
>>> is to perform small batches of work that reduce the foreground time
>>> taken by repository maintenance such as 'git gc --auto'.
>>>
>>> This change does the absolute minimum to create the builtin and show
>>> the usage output.
>>>
>>> Provide an explicit warning that this command is experimental. The
>>> set of jobs may change, and each job could alter its behavior in
>>> future versions.
>>>
>>> RFC QUESTION: This builtin is based on the background maintenance in
>>> Scalar. Specifically, this builtin is based on the "scalar run <job>"
>>> command [1] [2]. My default thought was to make this a "git run <job>"
>>> command to maximize similarity. However, it seems like "git run" is
>>> too generic. Or, am I being overly verbose for no reason?
>>
>> Having read through this series I wondered if we wanted a single git
>> command such as 'git maintenance' (suggestions of better names
>> welcome) and then 'git run-job' could become 'git maintenance run',
>> 'git job-runner' would become another subcommand (run-jobs or
>> schedule-jobs?) and the 'git please-run-maintenance-on-this-repo' you
>> mentioned in you email to Junio could become 'git maintenance init'
>> (or maybe setup)
> 
> I had a very similar impression.  In addition to what you already
> said, a few more were:
> 
>  - Why the existing "git repack" isn't such "maintenance" command?
>    IOW why do we even need [01/15]?  After all, "repack" may have
>    started its life as a tool to reorganize the PACKFILES, but it is
>    no longer limited to 'git/objects/pack/*.pack' files with its
>    knowledge about the loose object files and the "--prune" option.
>    Consolidating pieces of information spread across multiple .idx
>    files, reachability bitmaps and commit graph files, into a newer
>    and more performant forms can just be part of "packing the pieces
>    of information in a repository for optimum performance", which is
>    a better way to understand why "repack" has a word 'pack' in its
>    name.

To me, "git repack" is a specific kind of maintenance. The end result
is a pack-file. Now, "git gc" is a bit more general, because it will
create a pack-file but also update the commit-graph file. Even so, its
name is very specific: it "collects garbage". The goal of this series
is to replace "git gc --auto" with something less invasive.

I'll include an alternate CLI proposal at the end of this message.

>  - Many of the "maintenance" operations this series proposes do make
>    sense, just like other "maintenance" operations we already have
>    in "repack", "prune", "prune-packed" etc., which are welcome
>    additions. 

Thanks. I'm glad these steps make sense. They are definitely more
"incremental" updates than a full repack or GC.
 
>  - Like the individual steps that appear in e.g. "repack", however,
>    some of the individual steps in this series can be triggered by
>    calling underlying tools directly, allowing scripted maintenance
>    commands that suit individual needs better than the canned
>    invocation of "run-job", but I didn't get the impression that the
>    series strives to make sure that all knobs of these individual
>    steps are available to scripters who want to deviate from what
>    "run-job" prescribes.  If it is not doing so, we probably should.
> 
>  - Again, I do not think we want a reimplementation of cron, at or
>    inetd that is not specific to "git" at all.

I expected the job-runner to get some push-back. The design for it in
the current RFC matched how we do it in Scalar more than anything else.
You're probably right that it would be better to leave the "background"
part to the platform.

Of course, not every platform has "cron" but that just means we need a
cross-platform way to launch Git processes on some schedule. That could
be a command that creates a cron job on platforms that have it, and on
Windows it could create a scheduled task instead.
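
A minimal sketch of that platform split, assuming an hourly schedule and
a hypothetical task name "git-maintenance". The function names are
illustrative only, and the command to launch is passed in because it is
not settled here. The sketch only builds the argument lists and does not
run anything:

```python
import sys


def schedule_argv(command: str, platform: str = sys.platform) -> list[str]:
    """Return an argv that would register `command` to run hourly."""
    if platform.startswith("win"):
        # Windows has no cron: create a scheduled task instead.
        # "git-maintenance" is a hypothetical task name.
        return ["schtasks", "/Create", "/SC", "HOURLY",
                "/TN", "git-maintenance", "/TR", command]
    # Elsewhere, assume cron exists; "crontab -" reads the new
    # crontab from standard input.
    return ["crontab", "-"]


def cron_entry(command: str) -> str:
    """Return a crontab line that runs `command` at minute 0 of every hour."""
    return f"0 * * * * {command}\n"
```

Note that "crontab -" replaces the user's whole crontab, so a real
implementation would have to merge this entry with the existing ones.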

But what should we launch? It should probably be a Git command that
checks config for a list of repositories, then runs "the maintenance
command" on each of those repos.

I'm inserting a break here to draw the eye to a new proposed design:

---

Create a "git maintenance" builtin. This has a few subcommands:

1. "run" will run the configured maintenance on the current repo. This
   should become the single entry point for users to say "please clean
   up my repo." What _exactly_ it does can be altered with config. I'll
   list some possibilities after listing the subcommands.

2. "run-on-repos" uses command-line arguments or config to launch "git
   -C <dir> maintenance run" for all configured directories. The
   intention is that this is launched on some schedule by a
   platform-specific scheduling mechanism (e.g. cron).
   (This subcommand could use a better name.)

3. "schedule" adds the current repository to the configured list of
   repositories for running with "run-on-repos". It will also initialize
   the platform-specific scheduling mechanism. This may be to start the
   schedule for the first time OR to update how frequently "run-on-repos"
   is run, as appropriate.

4. (OPTIONAL) "mode <mode>" adjusts the config for the current repo to
   change the type of maintenance requested for this repo. For example,
   "simple" could just run "git gc --auto" using a normal range.
   "incremental" could run the maintenance tasks from this series.
   Finally, "server" could run maintenance tasks as if we are serving
   the repo to others, so we repack aggressively with full bitmaps, and
   more frequently.
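
As a rough sketch of what "run-on-repos" could do, assuming the list of
repositories lives in a multi-valued global config key (the key name
"maintenance.repo" is a placeholder, not part of this proposal) and that
"git maintenance run" exists as described in item 1:

```python
import subprocess

# Hypothetical multi-valued config key; the proposal does not name one.
REPO_CONFIG_KEY = "maintenance.repo"


def configured_repos(config_output: str) -> list[str]:
    """Parse `git config --get-all <key>` output into directory paths."""
    return [line for line in config_output.splitlines() if line.strip()]


def maintenance_argv(repo_dir: str) -> list[str]:
    """Build the proposed per-repository command line."""
    return ["git", "-C", repo_dir, "maintenance", "run"]


def run_on_repos() -> None:
    """Run the proposed maintenance command in each configured repo."""
    result = subprocess.run(
        ["git", "config", "--global", "--get-all", REPO_CONFIG_KEY],
        capture_output=True, text=True, check=False)
    for repo in configured_repos(result.stdout):
        # A failure in one repository should not stop the others.
        subprocess.run(maintenance_argv(repo), check=False)
```

Running each repository in its own subprocess keeps one broken repo
from blocking maintenance on the rest.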

Here are some possible maintenance tasks. Not all of them would
be appropriate to run on the same repo, or at least not with the
same frequency:

* "fetch" : the background fetch from PATCH 3. Appropriate for all modes,
  but perhaps users should have to opt in to this in the basic mode.

* "commit-graph" : the incremental commit-graph writes from PATCH 2.
  Appropriate whenever the "fetch" command is being run, but also
  valuable for the "server" mode.

* "gc" : Run "git gc --auto". This would be enabled by default, but
  should be disabled for the "incremental" and "server" modes.

* "repack" : Run "git repack <options>" with appropriate options based
  on config. The "server" mode would include custom delta and bitmap
  options. (I will leave the specifics to those who maintain servers to
  recommend the best options for "server" mode.)

* "loose-objects" : see PATCH 4. Appropriate for "incremental" mode.

* "multi-pack-index" or "incremental-repack" : Run the "pack-files" job
  from PATCH 5. Appropriate for "incremental" mode.

* "pack-refs" : create a packed-refs file or repack the reftable as
  appropriate for those features. (I have less familiarity with these.)
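
One possible reading of the list above as a mode-to-task table. The
mapping is an assumption drawn from the descriptions, not a settled
design: it treats the "basic" mode as the "simple" mode, picks
"incremental-repack" as the name for the pack-files job, and leaves out
"pack-refs" because no mode is stated for it:

```python
# Task names come from the list above; which mode enables which task
# is an interpretation of that list.
TASKS_BY_MODE = {
    "simple": ["gc"],
    "incremental": ["fetch", "commit-graph", "loose-objects",
                    "incremental-repack"],
    "server": ["fetch", "commit-graph", "repack"],
}


def tasks_for(mode: str, fetch_opt_in: bool = False) -> list[str]:
    """Return the tasks to run for a mode, in order."""
    tasks = list(TASKS_BY_MODE[mode])
    if mode == "simple" and fetch_opt_in:
        # The background fetch is opt-in here, and the commit-graph
        # write is appropriate whenever the fetch runs.
        tasks = ["fetch", "commit-graph"] + tasks
    return tasks
```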

Notice that with this new set of options we could do something rather
dramatic: replace all calls to "git gc --auto" with "git maintenance
run --auto". By default, these would be equivalent. However, "git
maintenance run --auto" makes it clearer that the behavior is less
specific than "git gc" and could be configured to do something different.

I used an "--auto" option in the suggestion above to help distinguish
between the command being run as a foreground operation and as a
background operation. Part of setting up a schedule would include
disabling these "foreground" maintenance tasks and relying entirely on
the background tasks instead. The best situation would be to avoid
launching the subprocess at all.
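
A small sketch of that decision, with a hypothetical function name and
parameters. It returns the command a caller would launch for foreground
maintenance, or None when a background schedule is in place:

```python
from typing import Optional, Sequence


def auto_maintenance_argv(
        background_scheduled: bool,
        configured: Optional[Sequence[str]] = None) -> Optional[list[str]]:
    """Return the argv for foreground --auto maintenance, or None to skip."""
    if background_scheduled:
        # Rely on the background tasks; launch no subprocess at all.
        return None
    if configured:
        # Config asked for something other than the default.
        return list(configured)
    # Default: equivalent to today's automatic GC.
    return ["git", "gc", "--auto"]
```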

---

What do people think of this alternative? Does this get us closer to an
appropriate level of work for Git to do?

Thanks,
-Stolee
