[RFC] Bidirectional communication with a long-running Git process

From: Jonathan Tan <jonathantanmy@google.com>
To: git@vger.kernel.org
Cc: jabolopes@google.com, Jonathan Tan <jonathantanmy@google.com>
Subject: [RFC] Bidirectional communication with a long-running Git process
Date: Mon,  7 Feb 2022 11:03:19 -0800
Message-ID: <20220207190320.2960362-1-jonathantanmy@google.com>

Here's a proposal of what we (the Git team at Google) are planning to
build, about bidirectional communication between a long-running Git
process and another long-running process. We're planning to build this
to help with a virtual file system layer that we're also building (as
described below), but we envision this to also be useful for anyone that
needs deeper integration with Git repositories beyond running Git
commands one at a time.

Any comments or suggestions are welcome.

-----
# Git API extensions (external)

Authors: Jose Lopes (jabolopes@google.com), Jonathan Tan
(jonathantanmy@google.com)

## Objective

Google is working to improve development workflows for large Git repositories,
by means of a virtual file system layer (vfsd) which downloads contents lazily.
We have an internal prototype vfsd that we're using to experiment with this.

There are a number of correctness and performance challenges in materializing
the filesystem that we cannot address today with the currently available Git
APIs. For this reason, this document proposes several Git API extensions to:

*   List entries from the Git index file that match a certain path prefix
*   Obtain specific fields from the entries stored in the Git index file
*   Batch fetches by sending out a single network request with a variable number
    of Git objects
*   Obtain file sizes for a variable number of Git objects efficiently (via
    object-info)

The API extensions proposed in this document are not the only ones that we will
ever need. As the project development progresses, our developers will need to
design, implement and propose new API extensions to Git to address future needs.

The APIs will be called on a persistent connection between Git and vfsd to avoid
the performance cost of running one-off Git commands during filesystem
operations. This communication mechanism will be used for both the APIs proposed
in this doc and future ones, too.

To establish the connection, vfsd spawns Git and runs a new Git command that we
also propose in this document. Then vfsd and Git can communicate over stdin/out.
vfsd can use this to pipe commands to Git and obtain results. Conversely Git can
also send commands to vfsd and obtain results. vfsd and Git will communicate
using the pkt-line format because it’s already supported by Git.

In summary, this document proposes:

*   a generic RPC protocol based on the pkt-line format, that can be used as the
    basis of communication between Git and vfsd, but it is generic enough so
    that it can be used by any external process to talk to Git.
*   API extensions on top of this pkt-line RPC protocol, which are useful for
    virtual filesystem layers built on top of Git.
*   A new command, called *git-batch*, which implements the pkt-line RPC
    protocol and these API extensions.

## Background

Today vfsd interacts with Git in 2 different ways, namely, via the Git cat-file
daemon and by spawning one-off Git processes.

The Git cat-file daemon (i.e., *git cat-file --batch*) is a long-lived process
that accepts commands from vfsd via stdin and returns output via stdout /
stderr. The commands accepted by git-cat-file are limited (see
[git-rev-parse](https://git-scm.com/docs/git-rev-parse#_specifying_revisions)).
vfsd uses the Git cat-file daemon to obtain file contents for files displayed in
the FUSE filesystem. This results in a full connection setup/teardown with the
Git remote server for every object that is fetched, which adds user-visible
latency and hurts performance.

vfsd needs to read tree objects, obtain object sizes, obtain the contents of the
Git index file, among others, which are not supported by the Git cat-file
daemon. In these cases, vfsd ends up running other Git commands, such as
`git-cat-file -s`, `git-fetch`, `git-ls-files`, etc. These commands are run by
spawning one-off Git processes. This incurs a performance penalty every time
vfsd needs to run a Git command since it requires creating a new system process
(via `fork`/`exec`).

For obtaining file sizes, vfsd uses a one-off Git process (`git-cat-file -s`),
but we would like to use the Git cat-file daemon to obtain size information
also. This would allow us to reuse the existing infrastructure for the Git
cat-file daemon and also get better performance by avoiding spawning one-off
processes to obtain size information. Also, vfsd needs a performant API to
obtain size information for a variable number of objects. This API does not
exist yet.

For batch fetching, vfsd uses a one-off Git process (`git fetch`) with several
flags, one of which is `--filter`, which tells Git which types of objects to
leave out of the fetch. Just like when obtaining file sizes, we would like to
move away from one-off Git processes and employ the Git cat-file daemon for
batch fetching.

For parsing the contents of the index file, vfsd also uses a one-off Git process
(`git ls-files -st`). It’s worth mentioning that the `-s` option is used to
obtain the staged contents’ mode bits, object name (aka SHA1 hash), and stage
number. And the `-t` option is used to obtain the file status (`H` - cached,
`S` - skip-worktree, `C` - changed, etc). This `ls-files` command parses the
contents of the Git index file in full, but we would like to have the option to
parse only the entries that match a certain path prefix. This is an important
optimization especially when dealing with large repositories.

## Design

We propose to introduce a new command called `git-batch` which is similar to
`git-cat-file --batch` but uses the pkt-line RPC format (described in this
document) instead of `git-rev-parse` patterns.

Using this new command, we will implement the following new APIs:

*   API to list the contents of the index file
*   API to obtain file sizes
*   API to fetch in batch

### pkt-line RPC

We start by introducing the pkt-line RPC protocol. The external process (i.e.,
vfsd) and Git will communicate via stdin/out using the protocol described in
this section, which is embedded in the
[pkt-line format](https://git-scm.com/docs/protocol-common#_pkt_line_format):

Syntax:

```
PKT_LINE := $PKT_LEN_HEX $FRAME

FRAME := $ID $STREAM_OP [$MSG]

STREAM_OP := b|e|k|be

MSG := $MSG_TYPE [$DATA]

MSG_TYPE := o|E|c
```

The pkt-line RPC protocol differs from, e.g., HTTP/1.1, in which one process
is the client and the other is the server, and only the client can initiate
requests to the server. In pkt-line RPC, both processes act as sender and
receiver to exchange frames. So the terms sender and receiver do not denote
fixed roles of a process. Rather, for each frame that is sent, one process is
the sender and the other the receiver.

The protocol is full duplex, i.e., the sender and receiver can exchange frames
without having to wait for the full request or full response to be sent /
received.
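
As a rough illustration of the `PKT_LINE := $PKT_LEN_HEX $FRAME` rule above, the
sketch below wraps a frame in a pkt-line and unwraps it again. It relies on the
documented pkt-line convention that the four hex digits count themselves plus
the payload. The function names are made up for this example, and special
packets such as the flush packet (`0000`) are simply rejected here.

```python
from typing import Tuple

MAX_PKT_DATA = 65516  # maximum pkt-line payload, per Git's pkt-line documentation


def pkt_line_encode(frame: bytes) -> bytes:
    """Prefix a frame with its 4-hex-digit length (the length includes the prefix)."""
    if len(frame) > MAX_PKT_DATA:
        raise ValueError("frame does not fit in a single pkt-line")
    return b"%04x" % (len(frame) + 4) + frame


def pkt_line_decode(buf: bytes) -> Tuple[bytes, bytes]:
    """Decode one pkt-line from the front of buf; return (frame, remaining bytes)."""
    length = int(buf[:4].decode("ascii"), 16)
    # Lengths below 4 are special packets (e.g. flush), not handled in this sketch.
    if length < 4 or length > len(buf):
        raise ValueError("special, malformed or truncated pkt-line")
    return buf[4:length], buf[length:]
```

Because each pkt-line carries its own length, several frames can be written
back to back on the same pipe and split apart again by the reader.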

#### Terminology

Terminology used in the subsequent sections that describe the protocol:

*   Frame: piece of data of known length containing both protocol metadata and
    application data.
*   Sender: process that sent a frame.
*   Receiver: process that received a frame.
*   Stream: ordered sequence of 1 or more messages
*   Message: ordered sequence of 1 or more frames
*   Stream operation: protocol metadata that indicates if a stream begins,
    continues, or ends.
*   Request: a stream started and completed by the sender.
*   Response: a stream started and completed by the receiver that pairs with a
    request via `$ID`.
*   Application: either the Git process or the external process talking to Git.
*   Control frame: a frame containing only protocol metadata and no message.
*   RPC: a pair of request and response with the same ID.

#### FRAME

The `$FRAME` contains a protocol frame and it is embedded in a pkt-line. To
avoid head-of-line blocking problems, frames can be interleaved. Frames are
always sent in streams. Streams group 1 or more frames. The sender and receiver
keep track of open streams. A stream is initiated by a request from either side
and terminated by a response with the same `$ID` as the request.

The data component of a pkt-line is limited to 65516 bytes, so large requests /
responses / blobs may not fit in a single frame. To overcome this, messages can
span multiple
frames. A continuation mechanism is used to indicate that the message is
incomplete and continues on the next frame of the same stream. The sender and
receiver keep track of whether there is an active continuation on a stream or
not.
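
A minimal sketch of how a sender might split one large message into
continuation frames. It assumes that the fields of a frame are separated by
single spaces (as in the examples in this document) and that IDs are decimal
integers. `make_frame` and `frames_for_message` are hypothetical helper names,
not part of any existing Git API.

```python
from typing import List

MAX_PKT_DATA = 65516  # pkt-line payload limit


def make_frame(stream_id: int, op: bytes, msg_type: bytes, data: bytes) -> bytes:
    # Assumed layout: "ID OP TYPE[ DATA]", fields separated by one space.
    head = b"%d %s %s" % (stream_id, op, msg_type)
    return head + b" " + data if data else head


def frames_for_message(stream_id: int, data: bytes,
                       max_frame: int = MAX_PKT_DATA) -> List[bytes]:
    """Build the frames of a stream that carries exactly one message."""
    # Reserve room for the longest header this function can emit.
    room = max_frame - len(b"%d be o " % stream_id)
    if room <= 0:
        raise ValueError("max_frame is too small for the frame header")
    chunks = [data[i:i + room] for i in range(0, len(data), room)] or [b""]
    if len(chunks) == 1:
        # Single message, single frame: use the `be` shortcut.
        return [make_frame(stream_id, b"be", b"o", chunks[0])]
    frames = []
    for i, chunk in enumerate(chunks):
        last = i == len(chunks) - 1
        op = b"b" if i == 0 else (b"e" if last else b"k")
        # Every chunk but the last is a continuation (`c`); the last one is `o`.
        frames.append(make_frame(stream_id, op, b"o" if last else b"c", chunk))
    return frames
```

With a tiny `max_frame` of 10, `frames_for_message(1, b"abcdefgh", 10)` yields
`1 b c abc`, `1 k c def` and `1 e o gh`.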

#### ID

The `$ID` field is an alphanumeric identifier and it is multi-purpose:

*   Identifies frames belonging to the same stream.
*   Pairs requests with responses.
*   Allows frames to be interleaved.

IDs can be reused provided they are free. The sender and receiver keep track of
busy IDs. An ID is busy (or not free) if there is an open stream with that ID,
which is the same as saying that a request was sent with that ID but the
response was not yet received. An ID is free (or not busy) either because no
request has been sent with that ID or because a response with the same ID has
already been received.

To avoid the sender and receiver accidentally choosing the same ID concurrently,
the external process will use positive IDs and Git will use negative IDs.

An RPC is a request and response with the same `$ID`. If the sender wants to
initiate a request but does not have a message to send, it can send a control
frame (i.e., a frame without `$MSG`) as a request. Conversely, if the receiver
wants to send a response but does not have a message to return, it can send a
control frame. These control frames are essential for senders and receivers to
know when IDs become free.

It is a protocol error to reuse an ID that is still being used. It’s a protocol
error to omit a request or response in an RPC.
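
The bookkeeping of busy and free IDs could look roughly like the sketch below.
It assumes IDs are integers (positive for the external process, negative for
Git, as proposed above) and reuses the lowest free value. The class name
`IdAllocator` is purely illustrative.

```python
class IdAllocator:
    """Tracks the busy IDs of one side of the connection."""

    def __init__(self, sign: int = 1) -> None:
        # sign=1 for the external process, sign=-1 for Git.
        if sign not in (1, -1):
            raise ValueError("sign must be 1 or -1")
        self._sign = sign
        self._busy = set()

    def acquire(self) -> int:
        """Pick a free ID for a new request and mark it busy."""
        n = 1
        while self._sign * n in self._busy:
            n += 1
        stream_id = self._sign * n
        self._busy.add(stream_id)
        return stream_id

    def release(self, stream_id: int) -> None:
        """Mark an ID free again once the matching response has been received."""
        if stream_id not in self._busy:
            raise KeyError("ID %d is not busy" % stream_id)
        self._busy.remove(stream_id)
```

Calling `release` only when the response stream has ended is what makes the
control-frame responses described above necessary: without them, an ID would
stay busy forever.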

#### STREAM\_OP

The `$STREAM_OP` field is used to control stream operations:

*   A value of `b` (aka begin) indicates the beginning of a stream.
*   A value of `k` (aka keep) sends a message on an open stream.
*   A value of `e` (aka end) indicates the end of a stream.
*   A value of `be` is an optimization to avoid sending empty frames. This is
    only used in streams that have a single message *and* a single frame. It is
    equivalent to sending a `b` frame and an `e` frame, where the `e` frame has
    no `$MSG`.

It is a protocol error to mishandle stream operations, for example, to begin a
stream that is already started, or end a stream that is not started, or send a
`k` frame on a stream that is not started.
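
The stream-operation rules above amount to a small state machine per
direction. The following sketch tracks the streams received from one peer and
raises on the protocol errors listed. `ProtocolError` and `StreamTracker` are
names invented for this example.

```python
class ProtocolError(Exception):
    """Raised when a frame violates the stream-operation rules."""


class StreamTracker:
    """Tracks which stream IDs are currently open in one direction."""

    def __init__(self) -> None:
        self._open = set()

    def open_streams(self) -> set:
        return set(self._open)

    def on_frame(self, stream_id: int, op: str) -> None:
        if op in ("b", "be"):
            if stream_id in self._open:
                raise ProtocolError("stream %d is already started" % stream_id)
            if op == "b":
                self._open.add(stream_id)
            # "be" begins and ends the stream in one frame, so nothing stays open.
        elif op in ("k", "e"):
            if stream_id not in self._open:
                raise ProtocolError("stream %d is not started" % stream_id)
            if op == "e":
                self._open.remove(stream_id)
        else:
            raise ProtocolError("unknown stream operation %r" % op)
```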

#### MSG

The `$MSG` contains an application message. If this is omitted, then the frame
is a control frame, containing only protocol data. If `$MSG` is specified, then
this frame contains application data meant to be delivered to the receiver
application.

#### MSG\_TYPE

The `$MSG_TYPE` field is the message type.

A value of `o` (aka ok) contains a whole message if there is no active
continuation. Otherwise, it contains the last part of a message and marks the
end of the active continuation.

A value of `E` (aka error) contains a whole error message. If there is an active
continuation, this continuation ends, and messages sent in those continuation
frames are discarded. The error is delivered to the application.

A value of `c` (aka continuation) contains part of a message. If there is no
active continuation, this also starts an active continuation. Otherwise, it
indicates that the active continuation continues onto the next frame.

It is a protocol error to start a continuation and not finish it either with an
`o` frame or an `E` frame. A `be` frame does not automatically finish an active
continuation because it does not indicate whether the message is an error or
not.

The `$MSG_TYPE` is optional. If omitted, then `$DATA` must also be omitted, in
which case this frame is a control frame and no message is delivered to the
application.
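
One way a receiver could apply these message-type rules is sketched below:
`c` frames are buffered per stream, `o` completes the buffered message, and `E`
discards whatever was buffered and delivers only the error. The class name and
the `("ok" | "error", data)` return convention are assumptions made for this
illustration.

```python
from typing import Dict, List, Optional, Tuple


class Reassembler:
    """Rebuilds application messages from `o`, `E` and `c` frames."""

    def __init__(self) -> None:
        # stream ID -> chunks of the active continuation on that stream
        self._pending: Dict[int, List[bytes]] = {}

    def has_active_continuation(self, stream_id: int) -> bool:
        return stream_id in self._pending

    def on_message(self, stream_id: int, msg_type: str,
                   data: bytes) -> Optional[Tuple[str, bytes]]:
        """Return a message to deliver to the application, or None if incomplete."""
        if msg_type == "c":
            self._pending.setdefault(stream_id, []).append(data)
            return None
        # Both `o` and `E` end any active continuation on this stream.
        chunks = self._pending.pop(stream_id, [])
        if msg_type == "o":
            return ("ok", b"".join(chunks) + data)
        if msg_type == "E":
            return ("error", data)  # buffered chunks are discarded
        raise ValueError("unknown message type %r" % msg_type)
```

A receiver would also check `has_active_continuation` when a stream ends, since
ending a stream with an unfinished continuation is a protocol error.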

#### DATA

The `$DATA` field is optional and can contain an empty or non-empty message. An
empty message is not given special treatment and it’s delivered to the
application like non-empty messages. This allows APIs to return an OK message
without any actual data to indicate that the request was successful, for
example:

```
> 1 be o fetch $SHA1
< 1 be o
```

Sender (i.e., `>`) sends a request to fetch an object. Receiver (i.e., `<`)
responds with an OK message with empty `$DATA`.

#### Examples

```
1 be o hello world
```

Send a single message. This uses the `be` optimization because it’s a stream
with a single message and single frame.

```
1 be E no such file or directory
```

Send a single error message. This uses the `be` optimization for the same
reasons as above.

```
1 b c $LONG_DATA1
1 k c $LONG_DATA2
1 e o $LONG_DATA3
```

Send a single long message. The `$LONG_DATA` message is too large, therefore
it’s split into `$LONG_DATA1`, `$LONG_DATA2`, and `$LONG_DATA3`, using `c` to
indicate the continuation frames. The receiver reassembles the continuation
frames and delivers a single message to the application.

```
1 b o $SMALL_BLOB1
1 k o $SMALL_BLOB2
1 e o $SMALL_BLOB3
```

Send 3 individual messages on a stream without errors.

```
1 b o $SMALL_BLOB1
1 k o $SMALL_BLOB2
1 e E $SMALL_ERROR
```

Send 3 individual messages on a stream. The last message is an error message.
The receiver delivers 3 messages to the application.

```
1 b c $A_DATA1
1 k o $A_DATA2
1 k c $B_DATA1
1 e o $B_DATA2
```

Send multiple long messages without errors. The `$A_DATA` message is split into
2 frames (`$A_DATA1` and `$A_DATA2`) using continuation frames. The same is done
for the `$B_DATA` message. The receiver reassembles the continuation frames and
delivers 2 messages to the application.

```
1 b c $A_DATA1
1 k o $A_DATA2
1 k c $B_DATA1
1 e E $ERROR
```

Send multiple long messages with errors. The receiver delivers `$A_DATA`
(`$A_DATA1` + `$A_DATA2`) to the application. The `$B_DATA1` is discarded, and
`$ERROR` is delivered to the application.

#### Nuanced cases

There are some nuanced cases, so we want to make sure that the protocol works as
expected and that there are no ambiguities.

```
1 b o hello
1 e
```

Send a single message to the receiver in 2 frames. The second frame is empty and
does not contain a message, but it contains a stream operation to end the
stream. This is equivalent to sending a single `be` frame but without the
optimization that saves the empty frame.

```
1 be
```

Send a control frame without any messages. No messages are delivered to the
application.

```
1 be o
```

Send an empty message. An empty message is delivered to the application.
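
The difference between a control frame and an empty message can be made
concrete with a small parser. This is only a sketch: it assumes that fields are
separated by single spaces and that IDs are decimal integers, neither of which
is pinned down by the syntax above, and `parse_frame` is a hypothetical name.

```python
from typing import Optional, Tuple

Frame = Tuple[int, str, Optional[str], Optional[bytes]]


def parse_frame(frame: bytes) -> Frame:
    """Return (id, stream_op, msg_type, data); the last two are None for control frames."""
    parts = frame.split(b" ", 3)  # DATA may itself contain spaces
    if len(parts) < 2:
        raise ValueError("frame needs at least an ID and a stream operation")
    stream_id = int(parts[0])
    op = parts[1].decode("ascii")
    if op not in ("b", "e", "k", "be"):
        raise ValueError("unknown stream operation %r" % op)
    if len(parts) == 2:
        return stream_id, op, None, None  # control frame: nothing is delivered
    msg_type = parts[2].decode("ascii")
    if msg_type not in ("o", "E", "c"):
        raise ValueError("unknown message type %r" % msg_type)
    # A message type without data is an empty message, which is still delivered.
    data = parts[3] if len(parts) == 4 else b""
    return stream_id, op, msg_type, data
```

Here `parse_frame(b"1 be")` reports no message at all, while
`parse_frame(b"1 be o")` reports an `o` message whose data is empty.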

Now that we have the protocol basics (framing, interleaving and streaming), we
can now define the APIs. The APIs below fit in the `$DATA` part shown in the
syntax above.

### API to list contents of the index file

vfsd needs an API to parse the Git index file, to extract certain fields from
it, and match only certain paths.

Request syntax:

```
ls-index [path:$PATH\0] [fields:%($FIELD1)%($FIELD2)...]
```

Response syntax:

```
[$FIELD1:$VALUE1] [file:$PATH\0] [$FIELD2:$VALUE2 ...]
```

The command `ls-index` lists the contents of the index file.

The argument `path:` is a path selector to list paths that match the given
prefix. The wildcard `*` matches immediate subentries. The wildcard `**`
recursively matches all subentries.

The argument `fields:` is a field selector to select which fields to return from
the matched entries.

In the response, the filename has a variable length, so it’s terminated by a NUL
byte (`\0`).

Examples:

```
path:dir/*
```

matches all immediate children of `dir/` and returns the Git index entries for
the paths `dir/myfile` and `dir/mydir`, but not for `dir/mydir/file`.

```
path:*
```

matches all immediate children of the root directory, e.g., `myfile` and
`mydir`, but it doesn’t match `mydir/file`.

```
path:dir/**
```

matches all entries from the Git index file that have the prefix `dir/` in their
path, so all of `dir/myfile`, `dir/mydir`, and `dir/mydir/file` match.

```
fields:%(status)%(mode)%(name)%(stage)%(file)
```

returns the file status (H, S, C, etc), file mode, object name (aka SHA1 hash),
stage number (0 to 3) and file name (e.g., README). This uses the
[git-log format string](https://git-scm.com/docs/git-log).

For performance reasons, wildcards are employed in path selectors instead of
regular expressions. Wildcards can only appear as the last component of a path
and they cannot be combined with other characters in that component, so a
pattern like `myfile*` with
the intention of matching `myfile1` and `myfileabc` is not allowed, and a
pattern like `dir/*/myfile` with the intention of matching all intermediate
subdirectories is also not allowed.
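
The restricted wildcard rules keep matching down to a prefix comparison, as the
following sketch shows. `selector_matches` is a hypothetical helper. Treating a
selector without any wildcard as an exact path match is an assumption of this
example, since the proposal does not spell that case out.

```python
def selector_matches(pattern: str, path: str) -> bool:
    """Check an index path against a `path:` selector (without the `path:` prefix)."""
    head, sep, last = pattern.rpartition("/")
    if "*" in head:
        raise ValueError("wildcards may only appear in the last path component")
    prefix = head + sep  # e.g. "dir/" for "dir/*", "" for "*"
    if last == "**":
        # Recursive: any entry strictly below the prefix.
        return path.startswith(prefix) and len(path) > len(prefix)
    if last == "*":
        # Immediate children only: no further "/" after the prefix.
        rest = path[len(prefix):]
        return path.startswith(prefix) and rest != "" and "/" not in rest
    if "*" in last:
        raise ValueError("wildcards cannot be combined with other characters")
    return path == pattern  # assumption: no wildcard means an exact match
```

No backtracking or pattern compilation is needed, which is the performance
argument for preferring these wildcards over regular expressions.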

### API to get blob size

vfsd needs this API to obtain file sizes. For performance reasons, it’s
critical that vfsd can batch requests, i.e., send a single network request for a
number of objects at once.

Request syntax:

```
size $NAME1 [$NAME2 ...]
```

Response syntax:

```
$SIZE1 [$SIZE2 ...]
```

Example:

```
size 6363ba80dc6f90ac2b016adef8b9186cec3e431e
```

returns the size of the blob with the given name.

```
size 6adef8b9186cec3e431e6363ba80dc6f90ac2b01 cec3e431e6363ba80dc6f90ac2b016adef8b9186
```

returns the sizes of the blobs with the given names.
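
On the caller's side, the request and response could be handled roughly as
follows. This sketch assumes that object names are 40-character SHA-1 hex
strings, that sizes come back as space-separated decimal numbers, and that they
appear in the same order as the requested names. The proposal shows the
positional syntax but does not state these details explicitly, and both function
names are hypothetical.

```python
import re
from typing import Dict, List

_SHA1_RE = re.compile(r"[0-9a-f]{40}")  # assumption: SHA-1 object names


def build_size_request(names: List[str]) -> str:
    """Build the `$DATA` of a `size` request for one or more object names."""
    if not names:
        raise ValueError("at least one object name is required")
    for name in names:
        if not _SHA1_RE.fullmatch(name):
            raise ValueError("not a full SHA-1 object name: %r" % name)
    return "size " + " ".join(names)


def parse_size_response(names: List[str], response: str) -> Dict[str, int]:
    """Pair each requested name with its size, assuming sizes keep request order."""
    sizes = response.split()
    if len(sizes) != len(names):
        raise ValueError("expected %d sizes, got %d" % (len(names), len(sizes)))
    return {name: int(size) for name, size in zip(names, sizes)}
```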

### API to fetch in batch

vfsd needs this API to fetch several objects in a batch so that subsequent
commands that interact with those objects do not block on the network.

Request syntax:

```
fetch $NAME1 [$NAME2 ...]
```

Response syntax:

```
No response.
```

Example:

```
fetch 6adef8b9186cec3e431e6363ba80dc6f90ac2b01 cec3e431e6363ba80dc6f90ac2b016adef8b9186
```

fetches the given objects in a single network request from the remote and stores
them locally. It may be necessary to extend this with the name of the remote in
order to accommodate Git repositories with multiple remotes.

## Experimental

The new Git command `git-batch` will be first released as
`git-batch-experimental` because it:

*   Communicates to the users that the API is in development and that it can
    change drastically or even be removed
*   Communicates to the users that backwards compatibility for this API is not
    guaranteed
*   Allows the Git maintainers to accept a feature in Git for development
    purposes but without the risk of maintaining backwards compatibility for a
    feature that is not useful
*   Allows our developers to continue developing these APIs incrementally /
    iteratively.

Once these APIs have matured, we can start a discussion on stabilizing them and
define a path to remove the experimental bit.