* [GSoC] Git Blog 4
@ 2021-06-13 14:17 ZheNing Hu
  2021-06-13 23:28 ` Eric Sunshine
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: ZheNing Hu @ 2021-06-13 14:17 UTC (permalink / raw)
  To: Git List; +Cc: Junio C Hamano, Christian Couder, Hariom verma, Jeff King

My blog post for the fourth week is finished.
The web version is here:
https://adlternative.github.io/GSOC-Git-Blog-3/

## Week4: Trouble is a friend

At the beginning of this week, since my previous code
broke some GitHub CI tests, I tried to fix these bugs
related to the atom `%(raw)`. The most confusing thing
is that some tests pass on your local machine
but fail in GitHub's CI.

For example, I need to add the `GPG` prerequisite to the test like this:

```sh
test_expect_success GPG 'basic atom: refs/tags/signed-empty raw' '
	git cat-file tag refs/tags/signed-empty >expected &&
	git for-each-ref --format="%(raw)" refs/tags/signed-empty >actual &&
	sanitize_pgp <expected >expected.clean &&
	sanitize_pgp <actual >actual.clean &&
	echo "" >>expected.clean &&
	test_cmp expected.clean actual.clean
'
```

Otherwise, systems that do not have GnuPG installed cannot run
this test, and it fails there.
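
As a rough illustration of why the prerequisite matters, the sketch below gates a command on a tool being installed. `run_if_available` is a hypothetical helper written for this example; it is not part of Git's test library, which handles this through the prerequisite argument of `test_expect_success` shown above.

```sh
# Hypothetical helper (not from Git's test-lib.sh): run a command only
# when the tool named in the first argument is found on PATH, and
# report a skip otherwise.
run_if_available () {
	tool=$1
	shift
	if command -v "$tool" >/dev/null 2>&1
	then
		"$@"
	else
		echo "skipped: missing $tool"
	fi
}

# With GnuPG installed this prints the message, otherwise the skip notice.
run_if_available gpg echo "gpg found: the signed-tag test can run"
```

A skip is reported instead of a failure, which is the behaviour the `GPG` prerequisite gives the test on machines without GnuPG.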

In addition, a command like `printf "%b" "a\0b\0c" >blob1` is
truncated at the first NUL on a 32-bit machine, while on 64-bit
machines it works and the NULs are stored in the file as expected.
This made me think that Git's file decompression was broken on
32-bit machines, until I used a 32-bit Ubuntu docker container to
clone the git repository and analyze the bug in depth. In the end,
I used `printf "a\0b\0c"` so that the output is no longer truncated
at the NUL on 32-bit machines. Is there a better way to write binary
data to a file than `printf` and `echo`?
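
A minimal sketch of the workaround, assuming a POSIX shell with `od`, `wc` and `mktemp` available: the octal escapes go into the `printf` format string itself, and the result is dumped as hex so that a truncated file is easy to spot. The helper name `write_nul_blob` and the temporary directory are made up for this example.

```sh
# Write the five bytes a, NUL, b, NUL, c using octal escapes in the
# printf format string (no %b involved).
write_nul_blob () {
	printf 'a\0b\0c' >"$1"
}

tmpdir=$(mktemp -d) &&
write_nul_blob "$tmpdir/blob1" &&
# Show the size in bytes and the content as hex; a file truncated at
# the first NUL would contain only the single byte 61.
wc -c <"$tmpdir/blob1" &&
od -An -tx1 "$tmpdir/blob1"
```

Checking the byte count in the test itself would catch this kind of truncation before the object is ever handed to Git.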

Since I am a newbie to docker, I would like to know if there is any
way to run Git's GitHub CI remotely or locally?

In the second half of this week, I tried to make `cat-file` reuse the
logic of `ref-filter`. I have to say that this was a very difficult process:
running `rebase -i` again and again to repair the content of previous commits,
squashing commits, splitting commits, modifying commit messages... Finally, I
submitted the patches to the Git mailing list in
[[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@gmail.com/).
Now `cat-file` has learned most of the atoms in `ref-filter`. I am very
happy to be able to make Git support richer functionality through my own code.

Regrettably, `git cat-file --batch --batch-all-objects` seems to take up
a huge amount of memory on a large repo such as git.git, and it gets
killed by Linux's OOM killer. This is mainly because we make a large
number of copies of the object's raw data. The original `git cat-file`
uses `read_object_file()` or `stream_blob()` to output the object's
raw data, but in `ref-filter` we have to use `v->s` to hold a copy of the
object's data. It is difficult to eliminate `v->s` and print the output
directly to the final output buffer, because atoms like `%(if)` and `%(else)`
need buffers on a stack to build the final output string
layer by layer, and `cmp_ref_sorting()` needs `v->s` to
compare two refs. In short, it is very difficult for `ref-filter` to reduce
copy overhead. I even thought about using the string pool API
`memintern()` to replace `xmemdupz()`, but the effect seems
negligible: a large number of objects' data would still reside in memory,
so this may not be a good method.

Anyway, stay confident. I can solve these difficult problems with
the help of mentors and reviewers. `:)`


* Re: [GSoC] Git Blog 4
  2021-06-13 14:17 [GSoC] Git Blog 4 ZheNing Hu
@ 2021-06-13 23:28 ` Eric Sunshine
  2021-06-14  3:41   ` ZheNing Hu
  2021-06-14  8:02 ` Christian Couder
  2021-06-14 13:20 ` Atharva Raykar
  2 siblings, 1 reply; 9+ messages in thread
From: Eric Sunshine @ 2021-06-13 23:28 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Christian Couder, Hariom verma, Jeff King

On Sun, Jun 13, 2021 at 10:18 AM ZheNing Hu <adlternative@gmail.com> wrote:
> My fourth week blog finished:
> The web version is here:
> https://adlternative.github.io/GSOC-Git-Blog-3/

I suppose you meant: https://adlternative.github.io/GSOC-Git-Blog-4/


* Re: [GSoC] Git Blog 4
  2021-06-13 23:28 ` Eric Sunshine
@ 2021-06-14  3:41   ` ZheNing Hu
  0 siblings, 0 replies; 9+ messages in thread
From: ZheNing Hu @ 2021-06-14  3:41 UTC (permalink / raw)
  To: Eric Sunshine
  Cc: Git List, Junio C Hamano, Christian Couder, Hariom verma, Jeff King

Eric Sunshine <sunshine@sunshineco.com> wrote on Mon, Jun 14, 2021 at 7:28 AM:
>
> On Sun, Jun 13, 2021 at 10:18 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > My fourth week blog finished:
> > The web version is here:
> > https://adlternative.github.io/GSOC-Git-Blog-3/
>
> I suppose you meant: https://adlternative.github.io/GSOC-Git-Blog-4/

Yeah, it is 4. :-)

--
ZheNing Hu


* Re: [GSoC] Git Blog 4
  2021-06-13 14:17 [GSoC] Git Blog 4 ZheNing Hu
  2021-06-13 23:28 ` Eric Sunshine
@ 2021-06-14  8:02 ` Christian Couder
  2021-06-14 12:02   ` Christian Couder
  2021-06-15  8:59   ` ZheNing Hu
  2021-06-14 13:20 ` Atharva Raykar
  2 siblings, 2 replies; 9+ messages in thread
From: Christian Couder @ 2021-06-14  8:02 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Git List, Junio C Hamano, Hariom verma, Jeff King

On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@gmail.com> wrote:

> In addition, some scripts like `printf "%b" "a\0b\0c" >blob1` will
> be truncated at the first NUL on a 32-bit machine, but it performs
> well on 64-bit machines, and NUL is normally stored in the file.
> This made me think that Git's file decompression had an error on
> a 32-bit machine before I used Ubuntu32's docker container to
> clone the git repository and In-depth analysis of bugs... In the end,
> I used `printf "a\0b\0c"` to make 32-bit machines not truncated
> in NUL. Is there a better way to write binary data onto a file than
> `printf` and `echo`?

You might want to take a look at t/t4058-diff-duplicates.sh which has
the following:

# make_tree_entry <mode> <mode> <sha1>
#
# We have to rely on perl here because not all printfs understand
# hex escapes (only octal), and xxd is not portable.
make_tree_entry () {
       printf '%s %s\0' "$1" "$2" &&
       perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
}

> Since I am a newbie to docker, I would like to know if there is any
> way to run the Git's Github CI program remotely or locally?

There are scripts in the ci/ directory, but yeah it could help if
there was a README there.

> In the second half of this week, I tried to make `cat-file` reuse the
> logic of `ref-filter`. I have to say that this is a very difficult process.
> "rebase -i" again and again to repair the content of previous commits.
> squeeze commits, split commits, modify commit messages... Finally, I
> submitted the patches to the Git mailing list in
> [[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
> logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@gmail.com/).
> Now `cat-file` has learned most of the atoms in `ref-filter`. I am very
> happy to be able to make git support richer functions through my own code.
>
> Regrettably, `git cat-file --batch --batch-all-objects` seems to take up
> a huge amount of memory on a large repo such as git.git, and it will
> be killed by Linux's oom.

In the cover letter of your patch series you say:

"There is still an unresolved issue: performance overhead is very large, so
that when we use:

git cat-file --batch --batch-all-objects >/dev/null

on git.git, it may fail."

Is this the same issue? Is it only a memory issue, or is your patch
series also making things slower?

> This is mainly because we will make a large
> number of copies of the object's raw data. The original `git cat-file`
> uses `read_object_file()` or `stream_blob()` to output the object's
> raw data, but in `ref-filter`, we have to use `v->s` to copy the object's
> data, it is difficult to eliminate `v->s` and print the output directly to the
> final output buffer. Because we may have atoms like `%(if)`, `%(else)`
> that need to use buffers on the stack to build the final output string
> layer by layer,

What does layer by layer mean here?

> or the `cmp_ref_sorting()` needs to use `v->s` to
> compare two refs. In short, it is very difficult for `ref-filter` to reduce
> copy overhead. I even thought about using the string pool API
> `memintern()` to replace `xmemdupz()`, but it seems that the effect
> is not obvious. A large number of objects' data will still reside in memory,
> so this may not be a good method.

Would it be possible to keep the data for a limited number of objects,
then print everything related to these objects, free their data and
start again with another limited number of objects?

> Anyway, stay confident. I can solve these difficult problems with
> the help of mentors and reviewers. `:)`

Sure :-)


* Re: [GSoC] Git Blog 4
  2021-06-14  8:02 ` Christian Couder
@ 2021-06-14 12:02   ` Christian Couder
  2021-06-15  8:59   ` ZheNing Hu
  1 sibling, 0 replies; 9+ messages in thread
From: Christian Couder @ 2021-06-14 12:02 UTC (permalink / raw)
  To: ZheNing Hu; +Cc: Git List, Junio C Hamano, Hariom verma, Jeff King

On Mon, Jun 14, 2021 at 10:02 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@gmail.com> wrote:

> > Since I am a newbie to docker, I would like to know if there is any
> > way to run the Git's Github CI program remotely or locally?
>
> There are scripts in the ci/ directory, but yeah it could help if
> there was a README there.

There is a "GitHub-Travis CI hints" section in Documentation/SubmittingPatches though.


* Re: [GSoC] Git Blog 4
  2021-06-13 14:17 [GSoC] Git Blog 4 ZheNing Hu
  2021-06-13 23:28 ` Eric Sunshine
  2021-06-14  8:02 ` Christian Couder
@ 2021-06-14 13:20 ` Atharva Raykar
  2021-06-15  9:06   ` ZheNing Hu
  2 siblings, 1 reply; 9+ messages in thread
From: Atharva Raykar @ 2021-06-14 13:20 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Git List, Junio C Hamano, Christian Couder, Hariom verma, Jeff King

On 13-Jun-2021, at 19:47, ZheNing Hu <adlternative@gmail.com> wrote:
> 
> [...]
> 
> Since I am a newbie to docker, I would like to know if there is any
> way to run the Git's Github CI program remotely or locally?

Whenever I push to my fork on GitHub, a GitHub Actions workflow gets
triggered. I don't think you need to do any special setup for it.

You should be able to just access it from the 'Actions' tab.

Either way, if you are looking for something specific, there is some
work going on to update the documentation to mention this:

https://lore.kernel.org/git/patch-2.3-7add00cc87-20210512T084137Z-avarab@gmail.com/

Maybe that might be of help to you :-)

---
Atharva Raykar
ಅಥರ್ವ ರಾಯ್ಕರ್
अथर्व रायकर



* Re: [GSoC] Git Blog 4
  2021-06-14  8:02 ` Christian Couder
  2021-06-14 12:02   ` Christian Couder
@ 2021-06-15  8:59   ` ZheNing Hu
  2021-06-15 12:30     ` ZheNing Hu
  1 sibling, 1 reply; 9+ messages in thread
From: ZheNing Hu @ 2021-06-15  8:59 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git List, Junio C Hamano, Hariom verma, Jeff King

Christian Couder <christian.couder@gmail.com> wrote on Mon, Jun 14, 2021 at 4:02 PM:
>
> On Sun, Jun 13, 2021 at 4:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
>
> > In addition, some scripts like `printf "%b" "a\0b\0c" >blob1` will
> > be truncated at the first NUL on a 32-bit machine, but it performs
> > well on 64-bit machines, and NUL is normally stored in the file.
> > This made me think that Git's file decompression had an error on
> > a 32-bit machine before I used Ubuntu32's docker container to
> > clone the git repository and In-depth analysis of bugs... In the end,
> > I used `printf "a\0b\0c"` to make 32-bit machines not truncated
> > in NUL. Is there a better way to write binary data onto a file than
> > `printf` and `echo`?
>
> You might want to take a look at t/t4058-diff-duplicates.sh which has
> the following:
>
> # make_tree_entry <mode> <mode> <sha1>
> #
> # We have to rely on perl here because not all printfs understand
> # hex escapes (only octal), and xxd is not portable.
> make_tree_entry () {
>        printf '%s %s\0' "$1" "$2" &&
>        perl -e 'print chr(hex($_)) for ($ARGV[0] =~ /../g)' "$3"
> }
>

Yes, perl can indeed do this, and perhaps python can do it too.
However, relying on python may raise portability issues.

> > Since I am a newbie to docker, I would like to know if there is any
> > way to run the Git's Github CI program remotely or locally?
>
> There are scripts in the ci/ directory, but yeah it could help if
> there was a README there.
>

Thanks, I think I know how to use them now.
As you said in your other message, the "GitHub-Travis CI hints" section
is exactly what I need.

> > In the second half of this week, I tried to make `cat-file` reuse the
> > logic of `ref-filter`. I have to say that this is a very difficult process.
> > "rebase -i" again and again to repair the content of previous commits.
> > squeeze commits, split commits, modify commit messages... Finally, I
> > submitted the patches to the Git mailing list in
> > [[PATCH 0/8] [GSOC][RFC] cat-file: reuse `ref-filter`
> > logic](https://lore.kernel.org/git/pull.980.git.1623496458.gitgitgadget@gmail.com/).
> > Now `cat-file` has learned most of the atoms in `ref-filter`. I am very
> > happy to be able to make git support richer functions through my own code.
> >
> > Regrettably, `git cat-file --batch --batch-all-objects` seems to take up
> > a huge amount of memory on a large repo such as git.git, and it will
> > be killed by Linux's oom.
>
> In the cover letter of your patch series you say:
>
> "There is still an unresolved issue: performance overhead is very large, so
> that when we use:
>
> git cat-file --batch --batch-all-objects >/dev/null
>
> on git.git, it may fail."
>
> Is this the same issue? Is it only a memory issue, or is your patch
> series also making things slower?
>

Yes, they are the same issue: the memory usage is too large.
Of course I should check for memory leaks first. However, this is mainly
caused by a change in the strategy cat-file uses to print object data.

The original cat-file needs fewer copies (one) in read_object_file()
or stream_blob(); now cat-file needs four (or more) copies, in
oid_object_info_extended(), grab_sub_body_contents(), append_atom(),
and pop_stack_element().

> > This is mainly because we will make a large
> > number of copies of the object's raw data. The original `git cat-file`
> > uses `read_object_file()` or `stream_blob()` to output the object's
> > raw data, but in `ref-filter`, we have to use `v->s` to copy the object's
> > data, it is difficult to eliminate `v->s` and print the output directly to the
> > final output buffer. Because we may have atoms like `%(if)`, `%(else)`
> > that need to use buffers on the stack to build the final output string
> > layer by layer,
>
> What does layer by layer mean here?
>

With multiple nested %(if) and %(else) atoms, the data may be copied
into the "previous level" buffer of the stack through pop_stack_element().
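
To make the nesting concrete, here is a small sketch of a `for-each-ref` format with one `%(if)` nested inside the `%(else)` branch of another. It assumes `git` and `mktemp` are installed; the function name `demo_nested_if`, the throwaway repository, the identity and the tag names are all made up for this demo.

```sh
# Build a throwaway repository with one lightweight and one annotated
# tag, then classify both with a nested %(if)...%(else)...%(end) format.
demo_nested_if () {
	repo=$(mktemp -d) &&
	git -C "$repo" init -q &&
	git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
		commit -q --allow-empty -m first &&
	git -C "$repo" tag v1 &&
	git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
		tag -a -m annotated v2 &&
	# Outer %(if): is there a tagger (annotated tag)?
	# Inner %(if): otherwise, is there an author (tag points at a commit)?
	git -C "$repo" for-each-ref --format='%(refname:short): %(if)%(taggername)%(then)annotated%(else)%(if)%(authorname)%(then)lightweight%(else)other%(end)%(end)' refs/tags &&
	rm -rf "$repo"
}

demo_nested_if
```

On a typical setup this is expected to print `v1: lightweight` and `v2: annotated`. The text of the inner branch is built first and then merged into the buffer of the outer level, which is the extra copy described above.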

> > or the `cmp_ref_sorting()` needs to use `v->s` to
> > compare two refs. In short, it is very difficult for `ref-filter` to reduce
> > copy overhead. I even thought about using the string pool API
> > `memintern()` to replace `xmemdupz()`, but it seems that the effect
> > is not obvious. A large number of objects' data will still reside in memory,
> > so this may not be a good method.
>
> Would it be possible to keep the data for a limited number of objects,
> then print everything related to these objects, free their data and
> start again with another limited number of objects?
>

Is the "limited number of objects" meant to reduce the overhead of free()?
That may be a good solution. But I wonder: can we simply release the memory
of each object right after printing it, instead of freeing everything
together like ref_array_clear() does?

> > Anyway, stay confident. I can solve these difficult problems with
> > the help of mentors and reviewers. `:)`
>
> Sure :-)

Thanks!
--
ZheNing Hu


* Re: [GSoC] Git Blog 4
  2021-06-14 13:20 ` Atharva Raykar
@ 2021-06-15  9:06   ` ZheNing Hu
  0 siblings, 0 replies; 9+ messages in thread
From: ZheNing Hu @ 2021-06-15  9:06 UTC (permalink / raw)
  To: Atharva Raykar
  Cc: Git List, Junio C Hamano, Christian Couder, Hariom verma, Jeff King

Atharva Raykar <raykar.ath@gmail.com> wrote on Mon, Jun 14, 2021 at 9:20 PM:
>
> On 13-Jun-2021, at 19:47, ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > [...]
> >
> > Since I am a newbie to docker, I would like to know if there is any
> > way to run the Git's Github CI program remotely or locally?
>
> Whenever I push to my fork on GitHub, a GitHub Actions workflow gets
> triggered. I don't think you need to do any special setup for it.
>

Well, sometimes I might want GitHub to give me a more human-readable
error message instead of just telling me which test failed. So I may
need to make some changes to ".cirrus.yml" to do that (just for debugging).

> You should be able to just access it from the 'Actions' tab.
>
> Either way, if you are looking for something specific, there is some
> work going on to update the documentation to mention this:
>
> https://lore.kernel.org/git/patch-2.3-7add00cc87-20210512T084137Z-avarab@gmail.com/
>
> Maybe that might be of help to you :-)
>

Thanks, it's helpful!

> ---
> Atharva Raykar
> ಅಥರ್ವ ರಾಯ್ಕರ್
> अथर्व रायकर
>

--
ZheNing Hu


* Re: [GSoC] Git Blog 4
  2021-06-15  8:59   ` ZheNing Hu
@ 2021-06-15 12:30     ` ZheNing Hu
  0 siblings, 0 replies; 9+ messages in thread
From: ZheNing Hu @ 2021-06-15 12:30 UTC (permalink / raw)
  To: Christian Couder; +Cc: Git List, Junio C Hamano, Hariom verma, Jeff King

ZheNing Hu <adlternative@gmail.com> wrote on Tue, Jun 15, 2021 at 4:59 PM:
>
> >
> > In the cover letter of your patch series you say:
> >
> > "There is still an unresolved issue: performance overhead is very large, so
> > that when we use:
> >
> > git cat-file --batch --batch-all-objects >/dev/null
> >
> > on git.git, it may fail."
> >
> > Is this the same issue? Is it only a memory issue, or is your patch
> > series also making things slower?
> >
>
> Yes, they are talking about the same thing, the memory usage is too large.
> Of course I should check for memory leaks first. However, this is mainly
> caused by changes in the strategy of cat-file printing object data.
>

In fact, the problem was indeed caused by a memory leak: batch_object_write()
forgot to free ref_array_item->value. After fixing that, although
the performance of `git cat-file --batch-all-objects --batch` is still poor, at
least it no longer triggers the OOM killer.

--
ZheNing Hu
