All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone
@ 2016-09-12 22:02 Ori Rawlings
  2016-09-12 22:02 ` Ori Rawlings
  2016-09-15 21:17 ` [PATCH v2 0/1] git-p4: " Ori Rawlings
  0 siblings, 2 replies; 7+ messages in thread
From: Ori Rawlings @ 2016-09-12 22:02 UTC (permalink / raw)
  To: git
  Cc: Vitor Antunes, Lars Schneider, Luke Diamand, Pete Wyckoff, Ori Rawlings

Importing a long history from Perforce into git using the git-p4 tool
can be especially challenging. The `git p4 clone` operation is based
on an all-or-nothing transactionality guarantee. Under real-world
conditions like network unreliability or a busy Perforce server,
`git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
user to restart the import process from the beginning. The longer the
history being imported, the more likely a fault occurs during the
process. Long enough imports thus become statistically unlikely to ever
succeed.

I'm looking for feedback on a potential approach for addressing the
problem. My idea was to leverage the checkpoint feature of git 
fast-import. I've included a patch which exposes a new option to the 
sync/clone commands in the git-p4 tool. The option enables explict 
checkpoints on a periodic basis (approximately every x seconds).

If the sync/clone command fails during processing of Perforce changes, 
the user can craft a new git p4 sync command that will identify 
changes that have already been imported and proceed with importing 
only changes more recent than the last successful checkpoint.

Assuming this approach makes sense, there are a few questions/items I
have:

  1. To add tests for this option, I'm thinking I'd need to simulate a 
     Perforce server or client that exits abnormally after first 
     processing some operations successfully. I'm looking for 
     suggestions on sane approaches for implementing that.
  2. From a usability perspective, I think it makes sense to print 
     out a message upon clone/sync failure if the user has enabled the 
     option. This message would describe how long ago the last 
     successful checkpoint was completed and document what command/s 
     to execute to continue importing Perforce changes. Ideally, the 
     commmand to continue would be exactly the same as the command 
     which failed, but today, clone will ignore any commits already 
     imported to git. There are some lingering TODO comments in 
     git-p4.py suggesting that clone should try to avoid reimporting
     changes. I don't mind taking a stab at addressing the TODO, but 
     am worried I'll quickly encounter edge cases in the clone/sync 
     features I don't understand.
  3. This is my first attempt at a git contribution, so I'm definitely 
     looking for feedback on commit messages, etc.


Cheers!

Ori Rawlings (1):
  [git-p4.py] Add --checkpoint-period option to sync/clone

 git-p4.py | 8 ++++++++
 1 file changed, 8 insertions(+)

-- 
2.7.4 (Apple Git-66)


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone
  2016-09-12 22:02 [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone Ori Rawlings
@ 2016-09-12 22:02 ` Ori Rawlings
  2016-09-13  8:10   ` Luke Diamand
  2016-09-15 21:17 ` [PATCH v2 0/1] git-p4: " Ori Rawlings
  1 sibling, 1 reply; 7+ messages in thread
From: Ori Rawlings @ 2016-09-12 22:02 UTC (permalink / raw)
  To: git
  Cc: Vitor Antunes, Lars Schneider, Luke Diamand, Pete Wyckoff, Ori Rawlings

Importing a long history from Perforce into git using the git-p4 tool
can be especially challenging. The `git p4 clone` operation is based
on an all-or-nothing transactionality guarantee. Under real-world
conditions like network unreliability or a busy Perforce server,
`git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
user to restart the import process from the beginning. The longer the
history being imported, the more likely a fault occurs during the
process. Long enough imports thus become statistically unlikely to ever
succeed.

The underlying git fast-import protocol supports an explicit checkpoint
command. The idea here is to optionally allow the user to force an
explicit checkpoint every <x> seconds. If the sync/clone operation fails
branches are left updated at the appropriate commit available during the
latest checkpoint. This allows a user to resume importing Perforce
history while only having to repeat at most approximately <x> seconds
worth of import activity.

Signed-off-by: Ori Rawlings <orirawlings@gmail.com>
---
 git-p4.py | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/git-p4.py b/git-p4.py
index fd5ca52..40cb64f 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -2244,6 +2244,7 @@ class P4Sync(Command, P4UserMap):
                 optparse.make_option("-/", dest="cloneExclude",
                                      action="append", type="string",
                                      help="exclude depot path"),
+                optparse.make_option("--checkpoint-period", dest="checkpointPeriod", type="int", help="Period in seconds between explict git fast-import checkpoints (by default, no explicit checkpoints are performed)"),
         ]
         self.description = """Imports from Perforce into a git repository.\n
     example:
@@ -2276,6 +2277,7 @@ class P4Sync(Command, P4UserMap):
         self.tempBranches = []
         self.tempBranchLocation = "refs/git-p4-tmp"
         self.largeFileSystem = None
+        self.checkpointPeriod = -1
 
         if gitConfig('git-p4.largeFileSystem'):
             largeFileSystemConstructor = globals()[gitConfig('git-p4.largeFileSystem')]
@@ -3031,6 +3033,8 @@ class P4Sync(Command, P4UserMap):
 
     def importChanges(self, changes):
         cnt = 1
+        if self.checkpointPeriod > -1:
+            self.lastCheckpointTime = time.time()
         for change in changes:
             description = p4_describe(change)
             self.updateOptionDict(description)
@@ -3107,6 +3111,10 @@ class P4Sync(Command, P4UserMap):
                                 self.initialParent)
                     # only needed once, to connect to the previous commit
                     self.initialParent = ""
+
+                    if self.checkpointPeriod > -1 and time.time() - self.lastCheckpointTime > self.checkpointPeriod:
+                        self.checkpoint()
+                        self.lastCheckpointTime = time.time()
             except IOError:
                 print self.gitError.read()
                 sys.exit(1)
-- 
2.7.4 (Apple Git-66)


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone
  2016-09-12 22:02 ` Ori Rawlings
@ 2016-09-13  8:10   ` Luke Diamand
  0 siblings, 0 replies; 7+ messages in thread
From: Luke Diamand @ 2016-09-13  8:10 UTC (permalink / raw)
  To: Ori Rawlings; +Cc: Git Users, Vitor Antunes, Lars Schneider, Pete Wyckoff

On 12 September 2016 at 23:02, Ori Rawlings <orirawlings@gmail.com> wrote:
> Importing a long history from Perforce into git using the git-p4 tool
> can be especially challenging. The `git p4 clone` operation is based
> on an all-or-nothing transactionality guarantee. Under real-world
> conditions like network unreliability or a busy Perforce server,
> `git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
> user to restart the import process from the beginning. The longer the
> history being imported, the more likely a fault occurs during the
> process. Long enough imports thus become statistically unlikely to ever
> succeed.

That would never happen :-)

The usual thing that I find is that my Perforce login session expires.

>
> The underlying git fast-import protocol supports an explicit checkpoint
> command. The idea here is to optionally allow the user to force an
> explicit checkpoint every <x> seconds. If the sync/clone operation fails
> branches are left updated at the appropriate commit available during the
> latest checkpoint. This allows a user to resume importing Perforce
> history while only having to repeat at most approximately <x> seconds
> worth of import activity.

I think this ought to work, and could be quite useful. It would be
good to have some kind of test case for it though, and updated
documentation.

Luke

>
> Signed-off-by: Ori Rawlings <orirawlings@gmail.com>
> ---
>  git-p4.py | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/git-p4.py b/git-p4.py
> index fd5ca52..40cb64f 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -2244,6 +2244,7 @@ class P4Sync(Command, P4UserMap):
>                  optparse.make_option("-/", dest="cloneExclude",
>                                       action="append", type="string",
>                                       help="exclude depot path"),
> +                optparse.make_option("--checkpoint-period", dest="checkpointPeriod", type="int", help="Period in seconds between explict git fast-import checkpoints (by default, no explicit checkpoints are performed)"),
>          ]
>          self.description = """Imports from Perforce into a git repository.\n
>      example:
> @@ -2276,6 +2277,7 @@ class P4Sync(Command, P4UserMap):
>          self.tempBranches = []
>          self.tempBranchLocation = "refs/git-p4-tmp"
>          self.largeFileSystem = None
> +        self.checkpointPeriod = -1

Or use None?

>
>          if gitConfig('git-p4.largeFileSystem'):
>              largeFileSystemConstructor = globals()[gitConfig('git-p4.largeFileSystem')]
> @@ -3031,6 +3033,8 @@ class P4Sync(Command, P4UserMap):
>
>      def importChanges(self, changes):
>          cnt = 1
> +        if self.checkpointPeriod > -1:
> +            self.lastCheckpointTime = time.time()

Could you just always set the lastCheckpointTime?

>          for change in changes:
>              description = p4_describe(change)
>              self.updateOptionDict(description)
> @@ -3107,6 +3111,10 @@ class P4Sync(Command, P4UserMap):
>                                  self.initialParent)
>                      # only needed once, to connect to the previous commit
>                      self.initialParent = ""
> +
> +                    if self.checkpointPeriod > -1 and time.time() - self.lastCheckpointTime > self.checkpointPeriod:
> +                        self.checkpoint()
> +                        self.lastCheckpointTime = time.time()

If you use time.time(), then this could fail to work as expected if
the system time goes backwards (e.g. NTP updates). However, Python 2
doesn't have access to clock_gettime() without jumping through hoops,
so perhaps we leave this as a bug until git-p4 gets ported to Python
3.



>              except IOError:
>                  print self.gitError.read()
>                  sys.exit(1)
> --
> 2.7.4 (Apple Git-66)
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 0/1] git-p4: Add --checkpoint-period option to sync/clone
  2016-09-12 22:02 [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone Ori Rawlings
  2016-09-12 22:02 ` Ori Rawlings
@ 2016-09-15 21:17 ` Ori Rawlings
  2016-09-15 21:17   ` [PATCH v2 1/1] " Ori Rawlings
  1 sibling, 1 reply; 7+ messages in thread
From: Ori Rawlings @ 2016-09-15 21:17 UTC (permalink / raw)
  To: git
  Cc: Vitor Antunes, Lars Schneider, Luke Diamand, Pete Wyckoff, Ori Rawlings

Importing a long history from Perforce into git using the git-p4 tool
can be especially challenging. The `git p4 clone` operation is based
on an all-or-nothing transactionality guarantee. Under real-world
conditions like network unreliability or a busy Perforce server,
`git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
user to restart the import process from the beginning. The longer the
history being imported, the more likely a fault occurs during the
process. Long enough imports thus become statistically unlikely to ever
succeed.

My idea was to leverage the checkpoint feature of git fast-import.
I've included a patch which exposes a new option to the sync/clone
commands in the git-p4 tool. The option enables explict checkpoints on
a periodic basis (approximately every x seconds).

If the sync/clone command fails during processing of Perforce changes,
the user can craft a new git p4 sync command that will identify
changes that have already been imported and proceed with importing
only changes more recent than the last successful checkpoint.

In v2 of this patch series I've added some basic test scenarios,
documentation, and did some minor clean up of the implementation
based on feedback on v1.

Ori Rawlings (1):
  git-p4: Add --checkpoint-period option to sync/clone

 Documentation/git-p4.txt            | 12 ++++++-
 git-p4.py                           |  7 ++++-
 t/t9830-git-p4-checkpoint-period.sh | 59 ++++++++++++++++++++++++++++++-
 3 files changed, 78 insertions(+), 0 deletions(-)
 create mode 100755 t/t9830-git-p4-checkpoint-period.sh

-- 
git-series 0.8.10

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/1] git-p4: Add --checkpoint-period option to sync/clone
  2016-09-15 21:17 ` [PATCH v2 0/1] git-p4: " Ori Rawlings
@ 2016-09-15 21:17   ` Ori Rawlings
  2016-09-16 16:19     ` Lars Schneider
  0 siblings, 1 reply; 7+ messages in thread
From: Ori Rawlings @ 2016-09-15 21:17 UTC (permalink / raw)
  To: git
  Cc: Vitor Antunes, Lars Schneider, Luke Diamand, Pete Wyckoff, Ori Rawlings

Importing a long history from Perforce into git using the git-p4 tool
can be especially challenging. The `git p4 clone` operation is based
on an all-or-nothing transactionality guarantee. Under real-world
conditions like network unreliability or a busy Perforce server,
`git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
user to restart the import process from the beginning. The longer the
history being imported, the more likely a fault occurs during the
process. Long enough imports thus become statistically unlikely to ever
succeed.

The underlying git fast-import protocol supports an explicit checkpoint
command. The idea here is to optionally allow the user to force an
explicit checkpoint every <x> seconds. If the sync/clone operation fails
branches are left updated at the appropriate commit available during the
latest checkpoint. This allows a user to resume importing Perforce
history while only having to repeat at most approximately <x> seconds
worth of import activity.

Signed-off-by: Ori Rawlings <orirawlings@gmail.com>
---
 Documentation/git-p4.txt            | 12 ++++++-
 git-p4.py                           |  7 ++++-
 t/t9830-git-p4-checkpoint-period.sh | 59 ++++++++++++++++++++++++++++++-
 3 files changed, 78 insertions(+), 0 deletions(-)
 create mode 100755 t/t9830-git-p4-checkpoint-period.sh

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index c83aaf3..e48ed6d 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -252,6 +252,18 @@ Git repository:
 	Use a client spec to find the list of interesting files in p4.
 	See the "CLIENT SPEC" section below.
 
+--checkpoint-period <n>::
+	Issue explicit 'checkpoint' commands to the underlying
+	linkgit:git-fast-import[1] approximately every 'n' seconds. If
+	syncing or cloning from the Perforce server is interrupted, the
+	process can be resumed from the most recent checkpoint with a
+	new 'sync' invocation. This is useful in the situations where a
+	large amount of changes are being imported over an unreliable
+	network connection. Explicit checkpoints can take up to several
+	minutes each, so a suitable value for the checkpoint period is
+	approximately 1200 seconds. By default, no explicit checkpoints 
+	are performed.
+
 -/ <path>::
 	Exclude selected depot paths when cloning or syncing.
 
diff --git a/git-p4.py b/git-p4.py
index fd5ca52..4c84871 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -2244,6 +2244,7 @@ class P4Sync(Command, P4UserMap):
                 optparse.make_option("-/", dest="cloneExclude",
                                      action="append", type="string",
                                      help="exclude depot path"),
+                optparse.make_option("--checkpoint-period", dest="checkpointPeriod", type="int", help="Period in seconds between explict git fast-import checkpoints (by default, no explicit checkpoints are performed)"),
         ]
         self.description = """Imports from Perforce into a git repository.\n
     example:
@@ -2276,6 +2277,7 @@ class P4Sync(Command, P4UserMap):
         self.tempBranches = []
         self.tempBranchLocation = "refs/git-p4-tmp"
         self.largeFileSystem = None
+        self.checkpointPeriod = None
 
         if gitConfig('git-p4.largeFileSystem'):
             largeFileSystemConstructor = globals()[gitConfig('git-p4.largeFileSystem')]
@@ -3031,6 +3033,7 @@ class P4Sync(Command, P4UserMap):
 
     def importChanges(self, changes):
         cnt = 1
+        self.lastCheckpointTime = time.time()
         for change in changes:
             description = p4_describe(change)
             self.updateOptionDict(description)
@@ -3107,6 +3110,10 @@ class P4Sync(Command, P4UserMap):
                                 self.initialParent)
                     # only needed once, to connect to the previous commit
                     self.initialParent = ""
+
+                    if self.checkpointPeriod >= 0 and time.time() - self.lastCheckpointTime >= self.checkpointPeriod:
+                        self.checkpoint()
+                        self.lastCheckpointTime = time.time()
             except IOError:
                 print self.gitError.read()
                 sys.exit(1)
diff --git a/t/t9830-git-p4-checkpoint-period.sh b/t/t9830-git-p4-checkpoint-period.sh
new file mode 100755
index 0000000..6ba4914
--- /dev/null
+++ b/t/t9830-git-p4-checkpoint-period.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='git p4 checkpoint-period tests'
+
+. ./lib-git-p4.sh
+
+p4_submit_each () {
+	for file in $@
+	do
+		echo $file > "$file" &&
+		p4 add "$file" &&
+		p4 submit -d "$file"
+	done
+}
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'no explicit checkpoints' '
+	cd "$cli" &&
+	p4_submit_each file1 file2 file3 &&
+	git p4 clone --dest="$git" //depot@all &&
+	test_when_finished cleanup_git &&
+	(
+		git -C "$git" reflog refs/remotes/p4/master >lines &&
+		test_line_count = 1 lines &&
+		p4_submit_each file4 file5 file6 &&
+		git -C "$git" p4 sync &&
+		git -C "$git" reflog refs/remotes/p4/master >lines &&
+		test_line_count = 2 lines
+	)
+'
+
+test_expect_success 'restart p4d' '
+	kill_p4d &&
+	start_p4d
+'
+
+test_expect_success 'checkpoint every 0 seconds, i.e. every commit' '
+	cd "$cli" &&
+	p4_submit_each file1 file2 file3 &&
+	git p4 clone --dest="$git" --checkpoint-period 0 //depot@all &&
+	test_when_finished cleanup_git &&
+	(
+		git -C "$git" reflog refs/remotes/p4/master >lines &&
+		test_line_count = 3 lines &&
+		p4_submit_each file4 file5 file6 &&
+		git -C "$git" p4 sync --checkpoint-period 0 &&
+		git -C "$git" reflog refs/remotes/p4/master >lines &&
+		test_line_count = 6 lines
+	)
+'
+
+test_expect_success 'kill p4d' '
+	kill_p4d
+'
+
+test_done
-- 
git-series 0.8.10

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] git-p4: Add --checkpoint-period option to sync/clone
  2016-09-15 21:17   ` [PATCH v2 1/1] " Ori Rawlings
@ 2016-09-16 16:19     ` Lars Schneider
  2016-09-16 17:43       ` Ori Rawlings
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Schneider @ 2016-09-16 16:19 UTC (permalink / raw)
  To: Ori Rawlings; +Cc: git, Vitor Antunes, Luke Diamand, Pete Wyckoff


> On 15 Sep 2016, at 23:17, Ori Rawlings <orirawlings@gmail.com> wrote:
> 
> Importing a long history from Perforce into git using the git-p4 tool
> can be especially challenging. The `git p4 clone` operation is based
> on an all-or-nothing transactionality guarantee. Under real-world
> conditions like network unreliability or a busy Perforce server,
> `git p4 clone` and  `git p4 sync` operations can easily fail, forcing a
> user to restart the import process from the beginning. The longer the
> history being imported, the more likely a fault occurs during the
> process. Long enough imports thus become statistically unlikely to ever
> succeed.
> 
> The underlying git fast-import protocol supports an explicit checkpoint
> command. The idea here is to optionally allow the user to force an
> explicit checkpoint every <x> seconds. If the sync/clone operation fails
> branches are left updated at the appropriate commit available during the
> latest checkpoint. This allows a user to resume importing Perforce
> history while only having to repeat at most approximately <x> seconds
> worth of import activity.

This looks interesting! I ran into the same issue and added a parameter to the p4 commands to retry (patch not yet proposed to the mailing list):
https://github.com/autodesk-forks/git/commit/fcfc96a7814935ee6cefb9d69e44def30a90eabb

Would it make sense to print the "git-p4 resume command" in case an error happens and checkpoints are written?

Cheers,
Lars

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/1] git-p4: Add --checkpoint-period option to sync/clone
  2016-09-16 16:19     ` Lars Schneider
@ 2016-09-16 17:43       ` Ori Rawlings
  0 siblings, 0 replies; 7+ messages in thread
From: Ori Rawlings @ 2016-09-16 17:43 UTC (permalink / raw)
  To: Lars Schneider; +Cc: git, Vitor Antunes, Luke Diamand, Pete Wyckoff

On Fri, Sep 16, 2016 at 11:19 AM, Lars Schneider
<larsxschneider@gmail.com> wrote:
>
>
> This looks interesting! I ran into the same issue and added a parameter to the p4 commands to retry (patch not yet proposed to the mailing list):
> https://github.com/autodesk-forks/git/commit/fcfc96a7814935ee6cefb9d69e44def30a90eabb

I was unaware of the retry flag to the p4 command, that seems like a
useful trick too. I think both approaches might pair nicely together
(p4 optimistically retrying, but still falling back to the latest git
checkpoint if we exhaust our N retry attempts).

> Would it make sense to print the "git-p4 resume command" in case an error happens and checkpoints are written?

I was thinking something like this would be a good idea and would
certainly aide in usability. Resuming a sync is fairly
straight-forward (just re-execute the same command). Resuming a clone
is a bit more problematic, today if a depot path argument is provided
to the sync or clone command (and it is always required for clone), no
attempt is made to examine the existing git branches and limit to only
Perforce changes missing from git.

There is a lingering TODO in the script where we check the presence of
the depot path argument, with a suggestion that we should always make
an attempt to continue building upon existing history when it is
available. I think there might be a few edge cases around this
behavior that I'd need to think through. But, if I'm able to address
the TODO, then printing the command to resume the import should be
pretty straight-forward. I'll continue working on that next week.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-09-16 17:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-12 22:02 [PATCH] [git-p4.py] Add --checkpoint-period option to sync/clone Ori Rawlings
2016-09-12 22:02 ` Ori Rawlings
2016-09-13  8:10   ` Luke Diamand
2016-09-15 21:17 ` [PATCH v2 0/1] git-p4: " Ori Rawlings
2016-09-15 21:17   ` [PATCH v2 1/1] " Ori Rawlings
2016-09-16 16:19     ` Lars Schneider
2016-09-16 17:43       ` Ori Rawlings

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.