All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] prserv: don't wait until exit to sync
@ 2014-10-27 17:27 Ben Shelton
  2014-11-02 21:00 ` Burton, Ross
  0 siblings, 1 reply; 9+ messages in thread
From: Ben Shelton @ 2014-10-27 17:27 UTC (permalink / raw)
  To: bitbake-devel

In the commit 'prserv: Ensure data is committed', the PR server moved to
only committing transactions to the database when the PR server is
stopped.  This improves performance, but it means that if the machine
running the PR server loses power unexpectedly or if the PR server
process gets SIGKILL, the uncommitted package revision data is lost.

To fix this issue, sync the database periodically, once per 30 seconds
by default, if it has been marked as dirty.  To be safe, continue to
sync the database at exit regardless of its status.

Signed-off-by: Ben Shelton <ben.shelton@ni.com>
---
 lib/prserv/db.py   | 14 ++++++++++++++
 lib/prserv/serv.py |  7 +++++++
 2 files changed, 21 insertions(+)

diff --git a/lib/prserv/db.py b/lib/prserv/db.py
index 49f36da..3bdc046 100644
--- a/lib/prserv/db.py
+++ b/lib/prserv/db.py
@@ -19,6 +19,7 @@ class PRTable(object):
     def __init__(self, conn, table, nohist):
         self.conn = conn
         self.nohist = nohist
+        self.dirty = False
         if nohist:
             self.table = "%s_nohist" % table 
         else:
@@ -47,6 +48,11 @@ class PRTable(object):
         self.conn.commit()
         self._execute("BEGIN EXCLUSIVE TRANSACTION")
 
+    def sync_if_dirty(self):
+        if self.dirty:
+            self.sync()
+            self.dirty = False
+
     def _getValueHist(self, version, pkgarch, checksum):
         data=self._execute("SELECT value FROM %s WHERE version=? AND pkgarch=? AND checksum=?;" % self.table,
                            (version, pkgarch, checksum))
@@ -62,6 +68,8 @@ class PRTable(object):
             except sqlite3.IntegrityError as exc:
                 logger.error(str(exc))
 
+            self.dirty = True
+
             data=self._execute("SELECT value FROM %s WHERE version=? AND pkgarch=? AND checksum=?;" % self.table,
                                (version, pkgarch, checksum))
             row=data.fetchone()
@@ -89,6 +97,8 @@ class PRTable(object):
                 logger.error(str(exc))
                 self.conn.rollback()
 
+            self.dirty = True
+
             data=self._execute("SELECT value FROM %s WHERE version=? AND pkgarch=? AND checksum=?;" % self.table,
                                (version, pkgarch, checksum))
             row=data.fetchone()
@@ -118,6 +128,8 @@ class PRTable(object):
             except sqlite3.IntegrityError as exc:
                 logger.error(str(exc))
 
+            self.dirty = True
+
             data = self._execute("SELECT value FROM %s WHERE version=? AND pkgarch=? AND checksum=?;" % self.table,
                            (version, pkgarch, checksum))
             row = data.fetchone()
@@ -139,6 +151,8 @@ class PRTable(object):
             except sqlite3.IntegrityError as exc:
                 logger.error(str(exc))
 
+        self.dirty = True
+
         data = self._execute("SELECT value FROM %s WHERE version=? AND pkgarch=? AND checksum=? AND value>=?;" % self.table,
                             (version,pkgarch,checksum,value))
         row=data.fetchone()
diff --git a/lib/prserv/serv.py b/lib/prserv/serv.py
index 1b08d59..6ab1097 100644
--- a/lib/prserv/serv.py
+++ b/lib/prserv/serv.py
@@ -122,6 +122,10 @@ class PRServer(SimpleXMLRPCServer):
     def work_forever(self,):
         self.quit = False
         self.timeout = 0.5
+        iter_count = 1
+        # With 60 iterations between syncs and a 0.5 second timeout between
+        # iterations, this will sync if dirty every ~30 seconds.
+        iterations_between_sync = 60
 
         logger.info("Started PRServer with DBfile: %s, IP: %s, PORT: %s, PID: %s" %
                      (self.dbfile, self.host, self.port, str(os.getpid())))
@@ -129,6 +133,9 @@ class PRServer(SimpleXMLRPCServer):
         self.handlerthread.start()
         while not self.quit:
             self.handle_request()
+            iter_count = (iter_count + 1) % iterations_between_sync
+            if iter_count == 0:
+                self.table.sync_if_dirty()
 
         self.table.sync()
         logger.info("PRServer: stopping...")
-- 
2.1.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-10-27 17:27 [PATCH] prserv: don't wait until exit to sync Ben Shelton
@ 2014-11-02 21:00 ` Burton, Ross
  2014-11-03 15:47   ` Ben Shelton
  0 siblings, 1 reply; 9+ messages in thread
From: Burton, Ross @ 2014-11-02 21:00 UTC (permalink / raw)
  To: Ben Shelton; +Cc: bitbake-devel

[-- Attachment #1: Type: text/plain, Size: 954 bytes --]

On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:

> In the commit 'prserv: Ensure data is committed', the PR server moved to
> only committing transactions to the database when the PR server is
> stopped.  This improves performance, but it means that if the machine
> running the PR server loses power unexpectedly or if the PR server
> process gets SIGKILL, the uncommitted package revision data is lost.
>
> To fix this issue, sync the database periodically, once per 30 seconds
> by default, if it has been marked as dirty.  To be safe, continue to
> sync the database at exit regardless of its status.
>

This appears to be causing random problems for me where bitbake will
timeout attempting to access the PR database, my hunch is that it's
blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
the overhead of committing? (assuming that sqlite isn't causing a full
filesystem sync).

Ross

[-- Attachment #2: Type: text/html, Size: 1435 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-02 21:00 ` Burton, Ross
@ 2014-11-03 15:47   ` Ben Shelton
  2014-11-03 17:30     ` Richard Purdie
  2014-11-04 14:13     ` Richard Purdie
  0 siblings, 2 replies; 9+ messages in thread
From: Ben Shelton @ 2014-11-03 15:47 UTC (permalink / raw)
  To: Burton, Ross; +Cc: bitbake-devel

On 11/02, Burton, Ross wrote:
> On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
> 
> > In the commit 'prserv: Ensure data is committed', the PR server moved to
> > only committing transactions to the database when the PR server is
> > stopped.  This improves performance, but it means that if the machine
> > running the PR server loses power unexpectedly or if the PR server
> > process gets SIGKILL, the uncommitted package revision data is lost.
> >
> > To fix this issue, sync the database periodically, once per 30 seconds
> > by default, if it has been marked as dirty.  To be safe, continue to
> > sync the database at exit regardless of its status.
> >
> 
> This appears to be causing random problems for me where bitbake will
> timeout attempting to access the PR database, my hunch is that it's
> blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
> the overhead of committing? (assuming that sqlite isn't causing a full
> filesystem sync).
> 
> Ross

After running a few large nightly builds, we've seen some issues with
this as well.  It looks like the issue is in the PR server itself, which
logs this error:

"OperationalError: cannot start a transaction within a transaction"

However, I'm confused as to why this is happening, since the only place
new transactions are being created is in the sync() function ("BEGIN
EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
thread.  Any ideas?

Would it make sense to revert the patch until we identify/fix the issue?

Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-03 15:47   ` Ben Shelton
@ 2014-11-03 17:30     ` Richard Purdie
  2014-11-03 18:27       ` Gary Thomas
  2014-11-04 14:13     ` Richard Purdie
  1 sibling, 1 reply; 9+ messages in thread
From: Richard Purdie @ 2014-11-03 17:30 UTC (permalink / raw)
  To: Ben Shelton; +Cc: bitbake-devel

On Mon, 2014-11-03 at 09:47 -0600, Ben Shelton wrote:
> On 11/02, Burton, Ross wrote:
> > On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
> > 
> > > In the commit 'prserv: Ensure data is committed', the PR server moved to
> > > only committing transactions to the database when the PR server is
> > > stopped.  This improves performance, but it means that if the machine
> > > running the PR server loses power unexpectedly or if the PR server
> > > process gets SIGKILL, the uncommitted package revision data is lost.
> > >
> > > To fix this issue, sync the database periodically, once per 30 seconds
> > > by default, if it has been marked as dirty.  To be safe, continue to
> > > sync the database at exit regardless of its status.
> > >
> > 
> > This appears to be causing random problems for me where bitbake will
> > timeout attempting to access the PR database, my hunch is that it's
> > blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
> > the overhead of committing? (assuming that sqlite isn't causing a full
> > filesystem sync).
> > 
> > Ross
> 
> After running a few large nightly builds, we've seen some issues with
> this as well.  It looks like the issue is in the PR server itself, which
> logs this error:
> 
> "OperationalError: cannot start a transaction within a transaction"
> 
> However, I'm confused as to why this is happening, since the only place
> new transactions are being created is in the sync() function ("BEGIN
> EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
> thread.  Any ideas?

Did the commit() fail and therefore there was already an transaction
open? It leads to another quesiton of why the commit would fail (timeout
maybe?).

> Would it make sense to revert the patch until we identify/fix the issue?

You have flagged a valid issue that I would like to get to the bottom of
so perhaps not quite yet.

I'm wondering if we can have some in memory copy of the table which we
flush to disk in a separate thread which wouldn't influence the PR
service request responses but its a horrible idea to workaround what
seems like a fundamental problem in sqlite :/.

Cheers,

Richard





^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-03 17:30     ` Richard Purdie
@ 2014-11-03 18:27       ` Gary Thomas
  2014-11-04  8:42         ` Richard Purdie
  0 siblings, 1 reply; 9+ messages in thread
From: Gary Thomas @ 2014-11-03 18:27 UTC (permalink / raw)
  To: bitbake-devel

On 2014-11-03 10:30, Richard Purdie wrote:
> On Mon, 2014-11-03 at 09:47 -0600, Ben Shelton wrote:
>> On 11/02, Burton, Ross wrote:
>>> On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
>>>
>>>> In the commit 'prserv: Ensure data is committed', the PR server moved to
>>>> only committing transactions to the database when the PR server is
>>>> stopped.  This improves performance, but it means that if the machine
>>>> running the PR server loses power unexpectedly or if the PR server
>>>> process gets SIGKILL, the uncommitted package revision data is lost.
>>>>
>>>> To fix this issue, sync the database periodically, once per 30 seconds
>>>> by default, if it has been marked as dirty.  To be safe, continue to
>>>> sync the database at exit regardless of its status.
>>>>
>>>
>>> This appears to be causing random problems for me where bitbake will
>>> timeout attempting to access the PR database, my hunch is that it's
>>> blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
>>> the overhead of committing? (assuming that sqlite isn't causing a full
>>> filesystem sync).
>>>
>>> Ross
>>
>> After running a few large nightly builds, we've seen some issues with
>> this as well.  It looks like the issue is in the PR server itself, which
>> logs this error:
>>
>> "OperationalError: cannot start a transaction within a transaction"
>>
>> However, I'm confused as to why this is happening, since the only place
>> new transactions are being created is in the sync() function ("BEGIN
>> EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
>> thread.  Any ideas?
>
> Did the commit() fail and therefore there was already an transaction
> open? It leads to another quesiton of why the commit would fail (timeout
> maybe?).
>
>> Would it make sense to revert the patch until we identify/fix the issue?
>
> You have flagged a valid issue that I would like to get to the bottom of
> so perhaps not quite yet.
>
> I'm wondering if we can have some in memory copy of the table which we
> flush to disk in a separate thread which wouldn't influence the PR
> service request responses but its a horrible idea to workaround what
> seems like a fundamental problem in sqlite :/.

I just got this error:
ERROR: Can NOT get PRAUTO from remote PR service
ERROR: Function failed: package_get_auto_pr
ERROR: Logfile of failure stored in: /home/local/rpi-latest_2014-10-30/tmp/work/armv6-vfp-amltd-linux-gnueabi/usbutils/007-r0/temp/log.do_package.13260
ERROR: Task 3204 (/home/local/poky-latest/meta/recipes-bsp/usbutils/usbutils_007.bb, do_package) failed with exit code '1'

Is it the same as what's being discussed above?  Where can I
look for more info on what happened?

n.b. I just restarted my build and it seems happy to carry on
where it left off.

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-03 18:27       ` Gary Thomas
@ 2014-11-04  8:42         ` Richard Purdie
  2014-11-04 10:51           ` Richard Purdie
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Purdie @ 2014-11-04  8:42 UTC (permalink / raw)
  To: Gary Thomas; +Cc: bitbake-devel

On Mon, 2014-11-03 at 11:27 -0700, Gary Thomas wrote:
> On 2014-11-03 10:30, Richard Purdie wrote:
> > On Mon, 2014-11-03 at 09:47 -0600, Ben Shelton wrote:
> >> On 11/02, Burton, Ross wrote:
> >>> On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
> >>>
> >>>> In the commit 'prserv: Ensure data is committed', the PR server moved to
> >>>> only committing transactions to the database when the PR server is
> >>>> stopped.  This improves performance, but it means that if the machine
> >>>> running the PR server loses power unexpectedly or if the PR server
> >>>> process gets SIGKILL, the uncommitted package revision data is lost.
> >>>>
> >>>> To fix this issue, sync the database periodically, once per 30 seconds
> >>>> by default, if it has been marked as dirty.  To be safe, continue to
> >>>> sync the database at exit regardless of its status.
> >>>>
> >>>
> >>> This appears to be causing random problems for me where bitbake will
> >>> timeout attempting to access the PR database, my hunch is that it's
> >>> blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
> >>> the overhead of committing? (assuming that sqlite isn't causing a full
> >>> filesystem sync).
> >>>
> >>> Ross
> >>
> >> After running a few large nightly builds, we've seen some issues with
> >> this as well.  It looks like the issue is in the PR server itself, which
> >> logs this error:
> >>
> >> "OperationalError: cannot start a transaction within a transaction"
> >>
> >> However, I'm confused as to why this is happening, since the only place
> >> new transactions are being created is in the sync() function ("BEGIN
> >> EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
> >> thread.  Any ideas?
> >
> > Did the commit() fail and therefore there was already an transaction
> > open? It leads to another quesiton of why the commit would fail (timeout
> > maybe?).
> >
> >> Would it make sense to revert the patch until we identify/fix the issue?
> >
> > You have flagged a valid issue that I would like to get to the bottom of
> > so perhaps not quite yet.
> >
> > I'm wondering if we can have some in memory copy of the table which we
> > flush to disk in a separate thread which wouldn't influence the PR
> > service request responses but its a horrible idea to workaround what
> > seems like a fundamental problem in sqlite :/.
> 
> I just got this error:
> ERROR: Can NOT get PRAUTO from remote PR service
> ERROR: Function failed: package_get_auto_pr
> ERROR: Logfile of failure stored in: /home/local/rpi-latest_2014-10-30/tmp/work/armv6-vfp-amltd-linux-gnueabi/usbutils/007-r0/temp/log.do_package.13260
> ERROR: Task 3204 (/home/local/poky-latest/meta/recipes-bsp/usbutils/usbutils_007.bb, do_package) failed with exit code '1'
> 
> Is it the same as what's being discussed above?

Yes.

>   Where can I
> look for more info on what happened?

We're still figuring out what is going on but it is roughly that:

a) The build generates a ton of IO
b) That IO builds up into a queue
c) The PR service decides it needs to sync to disk
d) The PR service hits an fsync() of some kind in sqlite whilst writing 
e) The PR service is blocked for its clients until the sync() finishes
f) Connections to the PR service timeout.

It would be nice if we could write the sqlite data in a separate thread
whilst the readers continue. There is an asynchronous module but its
deprecated:

http://www.sqlite.org/asyncvfs.html

WAL is recommended instead:

http://www.sqlite.org/wal.html

so we probably need to look at that.

> n.b. I just restarted my build and it seems happy to carry on
> where it left off.

No data is lost and this is a transient issue.

Why not revert the patch? The issue is that the data in the PR service
whilst in memory, *never* makes it to disk until process exit time. This
is bad if your build server loses power for example. I would therefore
like to try and fix this rather then revert.

Cheers,

Richard



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-04  8:42         ` Richard Purdie
@ 2014-11-04 10:51           ` Richard Purdie
  2014-11-04 13:10             ` Burton, Ross
  0 siblings, 1 reply; 9+ messages in thread
From: Richard Purdie @ 2014-11-04 10:51 UTC (permalink / raw)
  To: Gary Thomas; +Cc: bitbake-devel

On Tue, 2014-11-04 at 08:42 +0000, Richard Purdie wrote:
> We're still figuring out what is going on but it is roughly that:
> 
> a) The build generates a ton of IO
> b) That IO builds up into a queue
> c) The PR service decides it needs to sync to disk
> d) The PR service hits an fsync() of some kind in sqlite whilst writing 
> e) The PR service is blocked for its clients until the sync() finishes
> f) Connections to the PR service timeout.
> 
> It would be nice if we could write the sqlite data in a separate thread
> whilst the readers continue. There is an asynchronous module but its
> deprecated:
> 
> http://www.sqlite.org/asyncvfs.html
> 
> WAL is recommended instead:
> 
> http://www.sqlite.org/wal.html
> 
> so we probably need to look at that.

FWIW,

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=df0dabbf90f3da7153295adb83e51c41ae4d9375

is a patch Ross is kindly testing for me...

Cheers,

Richard



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-04 10:51           ` Richard Purdie
@ 2014-11-04 13:10             ` Burton, Ross
  0 siblings, 0 replies; 9+ messages in thread
From: Burton, Ross @ 2014-11-04 13:10 UTC (permalink / raw)
  To: Richard Purdie; +Cc: bitbake-devel, Gary Thomas

[-- Attachment #1: Type: text/plain, Size: 376 bytes --]

On 4 November 2014 10:51, Richard Purdie <richard.purdie@linuxfoundation.org
> wrote:

> FWIW,
>
>
> http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=df0dabbf90f3da7153295adb83e51c41ae4d9375
>
> is a patch Ross is kindly testing for me...
>

This patch appears to be causing the error less frequently, but I'm still
hitting it.

Ross

[-- Attachment #2: Type: text/html, Size: 956 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] prserv: don't wait until exit to sync
  2014-11-03 15:47   ` Ben Shelton
  2014-11-03 17:30     ` Richard Purdie
@ 2014-11-04 14:13     ` Richard Purdie
  1 sibling, 0 replies; 9+ messages in thread
From: Richard Purdie @ 2014-11-04 14:13 UTC (permalink / raw)
  To: Ben Shelton; +Cc: bitbake-devel

On Mon, 2014-11-03 at 09:47 -0600, Ben Shelton wrote:
> On 11/02, Burton, Ross wrote:
> > On 27 October 2014 17:27, Ben Shelton <ben.shelton@ni.com> wrote:
> > 
> > > In the commit 'prserv: Ensure data is committed', the PR server moved to
> > > only committing transactions to the database when the PR server is
> > > stopped.  This improves performance, but it means that if the machine
> > > running the PR server loses power unexpectedly or if the PR server
> > > process gets SIGKILL, the uncommitted package revision data is lost.
> > >
> > > To fix this issue, sync the database periodically, once per 30 seconds
> > > by default, if it has been marked as dirty.  To be safe, continue to
> > > sync the database at exit regardless of its status.
> > >
> > 
> > This appears to be causing random problems for me where bitbake will
> > timeout attempting to access the PR database, my hunch is that it's
> > blocking on disk I/O.  Are there any tricks we can do with sqlite to reduce
> > the overhead of committing? (assuming that sqlite isn't causing a full
> > filesystem sync).
> > 
> > Ross
> 
> After running a few large nightly builds, we've seen some issues with
> this as well.  It looks like the issue is in the PR server itself, which
> logs this error:
> 
> "OperationalError: cannot start a transaction within a transaction"
> 
> However, I'm confused as to why this is happening, since the only place
> new transactions are being created is in the sync() function ("BEGIN
> EXCLUSIVE TRANSACTION"), and AFAIK that's only called by a single
> thread.  Any ideas?

I just posted a patch for this. Any SQL transaction will start a new one
if one wasn't already in progress. This was therefore a race between two
threads. Whether it fixes all the problems remains to be seen, WAL mode
may still be a good idea as it does perform better for our use case.

Cheers,

Richard



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-11-04 14:14 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-27 17:27 [PATCH] prserv: don't wait until exit to sync Ben Shelton
2014-11-02 21:00 ` Burton, Ross
2014-11-03 15:47   ` Ben Shelton
2014-11-03 17:30     ` Richard Purdie
2014-11-03 18:27       ` Gary Thomas
2014-11-04  8:42         ` Richard Purdie
2014-11-04 10:51           ` Richard Purdie
2014-11-04 13:10             ` Burton, Ross
2014-11-04 14:13     ` Richard Purdie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.