Reliable backups for high-activity pad
Hi,
We run a production instance of etherpad-lite 1.6.1 for a nonprofit. Apart from normal use with many pads, it hosts one specific pad with a lot of activity and history: it serves as a board of things to do, in progress, and recently done. Several members of the org update it many times every day, and it is hardly ever deleted and recreated.
This makes it a pad that comes to have tens of thousands of revisions after a few months. I’m quite sure this is not really what etherpad-lite was designed for, but “unfortunately” the org members like this way of working very much and are very used to it and we’ve still not found a better tool.
We’ve already had several catastrophes with this specific pad due to kernel panics, unclean shutdowns, mysql restarts (mostly for security upgrades) without stopping etherpad first, and corrupt changesets that leave the high-activity pad unreadable (it fails client-side with Failed assertion: Invalid changeset (checkRep failed)). Other lower-activity pads on the instance seem to cope with those events quite nicely though. Additionally, some members sometimes have a faulty connection that causes their browser to reconnect very often, and I wonder whether each reconnect generates even more revisions to fade their author color. That’s a secondary problem, but if it’s the case, it makes the pad history grow even faster and, as I see it, increases the chances of failure.
Restore attempts for this pad usually include:
- calling for a member to not close his pad browser tab and copy-paste the html somewhere
- running checkPad.js / repairPad.js which takes ages due to the huge history and does not change the situation
- trying to get a proper backup using getText/getHTML API methods (calls always fail)
- trying to restore a working version using restoreRevision/copyPad API methods (both calls succeed but copyPad takes ages and just creates a 2nd nonworking pad)
- trying to call deletePad via the API (takes ages and fails)
- removing the pad rows from the database manually and recreating the pad from a copy-paste of the HTML, losing both history and authorship
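The API-based attempts above can be scripted with a hard timeout, so that a call that hangs on the huge history does not block the rest of a backup job. This is only a minimal sketch: the base URL, the API key, and the helper names `api_url` and `call_api` are placeholders of mine, and it assumes the HTTP API is reachable under `/api/1/` with the usual `apikey` and `padID` parameters.

```python
import json
import urllib.parse
import urllib.request


def api_url(base, method, api_key, version="1", **params):
    """Build an Etherpad HTTP API URL (base, key and version are placeholders)."""
    query = urllib.parse.urlencode({"apikey": api_key, **params})
    return f"{base.rstrip('/')}/api/{version}/{method}?{query}"


def call_api(url, timeout=30):
    """Call the API and return (ok, payload) instead of raising on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError) as exc:
        # Covers timeouts, refused connections, HTTP errors and invalid JSON.
        return False, str(exc)
    # A successful reply is expected to look like {"code": 0, "data": {...}}.
    return body.get("code") == 0, body.get("data") or body.get("message")


url = api_url("http://localhost:9001", "getHTML", "SECRET",
              padID="affaires-courantes")
print(url)
```

Passing the resulting `url` to `call_api(url, timeout=60)` would then give a clear success or failure within a minute instead of an open-ended wait.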
That situation led me to try to set up frequent backups of the whole instance. I’m actually doing hourly mysql dumps right now (at least for the last 24h). Unfortunately I discovered that restoring those backups also leads to a nonworking pad that fails with checkRep failed, which makes me believe that a mysql dump produces a faulty database image unless etherpad is stopped.
I would have used the API to make backups but after a few weeks/months of activity the API calls just take longer than the backup interval. And stopping the instance every hour to run mysqldump would be quite disruptive.
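For the dump itself, one thing that might be worth trying is mysqldump's `--single-transaction` option, which reads InnoDB tables from a single consistent snapshot without locking them. The sketch below only builds the command line; the database name, user, and output path are made up for illustration, and it assumes the tables are InnoDB and that credentials come from an option file. It would not capture any writes that etherpad still holds in memory at dump time, so it may not be sufficient on its own.

```python
import shlex


def mysqldump_command(database, user, out_path):
    """Build a mysqldump invocation that reads from one consistent snapshot."""
    return [
        "mysqldump",
        "--single-transaction",  # dump inside one transaction, no table locks
        "--quick",               # stream rows instead of buffering whole tables
        f"--user={user}",        # password is expected in an option file
        f"--result-file={out_path}",
        database,
    ]


# Hypothetical names; adjust to the real instance.
cmd = mysqldump_command("etherpad", "backup", "/var/backups/etherpad.sql")
print(shlex.join(cmd))
```

The list form can be handed directly to `subprocess.run(cmd, check=True)` from an hourly job.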
So here are my questions:
- is there a way to tell a running etherpad to flush everything to the database so that a mysqldump has a higher chance of being usable?
- is there a better way of backing up pads that can be automated and preserves authorship upon restore (history would be nice but is not mandatory here)?
- is there something I can do to “fix” the faulty mysql dumps?
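To narrow down what is wrong with a dump, a first check could be whether every revision up to the pad's head is actually present. The sketch below assumes the key/value layout I believe etherpad uses (one `pad:<padID>` record holding a `head` number, plus one `pad:<padID>:revs:<n>` key per revision); that layout and the function name `missing_revisions` are assumptions, not something confirmed in this thread. It only detects missing revisions, not revisions that exist but contain a corrupt changeset.

```python
import re


def missing_revisions(keys, pad_id, head):
    """Return the revision numbers in 0..head that have no key for pad_id."""
    pattern = re.compile(r"^pad:" + re.escape(pad_id) + r":revs:(\d+)$")
    present = {int(m.group(1)) for k in keys if (m := pattern.match(k))}
    return [rev for rev in range(head + 1) if rev not in present]


# Toy key list standing in for the keys extracted from a dump.
keys = [
    "pad:affaires-courantes",
    "pad:affaires-courantes:revs:0",
    "pad:affaires-courantes:revs:1",
    "pad:affaires-courantes:revs:3",
    "pad:other:revs:2",
]
print(missing_revisions(keys, "affaires-courantes", head=4))  # [2, 4]
```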
Here is a faulty mysqldump for reference. The high-activity pad ID is “affaires-courantes”. All our activity and pads are public so there’s no risk of disclosing personal/secret information here.
Thanks 😃
Issue Analytics
- Created 6 years ago
- Reactions:2
- Comments:13 (11 by maintainers)
Top GitHub Comments
As #3991 is merged, this gives us an awesome tool for recoveries, so I can go ahead and close this. If we get another report, we should be able to recover upon request, and the tools are now available to debug, diagnose, and recover.
As per #3991, I think that to do a restoration/rebuild you need these values, or else changeset ops won’t work. Once the merge is complete I can continue work on my script/branch, but it looks like any pads with revs(@100) edited before the merge is complete and in place won’t be able to be rebuilt, rendering both existing methods pointless.