Stability problem with a 1000-node Puppet server
Hi,
Thank you for the great job with puppetboard.
Since our environment reached about 1000 nodes, we frequently get the following error when trying to load the board:
On the PuppetDB side, the logs show this error:
ERROR [p.p.threadpool] Error processing command on thread cmd-proc-thread-23365
clojure.lang.ExceptionInfo: Value does not match schema: {:resources [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil {:events [{:status (not (instance? java.lang.String nil))} nil]} nil nil nil nil nil nil nil nil nil nil nil nil nil nil]}
at schema.core$validator$fn__2894.invoke(core.clj:155)
at schema.core$validate.invokeStatic(core.clj:164)
at schema.core$validate.invoke(core.clj:159)
at puppetlabs.puppetdb.command$store_report$fn__32755.invoke(command.clj:355)
at puppetlabs.puppetdb.command$store_report.invokeStatic(command.clj:354)
at puppetlabs.puppetdb.command$store_report.invoke(command.clj:353)
at puppetlabs.puppetdb.command$process_command_BANG_.invokeStatic(command.clj:389)
at puppetlabs.puppetdb.command$process_command_BANG_.invoke(command.clj:380)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_$fn__32863.invoke(command.clj:442)
at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32856.invoke(command.clj:424)
at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:423)
at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:421)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invokeStatic(command.clj:440)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invoke(command.clj:438)
at puppetlabs.puppetdb.command$process_cmdref$fn__32873.invoke(command.clj:505)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:101)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:101)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
at puppetlabs.puppetdb.command$process_cmdref.invokeStatic(command.clj:501)
at puppetlabs.puppetdb.command$process_cmdref.invoke(command.clj:480)
at puppetlabs.puppetdb.command$message_handler$fn__32881.invoke(command.clj:551)
at puppetlabs.puppetdb.threadpool$dochan$fn__32634$fn__32635.invoke(threadpool.clj:117)
at puppetlabs.puppetdb.threadpool$call_on_threadpool$fn__32629.invoke(threadpool.clj:95)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2017-10-20 10:38:47,059 WARN [p.p.q.engine] The event-counts entity is experimental and may be altered or removed in the future.
In production, the PuppetDB performance dashboard looks like this:
And in the apache2 logs:
[Fri Oct 20 09:08:26.546106 2017] [wsgi:error] [pid 29006] raise ReadTimeout(e, request=request)
[Fri Oct 20 09:08:26.546112 2017] [wsgi:error] [pid 29006] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
[Fri Oct 20 09:08:26.547887 2017] [wsgi:error] [pid 29006] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (3): localhost
[Fri Oct 20 09:08:51.138396 2017] [wsgi:error] [pid 29006] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
[Fri Oct 20 09:08:51.141567 2017] [wsgi:error] [pid 29006] ERROR:puppetboard.app:Exception on /FST/ [GET]
[Fri Oct 20 09:08:51.141580 2017] [wsgi:error] [pid 29006] Traceback (most recent call last):
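The "read timeout=20" in the log above corresponds to puppetboard's PUPPETDB_TIMEOUT setting. One mitigation while PuppetDB is slow is to raise it in settings.py; this is a sketch assuming the option names of puppetboard's default settings (verify against your installed version):

```python
# Excerpt of a puppetboard settings.py override.
# Assumption: option names match puppetboard's default_settings.py.
PUPPETDB_HOST = 'localhost'
PUPPETDB_PORT = 8080

# The apache log shows "read timeout=20"; raising this gives slow
# PuppetDB queries more time before puppetboard gives up.
PUPPETDB_TIMEOUT = 60
```

Raising the timeout only hides the symptom, of course; the underlying slow queries still need addressing.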
When the board “respawns”, it works fine, but I don’t understand why it sometimes fails for several minutes…
I need advice on how to fix this. Maybe reduce the number of nodes on the overview page? Is there a way to limit long listings to the xx most recent reports, or something like that?
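On the limiting idea: PuppetDB's v4 query API accepts `limit` and `order_by` URL parameters, so a client can ask for only the newest reports instead of everything. A minimal sketch with the standard library (host and port are assumptions matching the logs above; this is not how puppetboard itself queries, which goes through pypuppetdb):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def reports_url(limit=10, host="localhost", port=8080):
    """Build a /pdb/query/v4/reports URL asking only for the newest reports."""
    params = urlencode({
        "limit": limit,
        # PuppetDB's order_by parameter takes a JSON array of field/order pairs.
        "order_by": json.dumps([{"field": "end_time", "order": "desc"}]),
    })
    return f"http://{host}:{port}/pdb/query/v4/reports?{params}"

def latest_reports(limit=10, **kw):
    """Fetch the most recent reports from PuppetDB (requires a running server)."""
    with urlopen(reports_url(limit, **kw), timeout=20) as resp:
        return json.load(resp)
```

For example, `latest_reports(10)` would return at most ten reports, newest first, rather than the full 14-day backlog.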
Thanks for your help, Guillaume
Issue Analytics
- State:
- Created 6 years ago
- Comments:8 (6 by maintainers)
Top GitHub Comments
Thanks @guillaume-ferry, your issue and comments helped me a lot in tuning my PuppetDB’s Postgres!
Thanks for your answer.
Since our last discussion, I was fairly sure my puppetdb configuration was correct, so I assumed the problem lay deeper, probably in my PostgreSQL configuration.
This documentation gave me some clues that led me to tune my Postgres configuration, especially with the pgtune tool: http://pgfoundry.org/projects/pgtune/
Here is the list of parameters that pgtune advised me to tune:
This new configuration significantly improved database performance: queries take about half the time to execute, and puppetboard hasn’t hung in the last two weeks. The PuppetDB performance dashboard now looks like this:
However, the report tab still hangs with an internal server error in the 1000-node environment, as if the number of reports to process were too high. But the report-ttl setting in my puppetdb conf is set to 14d… Do I need to decrease this setting?
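For reference, report-ttl lives in the [database] section of puppetdb.conf; lowering it shrinks the report tables that the report tab has to scan, but old reports are only removed at the next garbage-collection run. A sketch with illustrative values (7d is an example, not a recommendation):

```ini
# [database] section of puppetdb.conf (illustrative values)
[database]
# Keep reports for 7 days instead of 14 (smaller report tables).
report-ttl = 7d
# How often, in minutes, PuppetDB runs garbage collection to purge
# expired reports and deactivated nodes.
gc-interval = 60
```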
In the test environment, with few nodes, the report tab works fine… I didn’t change the NORMAL_TABLE_COUNT setting. The settings.py looks like this:
More than 500 of the nodes are student computer-room machines that boot as and when needed… When many classrooms are in use, I suppose the students mostly boot those computers simultaneously. That should correspond to the spikes in the performance dashboard…