Stability problem with a 1000-node Puppet server
Hi,
Thank you for the great job with puppetboard.
Since our environment reached about 1000 nodes, we frequently get the following error when trying to load the board:
On the PuppetDB side, the logs show this error:
ERROR [p.p.threadpool] Error processing command on thread cmd-proc-thread-23365
clojure.lang.ExceptionInfo: Value does not match schema: {:resources [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil {:events [{:status (not (instance? java.lang.String nil))} nil]} nil nil nil nil nil nil nil nil nil nil nil nil nil nil]}
at schema.core$validator$fn__2894.invoke(core.clj:155)
at schema.core$validate.invokeStatic(core.clj:164)
at schema.core$validate.invoke(core.clj:159)
at puppetlabs.puppetdb.command$store_report$fn__32755.invoke(command.clj:355)
at puppetlabs.puppetdb.command$store_report.invokeStatic(command.clj:354)
at puppetlabs.puppetdb.command$store_report.invoke(command.clj:353)
at puppetlabs.puppetdb.command$process_command_BANG_.invokeStatic(command.clj:389)
at puppetlabs.puppetdb.command$process_command_BANG_.invoke(command.clj:380)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_$fn__32863.invoke(command.clj:442)
at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32856.invoke(command.clj:424)
at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:423)
at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:421)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invokeStatic(command.clj:440)
at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invoke(command.clj:438)
at puppetlabs.puppetdb.command$process_cmdref$fn__32873.invoke(command.clj:505)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:101)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:101)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
at puppetlabs.puppetdb.command$process_cmdref.invokeStatic(command.clj:501)
at puppetlabs.puppetdb.command$process_cmdref.invoke(command.clj:480)
at puppetlabs.puppetdb.command$message_handler$fn__32881.invoke(command.clj:551)
at puppetlabs.puppetdb.threadpool$dochan$fn__32634$fn__32635.invoke(threadpool.clj:117)
at puppetlabs.puppetdb.threadpool$call_on_threadpool$fn__32629.invoke(threadpool.clj:95)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2017-10-20 10:38:47,059 WARN [p.p.q.engine] The event-counts entity is experimental and may be altered or removed in the future.
In production, the PuppetDB performance dashboard looks like this:
And in the apache2 logs:
[Fri Oct 20 09:08:26.546106 2017] [wsgi:error] [pid 29006] raise ReadTimeout(e, request=request)
[Fri Oct 20 09:08:26.546112 2017] [wsgi:error] [pid 29006] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
[Fri Oct 20 09:08:26.547887 2017] [wsgi:error] [pid 29006] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (3): localhost
[Fri Oct 20 09:08:51.138396 2017] [wsgi:error] [pid 29006] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
[Fri Oct 20 09:08:51.141567 2017] [wsgi:error] [pid 29006] ERROR:puppetboard.app:Exception on /FST/ [GET]
[Fri Oct 20 09:08:51.141580 2017] [wsgi:error] [pid 29006] Traceback (most recent call last):
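The "read timeout=20" in the log above corresponds to puppetboard's PUPPETDB_TIMEOUT setting. One mitigation while PuppetDB is slow is to raise it in settings.py; this is a sketch assuming the option names of puppetboard's default settings (verify against your installed version):

```python
# Excerpt of a puppetboard settings.py override.
# Assumption: option names match puppetboard's default_settings.py.
PUPPETDB_HOST = 'localhost'
PUPPETDB_PORT = 8080

# The apache log shows "read timeout=20"; raising this gives slow
# PuppetDB queries more time before puppetboard gives up.
PUPPETDB_TIMEOUT = 60
```

Raising the timeout only hides the symptom, of course; the underlying slow queries still need addressing.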
When the board “respawns”, it works fine, but I don’t understand why it sometimes fails for several minutes…
I need advice on how to fix this. Maybe reduce the number of nodes on the overview page? Is there a way to limit long listings to the xx most recent reports, or something like that?
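On the limiting idea: PuppetDB's v4 query API accepts `limit` and `order_by` URL parameters, so a client can ask for only the newest reports instead of everything. A minimal sketch with the standard library (host and port are assumptions matching the logs above; this is not how puppetboard itself queries, which goes through pypuppetdb):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def reports_url(limit=10, host="localhost", port=8080):
    """Build a /pdb/query/v4/reports URL asking only for the newest reports."""
    params = urlencode({
        "limit": limit,
        # PuppetDB's order_by parameter takes a JSON array of field/order pairs.
        "order_by": json.dumps([{"field": "end_time", "order": "desc"}]),
    })
    return f"http://{host}:{port}/pdb/query/v4/reports?{params}"

def latest_reports(limit=10, **kw):
    """Fetch the most recent reports from PuppetDB (requires a running server)."""
    with urlopen(reports_url(limit, **kw), timeout=20) as resp:
        return json.load(resp)
```

For example, `latest_reports(10)` would return at most ten reports, newest first, rather than the full 14-day backlog.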
Thanks for your help, Guillaume
Issue Analytics
- State:
- Created 6 years ago
- Comments:8 (6 by maintainers)
Top GitHub Comments
Thanks @guillaume-ferry, your issue and comments helped me a lot in tuning my PuppetDB’s Postgres!
Thanks for your answer.
Since our last discussion, I was fairly sure my puppetdb configuration was correct, so I assumed the problem lay deeper, probably in my PostgreSQL configuration.
This documentation gave me some clues that led me to tune my Postgres configuration, especially with the pgtune tool: http://pgfoundry.org/projects/pgtune/
Here is the list of parameters that pgtune advised me to tune:
This new configuration significantly improved database performance: queries take about half the time to execute, and puppetboard hasn’t hung in the last two weeks. The PuppetDB performance dashboard now looks like this:
However, the report tab still hangs with an internal server error in the 1000-node environment, as if the number of reports to process were too high. But the report-ttl setting in my puppetdb conf is set to 14d… Do I need to decrease this setting?
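For reference, report-ttl lives in the [database] section of puppetdb.conf; lowering it shrinks the report tables that the report tab has to scan, but old reports are only removed at the next garbage-collection run. A sketch with illustrative values (7d is an example, not a recommendation):

```ini
# [database] section of puppetdb.conf (illustrative values)
[database]
# Keep reports for 7 days instead of 14 (smaller report tables).
report-ttl = 7d
# How often, in minutes, PuppetDB runs garbage collection to purge
# expired reports and deactivated nodes.
gc-interval = 60
```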
In the test environment, with few nodes, the report tab works fine… I didn’t change the NORMAL_TABLE_COUNT setting. The settings.py looks like this:
More than 500 of the nodes are student computer-room machines that boot as and when needed… When many classrooms are in use, I suppose the students mostly boot those computers simultaneously. That should correspond to the spikes in the performance dashboard…