
Troubleshooting the Salt Master
Running in the Foreground
A great deal of information is available via the debug logging system. If you are having issues with minions connecting or not starting, run the master in the foreground:
# salt-master -l debug
Anyone wanting to run Salt daemons via a process supervisor such as monit, runit, or supervisord should omit the -d argument to the daemons and run them in the foreground.
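For example, a minimal supervisord program entry might look like the following sketch (the program name and the path to salt-master are assumptions; adjust them for your installation):
[program:salt-master]
command=/usr/bin/salt-master -l info
autostart=true
autorestart=true
user=root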
What Ports does the Master Need Open?
For the master, TCP ports 4505 and 4506 need to be open. If you've put both your Salt master and minion in debug mode and don't see an acknowledgment that your minion has connected, it could very well be a firewall interfering with the connection. See our firewall configuration page for help opening the firewall on various platforms.
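For example, on a Linux master using iptables, rules along these lines would allow minion connections (a sketch only; prefer the platform-specific instructions on the firewall configuration page):
# iptables -A INPUT -p tcp --dport 4505 -j ACCEPT
# iptables -A INPUT -p tcp --dport 4506 -j ACCEPT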
If you've opened the correct TCP ports and still aren't seeing connections, check that no additional access control system such as SELinux or AppArmor is blocking Salt.
Too many open files
The salt-master needs at least 2 sockets per host that connects to it: one for the Publisher and one for the response port. Thus, large installations may, upon scaling up the number of minions accessing a given master, encounter:
12:45:29,289 [salt.master ][INFO ] Starting Salt worker process 38
Too many open files
sock != -1 (tcp_listener.cpp:335)
The solution to this would be to check the number of files allowed to be opened by the user running salt-master (root by default):
[root@salt-master ~]# ulimit -n
1024
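To see how many minions this limit must accommodate, you can count the accepted minion keys on the master, for example (the output of salt-key -l acc includes a one-line header, so subtract one from the count):
[root@salt-master ~]# salt-key -l acc | wc -l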
If the value shown by ulimit -n is not at least twice the number of minions, then it will need to be raised. For example, in an environment with 1800 minions, the nofile limit should be set to no less than 3600. This can be done by creating the file /etc/security/limits.d/99-salt.conf, with the following contents:
root hard nofile 4096
root soft nofile 4096
Replace root with the user under which the master runs, if different.
If your master does not have an /etc/security/limits.d directory, the lines can simply be appended to /etc/security/limits.conf.
As with any change to resource limits, it is best to stay logged into your current shell and open another shell to run ulimit -n again and verify that the changes were applied correctly. Additionally, if your master is running upstart, it may be necessary to specify the nofile limit in /etc/default/salt-master if upstart isn't respecting your resource limits:
limit nofile 4096 4096
Note
The above is simply an example of how to set these values, and you may wish to increase them even further if your Salt master is doing more than just running Salt.
Salt Master Stops Responding
There are known bugs with ZeroMQ versions less than 2.1.11 which can cause the Salt master to not respond properly. If you're running a ZeroMQ version greater than or equal to 2.1.9, you can work around the bug by setting the sysctls net.core.rmem_max and net.core.wmem_max to 16777216. Next, set the third field in net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to at least 16777216.
You can do it manually with something like:
# echo 16777216 > /proc/sys/net/core/rmem_max
# echo 16777216 > /proc/sys/net/core/wmem_max
# echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
# echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
Or with the following Salt state:
net.core.rmem_max:
  sysctl:
    - present
    - value: 16777216

net.core.wmem_max:
  sysctl:
    - present
    - value: 16777216

net.ipv4.tcp_rmem:
  sysctl:
    - present
    - value: 4096 87380 16777216

net.ipv4.tcp_wmem:
  sysctl:
    - present
    - value: 4096 87380 16777216
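If you made the changes manually with echo rather than via the Salt state, the same values can also be persisted across reboots in a sysctl configuration file (the file name below is only an example), then loaded with sysctl --system:
# /etc/sysctl.d/99-salt.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216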
Live Python Debug Output
If the master seems to be unresponsive, a SIGUSR1 can be passed to the salt-master threads to display what piece of code is executing. This debug information can be invaluable in tracking down bugs.
To pass a SIGUSR1 to the master, first make sure the master is running in the foreground. Stop the service if it is running as a daemon, and start it in the foreground like so:
# salt-master -l debug
Then pass the signal to the master when it seems to be unresponsive:
# killall -SIGUSR1 salt-master
When filing an issue or sending questions to the mailing list for a problem with an unresponsive daemon, be sure to include this information if possible.
Live Salt-Master Profiling
When faced with performance problems, one can turn on master process profiling by sending it SIGUSR2.
# killall -SIGUSR2 salt-master
This will activate the yappi profiler inside the salt-master code. After some time, send SIGUSR2 again to stop profiling and save the results to a file. If run in the foreground, salt-master will report the filename for the results, which are usually located under /tmp on Unix-based OSes and c:\temp on Windows.
Make sure you have yappi installed.
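If it is not already present, yappi can usually be installed into the Python environment that runs salt-master with pip, for example:
# pip install yappi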
Results can then be analyzed with kcachegrind or a similar tool.
On Windows, in the absence of kcachegrind, a simple file-based workflow to create profiling graphs could use gprof2dot, graphviz and this batch file:
::
:: Converts callgrind* profiler output to *.pdf, via *.dot
::
@echo off
del *.dot.pdf
for /r %%f in (callgrind*) do (
    echo "%%f"
    gprof2dot.exe -f callgrind --show-samples "%%f" -o "%%f.dot"
    dot.exe "%%f.dot" -Tpdf -O
    del "%%f.dot"
)
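On Unix-based systems the same conversion can be done from the shell; as a sketch, assuming gprof2dot and graphviz are installed and using a placeholder for the results filename reported by salt-master:
# gprof2dot -f callgrind /tmp/callgrind.salt-master.profile | dot -Tpdf -o /tmp/salt-master-profile.pdf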
Commands Time Out or Do Not Return Output
Depending on your OS (this is most common on Ubuntu due to apt-get), you may sometimes encounter situations where a state.apply, or other long-running commands, do not return output.
By default the timeout is set to 5 seconds. The timeout value can easily be increased by modifying the timeout line within your /etc/salt/master configuration file.
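For example, to raise the default timeout from 5 to 30 seconds, set the following in /etc/salt/master and restart the salt-master service:
timeout: 30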
Having keys accepted for Salt minions that no longer exist or are not reachable also increases the possibility of timeouts, since the Salt master waits for those systems to return command results.
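Such stale keys can be listed and removed with salt-key, for example (the minion ID shown is a placeholder):
# salt-key -l acc
# salt-key -d retired-minion-01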
Passing the -c Option to Salt Returns a Permissions Error
Using the -c option with the Salt command modifies the configuration directory. When the configuration file is read it will still base data off of the root_dir setting. This can result in unintended behavior if you are expecting files such as /etc/salt/pki to be pulled from the location specified with -c. Modify the root_dir setting to address this behavior.
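As a sketch, if an alternate configuration tree lives under /opt/salt (a hypothetical path), set root_dir in the master configuration found in that tree:
root_dir: /opt/salt
Then pass the matching configuration directory on the command line:
# salt -c /opt/salt/etc/salt '*' test.ping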
Salt Master Doesn't Return Anything While Running Jobs
When a command being run via Salt takes a very long time to return (package installations, certain scripts, etc.) the master may drop you back to the shell. In most situations the job is still running but Salt has exceeded the set timeout before returning. Querying the job queue will provide the data of the job but is inconvenient. This can be resolved by either manually using the -t option to set a longer timeout when running commands (by default it is 5 seconds), or by modifying the master configuration file /etc/salt/master and setting the timeout value to change the default timeout for all commands, and then restarting the salt-master service.
If a state.apply run takes too long, you can find a bottleneck by adding the --out=profile option.
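For example, a slow command can be given a 300-second timeout on the command line, and a slow highstate can be profiled locally (the target is a placeholder):
# salt -t 300 '*' state.apply
# salt-call --local --out=profile state.apply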
Salt Master Auth Flooding
In large installations, care must be taken not to overwhelm the master with authentication requests. Several options can be set on the master which mitigate the chances of an authentication flood from causing an interruption in service; a configuration sketch follows the list below.
Note
recon_default:
    The average number of seconds to wait between reconnection attempts.
recon_max:
    The maximum number of seconds to wait between reconnection attempts.
recon_randomize:
    A flag to indicate whether the recon_default value should be randomized.
acceptance_wait_time:
    The number of seconds to wait for a reply to each authentication request.
random_reauth_delay:
    The range of seconds across which the minions should attempt to randomize authentication attempts.
auth_timeout:
    The total time to wait for the authentication process to complete, regardless of the number of attempts.
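For illustration only, such settings are plain YAML keys in the Salt configuration files; the values below are arbitrary examples rather than recommendations, and the configuration reference should be consulted for where each option belongs and for its units and defaults:
recon_default: 1000
recon_max: 59000
recon_randomize: True
random_reauth_delay: 60
auth_timeout: 60
acceptance_wait_time: 10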
Running states locally
To debug the states, you can run them locally with salt-call:
salt-call -l trace --local state.highstate
The top.sls file is used to map what SLS modules get loaded onto what minions via the state system.
It is located in the directory defined by the file_roots variable of the Salt master configuration file, which is found in CONFIG_DIR/master, normally /etc/salt/master. The default configuration for file_roots is:
file_roots:
  base:
    - /srv/salt
So the top file defaults to the location /srv/salt/top.sls.
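A minimal top file that maps every minion to a single SLS might look like this (the common SLS name is hypothetical):
base:
  '*':
    - common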
Salt Master Umask
The salt master uses a cache to track jobs as they are published and as returns come back. The recommended umask for a salt-master is 022, which is the default for most users on a system. Incorrect umasks can result in permission-denied errors when the master tries to access files in its cache.
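The umask of the user running salt-master can be checked from that user's shell, for example:
[root@salt-master ~]# umask
0022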