Monit is a monitoring daemon process that will check every 60 seconds on configured services, pids, ports, host information, or pretty much anything you want and also take a course of action on failure/change detected.
- I decided that CFEngine while great at what it does, was too slow at restarting processes (especially as we move to a more HA environment) so I needed to find something that would catch failures and respond faster than 5 minutes. Monit appears to be the answer. Step 2 of this roll out is to have CFEngine monitor Monit so we can answer the “who watches the watchmen?” concern.
Every machine monit is installed on you can go to https://machinename.example.com:2812 to view status information for that individual machine. I have setup a very restricrtive LDAP group that only allows admin to access this page.
I have setup M/Monit which is the centralized version on monit.
Each machine is configured to send all communication over SSL to M/Monit(and from M/Monit back to machine).
I like to do a yum install to setup the services correctly and then download/install the latest binary from their website as the latest version has a lot more options and flexibility than what is available in the regular CentOS repos. Monit-5.12.1 has been tested with and runs successfully on Centos 5, 6 and 7.
I created a script to do most of the heavy lifting for me and grab it from our internal RPM site.
cd /opt; wget http://rpm.example.com/kickstart/mapserver/monit-install chmod 0700 monit-install ./monit-install
yum install monit ftp cd /opt; wget http://mmonit.com/monit/dist/binary/5.12.1/monit-5.12.1-linux-x64.tar.gz; Or wget http://rpm.zedxinc.com/ZedX/monit-5.12.1-linux-x64.tar.gz (for private 10.0.25.X machines)
I have a standard monitrc, pam authentication module, and certificate for ssl traffic that I download from a local ftp site.
echo " #!/bin/bash cd /opt HOST=10.0.0.K USER=kickstart PASSWD=kickstart ftp -i -n -v 10.0.0.K << EOT binary user kickstart kickstart cd monit mget monitrc #mget pam-monit mget pam-monit-centos6 mget mmonit.pem bye EOT ">> /opt/ftp.sh chmod 755 /opt/ftp.sh /opt/ftp.sh; cp -rf /opt/monitrc /etc/ ; chmod 0700 /etc/monitrc ; cp -rf /opt/pam-moni* /etc/pam.d/monit; mv /opt/mmonit.pem /etc/certmonger/; chmod 0700 /etc/certmonger/mmonit.pem;
Most everything is set up properly in /etc/monitrc. If the machine is in our public network then we must manually change the m/monit collector IP address so that data is not going across separate networks. Monit is setup to have a 240 second start delay so that when a machine boots up there are not service start conflicts, this is very important.
after any changes to the config file or adding a new file in /etc/monit.d/, you must check syntax and then reload for the new config to be read.
monit.d]# monit -t Control file syntax OK monit.d]# monit reload Reinitializing monit daemon
See the OpenSSL page to understand how we configured the monit.pem to work with M/Monit with SSL
I will paste below the important bits of the monitrc file, make sure these are set on each server
set daemon 60 with start delay 240 set mmonit https://monit:firstname.lastname@example.org.[public IP]:8443/collector #set mmonit https://monit:email@example.com.[[Private IP]:8443/collector # change mmonit collecter depending on network set httpd port 2812 and ssl enable pemfile /etc/certmonger/mmonit.pem use address 10.0.0.[localhost] # use primary interface host specific allow 10.0.1.[m/monit server] allow @monitadmins allow @monitrc readonly
Permissions on this file and the mmonit.pem must be 0700 or process will fail to start and also bad people could read our information.
I place individual config files in /etc/monit.d/ for separate services and monitoring configs. Syntax does change between CentOS 6 and CentOS 7 (blame initd vs systemd).
Always run “monit -t” to check syntax is correct before “monit reload” and adding new configs into the mix. Once these launch if they are not properly configured you could end up with unintended consequences (killing production httpd, for example, not that that ever happened or anything).
check network eth0 with interface eth0 if saturation > 95% then alert check network eth2 with interface eth2 if saturation > 95% then alert #OR if CentOS 7 check network ens160 with interface ens160 if saturation > 95% then alert check network ens192 with interface ens192 if saturation > 95% then alert
check filesystem rootfs with path / if space usage > 95% then alert
check process sshd with pidfile /var/run/sshd.pid start program "/usr/bin/systemctl start sshd" stop program "/usr/bin/systemctl start sshd" restart program "/usr/bin/systemctl restart sshd" if failed port 22 protocol ssh then restart check process sshd with pidfile /var/run/sshd.pid start program "/sbin/service sshd start" stop program "/sbin/service sshd stop" if failed port 22 protocol ssh then restart
check process sssd with pidfile /var/run/sssd.pid start program "/sbin/service sssd start" stop program "/sbin/service sssd stop" if changed pid then restart
check process postgresql with pidfile /var/lib/pgsql/data/postmaster.pid start program "/etc/init.d/postgresql start" stop program "/etc/init.d/postgresql stop" if changed pid then restart
check process httpd with pidfile /var/run/httpd/httpd.pid start program "/sbin/service httpd start" stop program "/sbin/service https stop" restart program "/sbin/service httpd restart" if failed host "IP" port 80 then restart check process httpd with pidfile /var/run/httpd/httpd.pid start program "/usr/bin/systemctl start httpd" stop program "/usr/bin/systemctl stop httpd" restart program "/usr/bin/systemctl restart httpd" if failed port 80 protocol ssh then restart
check process mysql with pidfile /var/run/mysqld/mysqld.pid start program "/sbin/service mysqld start" stop program "/sbin/service mysqld stop" restart program "/sbin/service mysqld restart" if failed unix /var/lib/mysql/mysql.sock then restart
Remember in centos 7 mysql is now mariadb
check process mesos-master matching "/usr/sbin/mesos-master --work_dir=/var/run/mesos --ip=10.0.20.43 --hostname=bfeprdmes001 --zk=zk://10.0.20.43:2181/mesos --cluster=PROD --quorum=1 2 " start program "/usr/local/bin/start_mesos" # I had to create custom start/stop scripts for mesos as monit would not launch it successfully via the standard command you see in the matching portion. stop program "/usr/local/bin/kill_mesos" if failed host localhost port 5050 then restart if changed pid then restart alert firstname.lastname@example.org
check process singularity matching "java -jar /opt/Singularity/SingularityService/target/SingularityService-0.4.1-shaded.jar server /opt/Singularity/singularity.yaml" start program "/usr/local/bin/start_singularity" stop program "/usr/local/bin/kill_singularity" if failed host localhost port 8082 then restart if changed pid then restart alert email@example.com
you can adapt these to match pretty much any situation/pid/process/or even exit status of individual tasks.
M/Monit is the centralized management website for monit. It makes our job very easy and is a great asset to the systems department.
If the ID file is duplicated on multiple machines (this can happen if you clone the system including the Monit ID file) then several Monit instances will update the same host entry in M/Monit
1. Change the Monit ID. If you use Monit 5.8 or newer, use monit -r to reset the ID. For older Monit versions just remove the ID file. For example: rm -f ~/.monit.id (the location can have been changed with the “set idfile” statement in .monitrc),
Source Monit FAQ