How to set up a Nagios monitoring system to use with DTC.
This is a page header set up by don@bowenvale.co.nz with notes on how to set up Nagios to monitor your systems.
My trial systems are running Debian Squeeze.
Edit by Florian Heigl:
Ideally, for a very quick Nagios install, use the Open Monitoring Distribution ("OMD"). It comes with an automatic Nagios installer and various other things that work much better than the alternative install methods.
That aside, I had promised on the mailing list to write down my mail / DTC monitoring along with the NagVis map, so maybe we can work on that together. I'm sure some things are still missing from my monitoring; we can compare once Don has added his Nagios settings.
Check that about 500 MB is free in /opt, or prepare a corresponding symlink to a location with enough space.
wget http://omdistro.org/attachments/download/118/omd-0.50_0.squeeze_amd64.deb
apt-get install gdebi
gdebi omd-0.50_0.squeeze_amd64.deb
omd create dtcmon
omd start dtcmon
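The free-space check that precedes the install can also be scripted. This is only a minimal sketch: the function name has_free_space is made up for illustration, and the 500 MB figure is simply the rough number quoted above, not an official OMD requirement.

```python
import shutil

REQUIRED_MB = 500  # the "~500MB" figure from the text; an estimate, not an OMD constant


def has_free_space(path, required_mb=REQUIRED_MB):
    """Return True if the filesystem holding `path` has at least required_mb MiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_mb * 1024 * 1024
```

Calling has_free_space("/opt") on the monitoring box before running gdebi tells you whether the symlink workaround is needed.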
If you want to use the check_mk addon, the next steps would be:
apt-get install check-mk-agent check-mk-agent-logwatch
Run this on the DTC box, then set up connectivity (xinetd, firewalling) to port 6556 on the DTC box, unless the monitoring runs on that same box anyway. It's also possible to use SSH transport or stunnel; this is hidden in the docs linked further down under something called DATASOURCE_PROGRAM. Actually, *anything* is possible. But the agent never processes or accepts any input anyway, so let's not get off track.
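Since most problems turn out to be connection issues, it can help to test the agent port from the Nagios box before going further. The sketch below assumes the plain TCP transport on port 6556 and relies on the agent dumping its output and closing the connection; the function names are hypothetical helpers, not part of check_mk.

```python
import socket

AGENT_PORT = 6556  # the port named in the text above


def fetch_agent_output(host, port=AGENT_PORT, timeout=5.0):
    """Connect to the agent, read until it closes the connection, return the text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # the agent closes the socket when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def looks_like_agent(output):
    """The agent output starts with a <<<check_mk>>> section header."""
    return output.startswith("<<<check_mk>>>")
```

For example, looks_like_agent(fetch_agent_output("dtc.domain.com")) should be true once xinetd and the firewall are set up. A timeout or a refused connection points at the firewall or xinetd rather than at check_mk itself.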
Then, on the Nagios box:
- Switch to the user associated with this Nagios "site"
su - dtcmon
- Add the DTC server to your check_mk config (which lives in etc/check_mk and its conf.d subdirectory),
- then run an inventory of the DTC server, which detects all *basic* services
echo "all_hosts += [ 'dtc.domain.com|tcp' ]" >> ~/etc/check_mk/conf.d/dtc.mk
cmk -I dtc.domain.com # case must match
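The .mk files are plain Python, so the line appended by the echo command simply extends a list. The sketch below only illustrates what that entry contains; the empty all_hosts list stands in for the one check_mk provides, and split_host_entry is a hypothetical helper, not check_mk's own parser.

```python
# Stand-in for the list that check_mk defines before it reads conf.d/*.mk
all_hosts = []

# This is the line the echo command above appends to dtc.mk
all_hosts += ['dtc.domain.com|tcp']


def split_host_entry(entry):
    """Split 'hostname|tag1|tag2' into (hostname, [tags])."""
    name, *tags = entry.split('|')
    return name, tags


parsed = [split_host_entry(e) for e in all_hosts]
known_hosts = {name for name, _ in parsed}

# Host names are compared as exact strings here, hence "case must match"
case_mismatch_found = 'DTC.domain.com' in known_hosts
```

Here the part before the first "|" is the host name and "tcp" is a host tag, which is why the name given to cmk -I has to be spelled exactly like the name in the config.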
It should print something like this:
(Some of these checks will not show up, since they rely on extra plugins.)
apt.upgrades 1 new checks
cpu.loads 1 new checks
cpu.threads 1 new checks
df 1 new checks
diskstat 1 new checks
kernel 3 new checks
kernel.util 1 new checks
lnx_if 1 new checks
logwatch 3 new checks
mem.used 1 new checks
mounts 1 new checks
mysql_check 9 new checks
mysql_status 16 new checks
postfix_mailq 1 new checks
ps.perf 22 new checks
tcp_conn_stats 1 new checks
uptime 1 new checks
Right now you can already run a manual check of the DTC box based on what has been found. It doesn't even need a running Nagios.
# cmk -v dtc.example.com
Check_mk version 2011.09.30
CPU load OK - 15min Load 1.00 at 2 CPUs
CPU utilization OK - user: 35.6%, system: 0.4%, wait: 0.2%
Database apachelogs OK - No errors found
Database dtc OK - No errors found
Database floh_phtagr OK - No errors found
Database phtagr OK - No errors found
Database roundcube OK - No errors found
Database wordpress OK - No errors found
Counter wrapped, not handled by check, ignoring this check result: diskstat.SUMMARY.read: Counter initialization
Interface eth0 OK - [2] (up) speed unknown
Kernel Context Switches OK - 94/s in last 27 secs
Kernel Major Page Faults OK - 0/s in last 27 secs
Kernel Process Creations OK - 3/s in last 27 secs
LOG /var/log/kern.log OK - no old or new error messages
LOG /var/log/mail.warn WARN - error messages present!
LOG /var/log/messages OK - no old or new error messages
Memory used OK - 0.85 GB used (0.52 GB RAM + 0.33 GB SWAP, this is 114.0% of 0.75 GB RAM)
Mount options of / OK - mount options are data=ordered,errors=remount-ro,relatime,rw
MySQL Status Created_tmp_files OK - count 5 (levels at 0/0)
MySQL Status Innodb_buffer_pool_pages_dirty OK - count 0 (levels at 20/40)
MySQL Status Innodb_buffer_pool_pages_free OK - count 1 (levels at 0/0)
MySQL Status Innodb_buffer_pool_pages_total OK - count 512 (levels at 0/0)
MySQL Status Innodb_data_pending_fsyncs OK - count 0 (levels at 50/100)
MySQL Status Innodb_os_log_pending_fsyncs OK - count 0 (levels at 5/10)
MySQL Status Innodb_row_lock_current_waits OK - count 0 (levels at 5/10)
MySQL Status Innodb_row_lock_time_avg OK - count 0 (levels at 0/0)
MySQL Status Innodb_row_lock_time_max OK - count 0 (levels at 0/0)
MySQL Status Open_files OK - count 46 (levels at 100/200)
MySQL Status Qcache_free_memory OK - count 412136 (levels at 0/0)
MySQL Status Qcache_hits OK - count 40614 (levels at 0/0)
MySQL Status Qcache_not_cached OK - count 7225 (levels at 0/0)
MySQL Status Slow_launch_threads OK - count 0 (levels at 5/10)
MySQL Status Slow_queries OK - count 3 (levels at 5/10)
MySQL Status Table_locks_waited OK - count 0 (levels at 100/500)
Number of threads OK - 181 threads
Postfix Queue OK - The mailqueue is empty
TCP Connections OK - ESTABLISHED: 2, CLOSE_WAIT: 10, TIME_WAIT: 3
Uptime OK - up since Mon Jun 13 17:53:00 2011 (131d 23:56:28)
apt.upgrades OK - All packages are up to date.
fs_/ OK - 19.0% used (5.98 of 31.5 GB), (levels at 80.0/90.0%), trend: +21.46MB / 24 hours
proc_Amavis OK - 7 processes
proc_Apache2 OK - 7 processes
proc_Bacula FD OK - 1 processes
proc_ClamAV OK - 1 processes
proc_Courier Authdaemon OK - 7 processes
proc_Courier IMAP OK - 5 processes
proc_Courier POP OK - 4 processes
proc_DTC Stats OK - 1 processes
proc_DkimProxy OK - 6 processes
proc_MySQL OK - 1 processes
proc_Postfix Pickup OK - 1 processes
proc_Postfix Postmaster OK - 1 processes
proc_Postfix Queue Manager OK - 1 processes
proc_Postfix TLS manager OK - 1 processes
proc_SSH OK - 1 processes
proc_SpamD OK - 1 processes
proc_bind OK - 1 processes
proc_fail2ban OK - 1 processes
proc_syslog OK - 1 processes
proc_xinetd OK - 1 processes
OK - Agent version 1.1.11i1, execution time 5.1 sec|execution_time=5.076
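With this many lines it is easy to overlook the one service that is not OK (the mail.warn log above). A rough way to filter the output is sketched below. It assumes each result line has the shape "service STATE - details" with the states OK, WARN, CRIT and UNKNOWN; the sample lines are copied from the output above, and the helper names are made up.

```python
import re

# Assumed line shape: "<service description> <STATE> - <details>"
LINE_RE = re.compile(
    r"^(?P<name>.+?)\s+(?P<state>OK|WARN|CRIT|UNKNOWN) - (?P<detail>.*)$"
)

SAMPLE = """\
CPU load OK - 15min Load 1.00 at 2 CPUs
LOG /var/log/mail.warn WARN - error messages present!
Postfix Queue OK - The mailqueue is empty
proc_Bacula FD OK - 1 processes
OK - Agent version 1.1.11i1, execution time 5.1 sec|execution_time=5.076
"""


def parse_results(text):
    """Return (service, state, detail) tuples; lines without a service name are skipped."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results.append(
                (match.group("name"), match.group("state"), match.group("detail"))
            )
    return results


def not_ok(results):
    """Keep only the services whose state is something other than OK."""
    return [(name, state) for name, state, _ in results if state != "OK"]
```

The final summary line has no service name in front of its state, so this sketch skips it on purpose.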
If it has detected new checks and things look reasonably OK (like above), then it's time to use the following command.
If things are NOT working, it's 99.999% likely to be a connection issue.
cmk -R
Generating Nagios configuration...OK
Validating Nagios configuration...OK
Precompiling host checks...OK
Restarting Nagios...OK
This generates a new Nagios configuration and applies it.
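If you wrap cmk -R in a script, you may want to confirm that every step reported OK. The sketch below only looks at the "step...STATUS" shape of the four lines shown above. The "FAILED" marker in the usage note is made up for illustration, since I have not recorded what a failing step actually prints.

```python
def failed_steps(output):
    """Return the steps whose line does not end in '...OK'."""
    failed = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        step, sep, status = line.rpartition("...")
        if not sep:
            failed.append(line)  # unexpected line without a status marker
        elif status != "OK":
            failed.append(step)
    return failed
```

For the output shown above this returns an empty list. A line such as "Validating Nagios configuration...FAILED" (hypothetical) would be returned as "Validating Nagios configuration".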
After that you should be able to go to
http://monitoringbox/dtcmon/ and find a startup banner there.
Select one of the Nagios GUIs there: the standard Nagios GUI (I haven't used that in a year), "Thruk", which is a faster version of the old GUI, and Multisite, which is the one that comes with check_mk.
Docs are at:
http://mathias-kettner.de/check_mk.html
As an example, a config file (mail_procs.mk) that sets up the Nagios monitoring for DTC's mail services would look like this:
# mailserver
inventory_processes_perf += [
( "Courier Authdaemon", "~.*courier-authlib/authdaemond", ANY_USER, 0, 4, 10, 20),
( "Courier IMAP", "~.*imaplogin /usr/bin/imapd", ANY_USER, 0, 2, 60, 70),
( "Courier POP", "~.*pop3login /usr/lib/courier/courier/courierpop3d", ANY_USER, 0, 2, 60, 70),
( "Amavis", "~.*amavisd", ANY_USER, 0, 2, 8, 12),
( "Postfix Postmaster", "~.*postfix/master", ANY_USER, 0, 1, 2, 6),
( "Postfix Pickup", "~.*pickup -l", ANY_USER, 0, 1, 2, 6),
( "Postfix Queue Manager", "~.*qmgr -l", ANY_USER, 0, 1, 2, 6),
( "Postfix Queue show", "~.*showq -l", ANY_USER, 0, 1, 2, 6),
( "Postfix TLS manager", "~.*tlsmgr -l", ANY_USER, 0, 1, 2, 6),
( "ClamAV", "~.*/usr/sbin/clamd", ANY_USER, 0, 1, 2, 6),
( "DkimProxy", "~.*/usr/sbin/dkimproxy.in", ANY_USER, 0, 4, 10, 20),
( "DkimProxy", "~.*/usr/sbin/dkimproxy.out", ANY_USER, 0, 4, 10, 20),
( "SpamD", "~.*sbin/spamd", ANY_USER, 0, 1, 60, 70),
]
Replacing ANY_USER with a specific username restricts the check to processes running as that user. The four numbers at the end set the process counts to expect, i.e. the lower and upper levels (see cmk -M ps for that, it's a bit tricky).
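To make the tuples a little less tricky, here is how I read the last four numbers and the "~" prefix. Treat both as assumptions to verify against cmk -M ps: the numbers are taken to be warnmin, okmin, okmax and warnmax in that order, and "~" is taken to mark a regular expression matched from the start of the command line. The function names are hypothetical and this is not check_mk's own code.

```python
import re


def pattern_matches(pattern, command_line):
    """Assumption: '~' marks a regex anchored at the start of the command line."""
    if not pattern.startswith("~"):
        raise ValueError("this sketch only handles '~' regex patterns")
    return re.match(pattern[1:], command_line) is not None


def ps_state(count, warnmin, okmin, okmax, warnmax):
    """Assumed level semantics: outside [warnmin, warnmax] is CRIT,
    outside [okmin, okmax] is WARN, anything else is OK."""
    if count < warnmin or count > warnmax:
        return "CRIT"
    if count < okmin or count > okmax:
        return "WARN"
    return "OK"
```

Under that reading, the 7 authdaemond processes from the earlier output give ps_state(7, 0, 4, 10, 20) == "OK", while 2 processes would give "WARN" and 25 would give "CRIT".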
The other parts for DTC:
inventory_processes_perf += [
( "MySQL", "~.*sbin/mysqld", ANY_USER, 0, 1, 5, 8),
( "MySQL", "~.*libexec/mysqld", ANY_USER, 0, 1, 5, 8),
( "bind", "~.*sbin/named", "bind", 0, 1, 2, 4),
( "syslog", "~.*syslogd", ANY_USER, 0, 1, 2, 6),
( "syslog", "~.*syslog-ng", ANY_USER, 0, 1, 5, 10),
( "xinetd", "~.*xinetd", ANY_USER, 0, 1, 5, 10),
( "Apache2", "~.*sbin/apache2", ANY_USER, 1, 4, 30, 50),
( "Apache2", "~.*sbin/httpd", ANY_USER, 1, 4, 30, 50),
( "DTC Stats", "~.*share/dtc/admin/dtc-stats-daemon.php", ANY_USER, 0, 1, 2, 2),
( "Nagios", "~.*bin/nagios", ANY_USER, 0, 1, 60, 70),
( "DTC-Xen Daemon", "~.*usr/sbin/dtc-soap-server", ANY_USER, 0, 1, 1, 2),
]
Here are two screenshots, and I'll leave it at that before it all starts to sound like an advertisement.
But I promised to write the docs. :>
http://www.wartungsfenster.de/resources/cmk_dtc.JPG
http://www.wartungsfenster.de/resources/cmk_dtc2.JPG
I noticed I broke my NagVis map. I'll add that once it's working again :(
TODOs I thought of:
- FS alert levels + Trend alerts
- FS Quota reporting (that's not available atm)
- Adding the MySQL, apt update check, SMART, lmsensors and Xen plugins
- Set averaged bandwidth alerts (e.g. 60 min average traffic exceeds 75% of wire rate)
- logwatch.cfg rules specially for DTC / Mail
- adding monitoring for the DTC cron job log
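For the averaged bandwidth alert in the list above, the arithmetic would be roughly the following. This is only a sketch of the idea, not a check_mk feature or plugin: the function names are made up, and the samples are assumed to be per-minute traffic readings in bits per second.

```python
def average_bps(samples_bps):
    """Plain arithmetic mean of the traffic samples (bits per second)."""
    return sum(samples_bps) / len(samples_bps)


def bandwidth_alert(samples_bps, wire_rate_bps, threshold=0.75):
    """True if the averaged traffic exceeds threshold * wire rate."""
    if not samples_bps:
        return False  # no data, nothing to alert on
    return average_bps(samples_bps) > threshold * wire_rate_bps
```

With sixty one-minute samples on a 100 Mbit/s link, a steady 80 Mbit/s triggers the alert, while a single one-minute spike to wire rate in an otherwise quiet hour does not. That is the point of averaging.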
Please add any obvious stuff I'm missing, e.g. defining contacts and setting up notifications (I hate that part).