I was asked about getting some "monitoring" going the other day, and decided to ask someone with expertise. I told Wei that we had just had an issue with our logs filling up the /var/ partition, and that I wanted to make sure it didn't happen again. His answer was pretty enlightening; he said: "Don't do that. Fix the problem."
I'm not a sysadmin (anymore), but this was almost something of a zen revelation. Once you've identified a problem, you should just fix it.
People set up massively complex systems with multiple tiers of notifications, but most of what those systems generate is noise. Use logging and analysis to set yourself up for post mortems and trend identification: things like "we need a bigger filesystem" or "we need more nodes" (or app servers, or rodbs). Monitoring should notify you only of unexpected, outlying conditions; if you get an email or a page from a monitoring system that you expected, something has been botched.
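
To make that split concrete, here is a minimal sketch in Python. It is not anything Wei showed me: the function names, the "/var" default, and the three-standard-deviation cutoff are all my own assumptions for illustration. The idea is that every reading gets logged so the trend is there for later analysis, and a notification is only warranted when a reading is far outside what the history says to expect.

```python
import logging
import shutil
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("disk-trend")


def disk_usage_fraction(path="/var"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_outlier(history, current, threshold=3.0):
    """Return True if current sits more than `threshold` standard
    deviations above the mean of the earlier readings in history.

    The threshold of 3.0 is an arbitrary illustrative choice.
    """
    if len(history) < 2:
        # Not enough data to know what "expected" looks like yet.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Perfectly flat history: any increase is unexpected.
        return current > mean
    return (current - mean) / stdev > threshold


def record_and_check(history, current):
    """Log every reading for trend analysis; return True only when the
    reading is unexpected enough to justify notifying a human."""
    log.info("disk usage fraction: %.3f", current)
    unexpected = is_outlier(history, current)
    history.append(current)
    return unexpected
```

Something like `record_and_check(history, disk_usage_fraction())` run on a schedule would cover both halves: the log lines feed the "we need a bigger filesystem" conversation, and the return value stays False unless something genuinely surprising happens.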