Last week I have been migrating logging from Papertrails to my new setup. Papertrail was nice for some time but is a bit costy for larger amount of logs and is especially costy for longer-term log storage.
My main use case of logging has been error and intrusion monitoring. The minor use cases have been visitor analytics (usually handled by Google Analytics), performance monitoring (call logs with execution times) and audit logs (whether an action was performed or not).
Once a good logging infrastructure is in place, specific monitors can be built onto it. For example, I have a couple of shell scripts, executed periodically by cron, that check various services and write PASS/FAIL result to a log file (and now directly into rsyslog). The logs are watched for specific patterns (containing a string "FAIL" in this case) and alerts are automatically sent to my mail address.
I have found it very effective to debug web applications by filtering and correlating different logs (web server access log + application stderr log). Doing this with basic text tools (grep etc.) is not easy and not even possible when one of the log files does not contain timestamps.
A one way to solve logging and monitoring is to log and monitor at each individual machine. There are tools like logwatch for this. I did not find this effective. While alerts work, you still need to grep and match multiple logs separatedly to debug anything.
I looked through various self-hosted and cloud alternatives to Papertrail:
- AWS CloudWatch (cannot add non-AWS sources, useless to me)
- Loggly (cloud, too expensive)
- Logentries (cloud, too expensive, used them before Papertrail)
- ELK stack (self-hosted, looks too complex to work reliably, written in Java)
- Fluentd (self-hosted, looks like syslog reimplementation, no UI for browsing logs)
- Graylog (self-hosted, looks too complex to work reliably, written in Java)
- Splunk (self-hosted but licenses might cost a lot)
Centralized logging requires a common protocol. All the solutions above, excluding CloudWatch, including Papertrail, support the syslog protocol. Papertrail even maintains a custom syslog-based log forwarder. This made it easy to test out the solutions (no need to install lots of fancy daemons!).
Almost every Linux installation has a syslog-compatible log agent already installed. There is a high chance that it is rsyslog. It comes by default on Debian, Ubuntu, RedHat, etc. Playing around with rsyslog while configuring it to forward logs to those services above gave me an idea that rsyslog can be used itself to collect logs.
The rsyslog description is quite broad and does not describe a particular use case. It is described as
swiss army knife of logging
It has lots of features. Grasping everything took lots of reading. It is able to do the following (and a lot more but this is related to our use case):
- Forward logs to remote machines (securely, over TLS).
- Watch log files (and not break with log rotation).
- Assign custom tags to log files.
- Store logs in MySQL (my choice for storing logs).
- Do all of this in very performant manner.
What it cannot do:
- Stateful alerts (i.e on 2 events in last hour).
- Browse logs (UI).
- Remove old logs (depends on storage).
I have built a custom analytics application before and knew exactly what to do once I have logs in the database. So the plan was to get all machines to forward logs to the central rsyslog, make it save logs into the MySQL database, and then throw together a small app that implements the alerts, log browser and custom log retention.
Rsyslog has one huge downside: its configuration format depends on the version. There is the legacy format (documented somewhere at all?) and the new RainerScript syntax. Many of my machines are running Debian Wheezy that has rsyslog 5.x and supports legacy syntax mostly (this depends on the specific rsyslog plugins too!). I have stuck with the legacy syntax for now for the sake of consistency. I will use the new format for new machines, running Debian Jessy at least. The documentation for the legacy format seems to come mostly from the documention of specific rsyslog modules.
Secure forwarding (client)
The secure log forwarding uses transmission encryption through TLS. It is documented here. The TLS setup uses self-generated CA (root) certificate. The official guide generates separate certificate to each machine and verifies individual machines. It is also possible to setup a single certificate and make server/clients verify only the certificate (as used here). This still provides encryption for log transmission.
I also made these decisions in the client configuration:
- Only forward
local7facility (my own set of files).
- Discard forwarded messages locally (fewer files with the same data).
I also added the
$LocalHostName directive at the top of the configuration file. This makes sure that the forwarded messages contain the actual hostname, not something generic like
Receiving messages (server)
The server TLS configuration is similar to the client configuration. The MySQL configuration is automatically set up by installing
rsyslog-mysql on Debian. ALL logs from rsyslog (from both local and remote sources) will be stored in the database now. As I did not want to log everything, I made a change to rsyslog MySQL configuration to only store messages from the
Rsyslog contains a module to watch log files. Warning! It is tricky to get the configuration (you need legacy directives on Debian Wheezy!) right if you have many files to watch. My busiest server has 42. In order to avoid errors I wrote a small Node.js script to validate the configuration. It checks for each file block:
- File's facility is set to
- File's custom tag is set and is unique.
- File's state file is
- File actually exists.
- File's severity is either
- The file monitor start directive exists.
Errors like non-unique state files cause missing logs.
What I watch
For each app I watch:
- Each processes
- Audit log (if it has).
- Web server access log (if web app).
- Web server error log (if web app).
All logs for a the app are grouped by the tag prefix
Besides app logs I also watch:
- Machine's supervisord (I use it as a process manager) log.
- Machine ping logs (machine uptime log).
- Logged status codes of a list of URLs (app uptime log).
I currently do not watch/forward SSH login failures, mail arrival, etc. I have no idea what exactly to do with these. I do not have password login for SSH so that brute force attach is not going to get someone into the system.
I built my own application to implement alerts. It works by chunking/stream processing the log table by primary key (autoincremented - ordered by time!). This makes it very efficient. No log line is polled twice by a single alert. I have not yet released the application but I plan to do it sometimes. I have currently implemented pattern alerts with mails (with custom throttle time for sending mails) but I also want to implement rate (X lines per sec) and idle (no line received in certain period) alerts.
I tested the system with message rate of 5k per second (I usually get 10-100 per second). MySQL showed some high iowait with 5k rate. It went away after switching to periodic flush (from flush after commit). The logs table uses InnoDB engine. This can be configured by setting
innodb_flush_log_at_trx_commit = 0
This also means that 1 second of transactions can be lost on server crash. That was fine for me. I also looked at switching to MyISAM table engine but with alert reads and heavy concurrent inserts it might perform worse.
I have added two indexes to the table. One on
SysLogTag and the other on
FromHost. The average row size for 1.7M rows is 290 bytes. The average message length is 125 bytes. This makes overhead 2.32 (it is actually better as this does not consider metadata such as tag, timestamp and host).
I have built a simple Node.js app for log browsing. It started from live tail view. Live tail and search is implemented by a single simple SQL query. Live browsing is very fast, a typical request from web app to display live tail for "Everything" takes < 10ms. For search it depends how many logs it actually has to scan. Good performance came as a suprise and I'm very happy with it.
The app groups different logs by tag prefix. The currently displayed or searched group is selected by a two-level dropdown. The first level is machine name and the second level is the app name. I currently have 8 machines and about 30 apps.
I plan to release the log browser, rsyslog conf checker, and the alert system as Open Source at some point of time. I will write a longer post on it. Until that time comes, the system will be extensively used by me and my clients for monitoring various projects.
I have Open Sourced the log viewer.
A link to the article about performance tips for rsyslog-mysql setup.