Fault-tolerant Zabbix monitoring cluster

Let’s say you have decided NOT to use a big, complex solution from SolarWinds, but you are still looking for a great monitoring solution. In that case, Zabbix is your best friend. Its flexibility can be a great asset for a company, although it is still not a perfect solution. Here is a simple way to build a fault-tolerant Zabbix cluster.

What we would like to achieve: a fault-tolerant front end (Zabbix server + web UI), a fault-tolerant back end (database cluster), and the ability to monitor thousands of hosts with hundreds of items/parameters each, with the data gathering period set to 1 minute, so that we keep a huge amount of historical and current data. I tested this solution with a HIGH number of gathered parameters (more than 300k items queried every minute) and a HUGE database (~1 TB). The solution worked fine, without any noticeable delays in queries. Note: the efficiency of the solution depends on the assigned resources and on the server configuration. A resource shortage or a bad configuration will lead to degraded performance, glitches, and potential instability of the cluster.
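To put the tested load into perspective, here is a small back-of-the-envelope calculation. It only uses the figures quoted above (300k items, a 1-minute interval); the function name is my own, and real sizing should also account for triggers, trends, and housekeeping, which this sketch ignores.

```python
def new_values_per_second(items: int, interval_seconds: float) -> float:
    """Average number of values the server has to process per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return items / interval_seconds


# 300k items, each polled once every 60 seconds (figures from the test above)
rate = new_values_per_second(300_000, 60)
values_per_day = rate * 86_400  # 86,400 seconds in a day

print(f"{rate:.0f} new values per second")   # 5000 new values per second
print(f"{values_per_day:.0f} values per day")  # 432000000 values per day
```

Roughly 5,000 new values per second is the number to keep in mind when you assign CPU, memory, and disk to the database nodes.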

The solution itself:
Front end and Zabbix server = Nginx + Zabbix web UI + native Zabbix HA (2 servers are enough).
Database cluster = MariaDB + Galera (I advise using 3 or more database servers, but it still works fine with 2. Note: yep, this is subject to assigned resources and server configuration).
Zabbix proxy = dockerized Zabbix proxy (if you need to run custom scripts, you will need to rebuild the Docker image). Note: yep, this is also subject to assigned resources and server configuration (and container configuration).
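One likely reason for the "3 or more database servers" advice is quorum: a Galera cluster keeps accepting writes only while the surviving nodes form a majority. The sketch below is a simplified model of that rule. It assumes equal node weights and an abrupt node failure or network split, and it ignores options such as an arbitrator node or custom weights; the function names are mine, not part of any Galera API.

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """True if the reachable nodes form a strict majority of the cluster."""
    return 2 * reachable_nodes > total_nodes


def tolerated_failures(total_nodes: int) -> int:
    """Largest number of nodes that can fail while a majority survives."""
    return (total_nodes - 1) // 2


for nodes in (2, 3, 5):
    print(f"{nodes} nodes: survives {tolerated_failures(nodes)} abrupt failure(s)")
# 2 nodes: survives 0 abrupt failure(s)
# 3 nodes: survives 1 abrupt failure(s)
# 5 nodes: survives 2 abrupt failure(s)
```

Under this model a two-node cluster runs fine day to day, but an abrupt loss of one node leaves the other without a majority, which is why the third server is worth the resources.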

If you do this with the right resources and the correct configuration, you will get a great monitoring solution.

Fault-tolerant DNS cluster

Yep. For small companies, it might not matter much how you build and configure your DNS servers, since you may only need to serve a few queries per second or even per minute. But good design is the key to success if your goal is to serve thousands and thousands of queries per minute or per second.
Here is an example of a design that will allow you to achieve this goal.

Queries flow from top to bottom (toward the DNS cluster).

The first line, “FW/LB”, is a load balancer. Any will do: it could be a FortiGate firewall working in a pair with another firewall, an F5 load balancer, or a solution based on Cisco devices.
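Whatever device you pick, the job of this line is the same: spread queries across the servers behind it and skip the ones that are down. Here is a minimal sketch of that idea as round-robin selection with a health flag. The class and the backend names (`cache-1` and so on) are hypothetical, and a real load balancer would run its own health checks rather than being told which server is down.

```python
class RoundRobinBalancer:
    """Hands out backends in turn, skipping those marked as down."""

    def __init__(self, backends):
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._healthy = set(self._backends)
        self._next = 0  # index of the backend to try next

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the current position.
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["cache-1", "cache-2", "cache-3"])
print([lb.pick() for _ in range(4)])  # ['cache-1', 'cache-2', 'cache-3', 'cache-1']

lb.mark_down("cache-2")
print([lb.pick() for _ in range(3)])  # ['cache-3', 'cache-1', 'cache-3']
```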

The second line is a group of DNS servers. It is not a cluster but a set of standalone DNS servers configured as DNS caches: they keep the answers to recent DNS queries in memory, and if a record is not present in memory, the DNS server queries the authoritative DNS server. I advise using three DNS servers, but you can use more or fewer, depending on your tasks and goals. Just never use a single server.
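The caching behaviour described above can be sketched in a few lines. This is only an illustration of the idea, not how any particular DNS server implements it: `CachingResolver` and `fake_authoritative` are made-up names, the upstream is a plain function that returns an answer together with its TTL in seconds, and `192.0.2.10` is a documentation-only address.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Answers from memory while a record is fresh, otherwise asks upstream."""

    def __init__(self, upstream: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic):
        self._upstream = upstream   # stands in for the authoritative server
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (answer, expiry)
        self.upstream_queries = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]                      # cache hit, still fresh
        answer, ttl = self._upstream(name)       # cache miss or expired record
        self.upstream_queries += 1
        self._cache[name] = (answer, now + ttl)
        return answer


def fake_authoritative(name: str) -> Tuple[str, float]:
    # Hypothetical upstream: always the same answer with a 300-second TTL.
    return ("192.0.2.10", 300.0)


resolver = CachingResolver(fake_authoritative)
resolver.resolve("www.example.com")
resolver.resolve("www.example.com")
print(resolver.upstream_queries)  # 1
```

The second lookup never reaches the authoritative server, which is exactly what takes the load off the fourth line.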

The third line is “FW/LB” again, with the same purpose: to balance queries between the authoritative DNS servers.

The fourth line (last, but not least important) holds the authoritative DNS servers. Here the best practice is to use two: a primary and a backup server with zone transfer between them. You can also use more servers, or fewer (if you feel lucky); that again depends on how big your system load is.
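For the zone transfer part: in standard DNS, a secondary decides whether it needs a fresh copy of a zone by comparing the serial number in the primary's SOA record with its own. Serials are 32-bit and compared with wrap-around (RFC 1982). Here is a small sketch of that comparison; the function name is mine, and the serial values are just examples.

```python
SERIAL_BITS = 32
MODULUS = 2 ** SERIAL_BITS
HALF = 2 ** (SERIAL_BITS - 1)


def serial_is_newer(candidate: int, current: int) -> bool:
    """True if `candidate` is newer than `current` under RFC 1982 arithmetic."""
    diff = (candidate - current) % MODULUS
    # Newer means "ahead by less than half of the number space".
    return 0 < diff < HALF


# Primary bumped the serial: the secondary should request a transfer.
print(serial_is_newer(2024010102, 2024010101))  # True

# Same serial: nothing to do.
print(serial_is_newer(2024010101, 2024010101))  # False

# Wrap-around: 1 counts as newer than 4294967295.
print(serial_is_newer(1, 4294967295))  # True
```

The practical takeaway: always increase the serial on the primary when you change a zone, otherwise the backup servers will keep serving the old data.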

You can even create a fifth line, call it the “source of truth”, and put the primary (master) DNS server there, leaving only backup (secondary) DNS servers in the fourth line.