Build HA Encrypted DNS Server On Raspberry Pi 5

Alex Johnson

In today's increasingly connected world, having a robust and secure Domain Name System (DNS) infrastructure is more critical than ever. This guide will walk you through the process of setting up a highly available (HA) and encrypted DNS server stack using the power of the Raspberry Pi 5. We'll leverage a combination of cutting-edge tools and techniques to create a self-healing, monitored, and secure DNS solution that can handle production-grade demands, all from a compact and energy-efficient Raspberry Pi.

🎯 Project Overview: Your Bulletproof DNS Infrastructure

Imagine a DNS setup that's not just functional, but practically indestructible. That's the goal here: to build a production-grade, self-healing, high-availability encrypted DNS infrastructure on the formidable Raspberry Pi 5 (8GB RAM). We're not just talking about basic ad-blocking; we're aiming for a comprehensive system complete with monitoring, alerting, and automated remediation. This means if something goes wrong, the system can detect it, alert you, and even fix itself, minimizing downtime and ensuring your network's connectivity remains uninterrupted. This project is designed for those who want granular control over their network's DNS resolution, enhanced privacy through encryption, and the peace of mind that comes with a highly resilient setup. We'll be diving deep into architecture, core services, networking, observability, and even AI-driven self-healing capabilities.

πŸ—οΈ Architecture: The Blueprint for Resilience

To achieve our goal of a bulletproof DNS server, we need a well-thought-out architecture. This involves choosing the right hardware, meticulously selecting the core software stack, defining our networking strategy, and implementing a robust observability layer. Let's break down the components:

Hardware: The Foundation

At the heart of our setup is the Raspberry Pi 5 with a generous 8GB of RAM. This powerhouse single-board computer provides more than enough processing power and memory to handle multiple demanding services simultaneously. For the operating system, we'll be using Raspberry Pi OS (based on Debian Trixie), a stable and well-supported Linux distribution. To keep things organized and ensure consistency, all our service data and configurations will reside within the /opt/stacks/ directory structure. This logical separation makes management and backups significantly easier.

Core Stack: The Engine of Your DNS

DNS Services (Dual Setup for Redundancy)

Redundancy is key for high availability. Therefore, we'll deploy a dual setup for our core DNS services:

  • Pi-hole v6 ×2: We'll run two instances of Pi-hole, acting as our primary and secondary DNS servers. Pi-hole is renowned for its ad-blocking and DNS filtering capabilities. We'll enhance its power with curated regex filters and extensive blocklists, including popular choices like Hagezi and OISD, to ensure maximum ad and tracker blocking. To provide stable and predictable network addresses, Pi-hole containers will be assigned static IPs within the 192.168.8.x range using macvlan networks. This allows them to have their own unique IP addresses on your physical network, simplifying management and troubleshooting.
  • Unbound ×2: Complementing Pi-hole, we'll deploy two instances of Unbound, a validating, recursive, and caching DNS resolver. We'll use the mvance/unbound-rpi:latest Docker image, known for its efficiency on ARM architectures. Unbound will be responsible for DNSSEC validation, ensuring the authenticity of DNS records. Each Unbound instance will sit behind its respective Pi-hole instance, handling the recursive lookups. They will be assigned static IPs 192.168.8.240 and 192.168.8.241 respectively, acting as the upstream DNS servers for our Pi-hole instances.
  • CoreDNS (Optional Proxy/Sidecar): For added flexibility and potential load balancing or advanced routing scenarios, CoreDNS can be employed as an optional proxy or sidecar. While not strictly necessary for a basic HA setup, it offers advanced configuration possibilities.
  • Gravity Sync: To ensure consistency between our two Pi-hole instances, we'll implement Gravity Sync. This tool automatically replicates Pi-hole's configuration, blocklists, and settings from the primary to the secondary, guaranteeing that both servers are always in sync and providing seamless failover.

High Availability: The Self-Healing Heartbeat

High availability is achieved through several mechanisms:

  • keepalived: This essential tool will manage Virtual Router Redundancy Protocol (VRRP), enabling a floating virtual IP (VIP). This VIP will be the primary address clients use to access the DNS service. If the active node fails, keepalived automatically transfers the VIP to the backup node, ensuring near-instantaneous failover with minimal disruption.
  • Dual-node setup: We'll configure two Raspberry Pi 5s (or one Pi 5 acting as a dual-container host, managed carefully) for automatic failover. The VIP will float between these nodes based on their health.
  • Health checks: Continuous, automated health checks will monitor the status of our DNS services. If a service becomes unresponsive, keepalived and our automation scripts will detect it and initiate failover or remediation.
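
To make the failover mechanism concrete, here is a minimal keepalived.conf sketch for the master node. The interface name (eth0), the 192.168.8.250 VIP, and the dns-health exit-code contract are illustrative assumptions for this guide, not fixed values:

# /etc/keepalived/keepalived.conf on the MASTER node (illustrative sketch)
vrrp_script chk_dns {
    script "/opt/stacks/scripts/dns-health"   # must exit 0 when the local DNS chain is healthy
    interval 5       # run every 5 seconds
    fall 2           # 2 consecutive failures -> mark this node faulty
    rise 2           # 2 consecutive successes -> healthy again
}

vrrp_instance DNS_VIP {
    state MASTER              # use BACKUP on the second node
    interface eth0            # assumption: adjust to your NIC
    virtual_router_id 53
    priority 150              # backup node uses a lower value, e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass <shared_secret>
    }
    virtual_ipaddress {
        192.168.8.250/24      # assumption: pick a free LAN address as the DNS VIP
    }
    track_script {
        chk_dns
    }
}

Clients then point at 192.168.8.250 only; whichever node currently holds the VIP answers their queries.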

Mesh Networking: Secure Connectivity

For secure communication between nodes, especially if you expand this setup across different physical locations or want a hardened internal network, we'll use Nebula 1.9.7. Nebula creates a secure, encrypted overlay mesh network. We'll set up a Certificate Authority (CA), lighthouse nodes (for initial connection discovery), and node certificates to ensure that all communication within the mesh is encrypted and authenticated. Configuration files will be neatly organized under /opt/stacks/nebula.
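
Following the standard Nebula workflow, certificate generation comes down to a handful of nebula-cert commands. The CA name, node names, and overlay IPs below are placeholders:

# On a secure machine, create the CA (keep ca.key offline if possible)
nebula-cert ca -name "Home DNS Mesh"

# Sign a certificate for each node; the overlay IP is baked into the certificate
nebula-cert sign -name "lighthouse" -ip "10.0.0.1/24"
nebula-cert sign -name "dns-node1" -ip "10.0.0.2/24"
nebula-cert sign -name "dns-node2" -ip "10.0.0.3/24"

# Copy ca.crt plus each node's .crt/.key into /opt/stacks/nebula on that node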

Container Platform: Modern Deployment

To manage these diverse services efficiently, we'll rely on a robust containerization platform:

  • Docker Engine 29 (rootful): We'll use the latest stable version of Docker, configured in rootful mode for maximum performance and compatibility. This allows Docker to manage system resources directly.
  • containerd v2.1.5: This is the underlying container runtime that Docker uses, ensuring efficient container lifecycle management.
  • Docker Compose v2 plugin: For defining and managing multi-container Docker applications, we'll use the modern Docker Compose V2 plugin.
  • Portainer: To provide a user-friendly web interface for managing our Docker containers, stacks, and networks, we'll deploy Portainer CE. This simplifies deployment, monitoring, and troubleshooting.

🌐 Networking Architecture: The Communication Grid

A well-defined networking architecture is crucial for service isolation, security, and seamless communication. We'll be using several Docker networks tailored to specific needs:

  1. dns_net (macvlan): This network is critical for our DNS services. Using the macvlan driver, containers on this network will obtain their own unique MAC addresses and IP addresses directly from your physical network (specifically, the 192.168.8.0/24 subnet). This approach treats each container as if it were a separate physical device on your LAN, which is ideal for services like Pi-hole and Unbound that need to respond directly to network requests. One caveat of macvlan is that the host cannot talk to its own macvlan containers directly, so an optional macvlan shim interface can be added on the Raspberry Pi to restore host-to-container communication if needed (a creation sketch follows this section's summary).

  2. observability_net (bridge): This network will be a standard bridge network (with IPs in the 172.20.x.x range) dedicated to our monitoring stack (Prometheus, Grafana, Loki, etc.). This provides network isolation for these services, preventing them from interfering with or being directly exposed by the DNS network. Communication between the DNS services and the observability stack will be carefully managed.

  3. Nebula overlay: This is a unique, encrypted overlay network created by Nebula. It forms a secure, private mesh connecting your Raspberry Pi nodes (and potentially other Nebula-enabled devices across different networks). All traffic within this overlay is encrypted end-to-end, providing a secure channel for inter-node communication and site-to-site connectivity, regardless of the underlying physical network topology.

This layered networking approach ensures that each component has the necessary connectivity while maintaining strong security boundaries.
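
As a concrete sketch of how dns_net and the optional macvlan shim from item 1 might be created (the parent interface eth0 and the .2 shim address are assumptions; pick values that match your LAN):

# Create the macvlan network that Pi-hole and Unbound will attach to
docker network create -d macvlan \
  --subnet=192.168.8.0/24 \
  --gateway=192.168.8.1 \
  -o parent=eth0 \
  dns_net

# Optional host-side macvlan "shim" so the Pi itself can reach the containers
# (a host cannot talk to its own macvlan children over the parent interface)
sudo ip link add macvlan-shim link eth0 type macvlan mode bridge
sudo ip addr add 192.168.8.2/32 dev macvlan-shim        # assumption: an unused host-side address
sudo ip link set macvlan-shim up
sudo ip route add 192.168.8.53/32 dev macvlan-shim      # route to pihole1; repeat per container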

📊 Observability Stack: Seeing Inside Your System

What gets measured, gets managed. Our observability stack provides deep insights into the health and performance of our DNS infrastructure.

Metrics & Dashboards: Visualizing Performance

  • Prometheus: This is our time-series database and monitoring system. Prometheus will be configured to scrape metrics from various components. This includes specialized exporters for Pi-hole and Unbound, a general-purpose Node Exporter to gather host-level system metrics (CPU, memory, disk, network), and metrics from our custom AI-Watchdog. By collecting these metrics, Prometheus builds a historical record of your system's performance.
  • Grafana: The visualization layer. Grafana will connect to Prometheus as a data source to create operational dashboards. These dashboards will provide real-time and historical views of critical data points such as DNS performance (latency, query volume), system health (resource utilization), service availability, query statistics, and ad/tracker block rates. Custom dashboards will be designed to give you an at-a-glance overview of your entire DNS infrastructure.
  • Alertmanager: This component handles alerts generated by Prometheus. It routes alerts to the appropriate notification channels, deduplicates them, and silences them if necessary. We'll configure it to send notifications via Signal (or your preferred messaging platform), ensuring you're promptly informed of any issues.

Logging: The System's Diary

  • Loki: A log aggregation system inspired by Prometheus and designed to be simple to operate. Loki indexes metadata about logs rather than their full content, which keeps it lightweight and efficient. It will collect logs from all our containers and the host system.
  • Promtail: This is the agent responsible for discovering log files on our Raspberry Pi nodes and shipping them to Loki. Promtail ensures that all relevant logs are captured and sent for central aggregation.

Exporters: Data Collectors

To feed data into Prometheus, we need specific data collectors (exporters):

  • Pi-hole v6 exporter: (currently in development) This will expose key Pi-hole metrics and is a crucial component for monitoring Pi-hole's performance and status.
  • Unbound exporter: Provides detailed metrics about Unbound's recursive resolution activities, cache performance, and DNSSEC validation status.
  • Node Exporter: A standard tool for exposing hardware and OS metrics from a server.
  • AI-Watchdog metrics endpoint: Our custom AI application will expose its own operational metrics.

This comprehensive observability stack ensures that you have full visibility into your DNS server's performance and health, allowing for proactive management and rapid troubleshooting.

🤖 AI-Powered Self-Healing: The Intelligent Guardian

Beyond just monitoring, we aim for a system that can intelligently respond to issues. This is where our AI-Powered Self-Healing component comes into play.

AI-Watchdog: The Brains of the Operation

  • Platform: A custom Flask application will serve as the core of our AI-Watchdog. Flask is a lightweight Python web framework, perfect for building this kind of microservice.
  • Capabilities: The AI-Watchdog will possess several advanced capabilities:
    • Health check automation: It will go beyond simple ping checks, performing deeper, context-aware health checks on our DNS services.
    • Anomaly detection: By analyzing trends in metrics and logs, it can identify unusual patterns that might indicate an impending issue before it becomes critical.
    • Self-remediation actions: Upon detecting a problem, the watchdog can automatically trigger predefined recovery actions, such as restarting a failed container or service.
    • Prometheus metrics exposure: It will expose its own operational metrics to Prometheus, allowing us to monitor the watchdog itself.
    • Container restart logic: It will intelligently restart containers that have crashed or become unresponsive, leveraging Docker's API.
    • Service dependency awareness: It understands the relationships between services (e.g., Pi-hole depends on Unbound), ensuring that remediation actions are taken in the correct order.
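
To make the idea concrete, here is a heavily simplified Python sketch of what such a Flask watchdog could look like. The container names, probe targets, and port are illustrative assumptions, not the finished app.py:

# app.py - minimal AI-Watchdog sketch (illustrative, not the full implementation)
import socket
import docker                                   # pip install docker
from flask import Flask, jsonify
from prometheus_client import Counter, generate_latest

app = Flask(__name__)
client = docker.from_env()                      # requires /var/run/docker.sock mounted into the container

# Containers to watch and the DNS endpoint each one should answer on (assumed names/addresses)
WATCHED = {
    "pihole1": ("192.168.8.53", 53),
    "unbound1": ("192.168.8.240", 5335),
}

RESTARTS = Counter("watchdog_restarts_total", "Containers restarted by the watchdog", ["container"])

def dns_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap health probe: can we open a TCP connection to the resolver?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/health")
def health():
    report = {}
    for name, (host, port) in WATCHED.items():
        healthy = dns_port_open(host, port)
        report[name] = "up" if healthy else "down"
        if not healthy:
            # Self-remediation: restart the unhealthy container via the Docker API
            client.containers.get(name).restart()
            RESTARTS.labels(container=name).inc()
    return jsonify(report)

@app.route("/metrics")
def metrics():
    # Expose Prometheus metrics so the watchdog itself can be monitored
    return generate_latest(), 200, {"Content-Type": "text/plain; version=0.0.4"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

The real application would add anomaly detection and dependency-aware ordering on top of this skeleton; the sketch only shows the probe-then-remediate loop and the metrics endpoint.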

Automation Scripts: The Hands of the System

Supporting the AI-Watchdog will be a suite of automation scripts:

  • dns-health: A comprehensive script designed to perform a full suite of health checks on the DNS stack.
  • dns-check.sh: A quicker, command-line utility for immediate status probes.
  • Auto-restart on failure: Containers will be configured with restart policies, but the AI-Watchdog will provide more intelligent, context-aware restarts.
  • Intelligent recovery sequences: For more complex failures, the watchdog can orchestrate a sequence of actions to restore service.
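
To give a feel for the quick probe, here is one possible dns-check.sh sketch. The resolver IPs match the addressing plan used in this guide, and the secondary Pi-hole and VIP addresses are assumptions:

#!/usr/bin/env bash
# dns-check.sh - quick status probe for the DNS stack (illustrative sketch)
set -u

declare -A TARGETS=(
  [pihole1]="192.168.8.53 53"
  [pihole2]="192.168.8.54 53"        # assumption: secondary Pi-hole address
  [unbound1]="192.168.8.240 5335"
  [unbound2]="192.168.8.241 5335"
  [vip]="192.168.8.250 53"           # assumption: keepalived VIP
)

fail=0
for name in "${!TARGETS[@]}"; do
  read -r ip port <<< "${TARGETS[$name]}"
  if dig +time=2 +tries=1 -p "$port" @"$ip" example.com +short > /dev/null; then
    echo "OK   $name ($ip:$port)"
  else
    echo "FAIL $name ($ip:$port)"
    fail=1
  fi
done
exit "$fail"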

This AI-driven approach transforms the DNS server from a passive service into an active, intelligent system capable of maintaining its own health.

🚨 Alerting & Notifications: Staying Informed

Even with self-healing, it's crucial to be informed about the status of your infrastructure. Our alerting system ensures you're always in the loop.

Signal Integration: Real-time Updates

We'll integrate Alertmanager with Signal, a secure messaging application, to deliver timely notifications. These alerts will cover a range of events:

  • Service down alerts: Immediate notification when a critical service becomes unavailable.
  • Service up notifications: Confirmation when a service has been successfully restored.
  • Upgrade notifications: Alerts for available software updates for key components.
  • Performance degradation warnings: Notifying you if query latency increases or other performance metrics dip below acceptable thresholds.
  • Security events: Alerts for potential security concerns detected by the system.
  • Self-healing actions taken: Informing you when the AI-Watchdog has automatically resolved an issue.

This robust alerting system ensures that you are aware of both routine operations and any exceptional events requiring your attention.

📋 Implementation Checklist: Your Step-by-Step Guide

This project is broken down into manageable phases to guide you through the setup process. Each phase builds upon the previous one, ensuring a structured and systematic deployment.

Phase 1: Foundation ✅

This initial phase focuses on setting up the basic environment on your Raspberry Pi 5.

  • Install Raspberry Pi OS (Debian Trixie): Ensure you have the latest stable version of Raspberry Pi OS installed and updated. This provides a solid base for all subsequent installations.
  • Install Docker Engine 29 + containerd: Get the latest Docker Engine and its underlying runtime, containerd, up and running. This is essential for containerizing all our services. Make sure to configure it for rootful operation for optimal performance and flexibility.
  • Create directory structure /opt/stacks/{dns,observability,nebula}: Establish a clear and organized directory structure. This will house all configuration files and persistent data for your Docker services, making management and backups straightforward.
  • Configure Docker networks (dns_net, observability_net): Set up the necessary Docker networks. dns_net will be a macvlan network for your DNS services to have direct LAN IPs, while observability_net will be a bridge network for your monitoring stack.
  • Set up rootful Docker with proper permissions: Ensure Docker is running correctly with root privileges and that file permissions are set up to allow Docker to function optimally.
  • Remove deprecated compose syntax: Modern Docker Compose files do not require the version: key at the top level. Removing it ensures compatibility with newer Docker versions and adheres to current best practices.
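
In practice this phase boils down to a few commands; a rough sketch under this guide's assumptions (the macvlan creation itself is shown in the networking section above):

# Directory layout for all stacks
sudo mkdir -p /opt/stacks/{dns,observability,nebula}

# Bridge network for the monitoring stack (172.20.x.x range as described earlier)
docker network create -d bridge --subnet=172.20.0.0/24 observability_net

# Confirm both networks exist before deploying any services
docker network ls | grep -E 'dns_net|observability_net'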

Phase 2: DNS Infrastructure 🔄

This phase focuses on deploying and configuring the core DNS services.

  • Deploy Pi-hole v6 containers (primary + secondary):
    • Install two instances of Pi-hole using Docker. Configure them to use static IPs on the dns_net macvlan.
    • Configure blocklists (Hagezi, OISD): Populate Pi-hole with comprehensive ad and tracker blocklists for maximum effectiveness.
    • Set up regex filters: Implement custom regular expressions for more granular blocking control.
    • Assign static IPs on macvlan: Ensure each Pi-hole instance has a predictable IP address on your local network.
    • Configure upstream DNS (Unbound): Set Pi-hole instances to forward DNS queries to your local Unbound resolvers.
  • Deploy Unbound containers ×2: Set up two Unbound instances, also with static IPs on dns_net.
    • Configure DNSSEC validation: Enable DNSSEC to ensure the authenticity and integrity of DNS responses.
    • Create /opt/stacks/dns/unbound1 config & /opt/stacks/dns/unbound2 config: Define the Unbound configuration files, ensuring they are correctly mounted into the containers.
    • Bind mount configurations: Use Docker volumes to persist Unbound's configuration and cache.
    • Verify recursive resolution: Test that Unbound is correctly performing recursive lookups and validating DNSSEC.
  • Deploy CoreDNS proxy (if needed): If you decide to use CoreDNS for advanced routing or proxying, deploy it now.
  • Test DNS resolution end-to-end: Ensure that clients can resolve domain names through the Pi-hole/Unbound stack, and that Pi-hole is correctly filtering queries.
  • Configure DNS-over-HTTPS (optional): If desired, set up DoH for encrypted DNS queries to the internet.
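
For the end-to-end test step above, a few dig probes are usually enough to validate the chain (IPs as assigned earlier; the blocked-domain test assumes doubleclick.net appears on your blocklists):

# Does Unbound resolve recursively and validate DNSSEC? (expect status NOERROR and the "ad" flag)
dig @192.168.8.240 -p 5335 cloudflare.com +dnssec | grep -E 'status|flags'

# A deliberately broken DNSSEC domain should be rejected with SERVFAIL
dig @192.168.8.240 -p 5335 dnssec-failed.org | grep status

# Does Pi-hole answer clients and block ads? (blocked domains typically return 0.0.0.0)
dig @192.168.8.53 example.com +short
dig @192.168.8.53 doubleclick.net +short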

Phase 3: High Availability 📝

This phase brings redundancy and automatic failover to your DNS setup.

  • Install keepalived on both nodes: Deploy keepalived on each Raspberry Pi.
  • Configure VRRP for VIP floating: Set up keepalived to manage a Virtual IP address (VIP) that will be used by clients to access the DNS service. This VIP will automatically move between the active and standby nodes.
  • Define master/backup priorities: Configure keepalived's priority settings to determine which node should be the master under normal conditions.
  • Test VIP failover: Manually simulate a failure on the master node to verify that the VIP correctly transfers to the backup node and that DNS resolution continues with minimal interruption.
  • Configure health check scripts: Develop and configure scripts that keepalived can use to monitor the health of the critical DNS services (Pi-hole, Unbound). If a service fails, keepalived can initiate a failover.
  • Deploy Gravity Sync: Install and configure Gravity Sync to automatically synchronize the configuration, blocklists, and settings between the primary and secondary Pi-hole instances. This ensures consistency and enables seamless failover.
    • Configure primary→secondary sync: Set up the direction and schedule for synchronization.
    • Verify blocklist replication: Confirm that blocklists are correctly updated on both Pi-hole instances.
    • Test settings synchronization: Ensure that all configuration changes made on the primary are reflected on the secondary.
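
For the health check scripts step above, the script that keepalived tracks can stay tiny; it only has to exit non-zero when the local resolver chain is broken. A possible dns-health sketch (IPs are this guide's example addresses for node 1; adjust on node 2):

#!/usr/bin/env bash
# dns-health - exit 0 only if the local Pi-hole and Unbound both answer queries
# (referenced by the keepalived vrrp_script shown earlier)
set -u

dig +time=2 +tries=1 @192.168.8.53 example.com +short > /dev/null || exit 1          # local Pi-hole
dig +time=2 +tries=1 -p 5335 @192.168.8.240 example.com +short > /dev/null || exit 1 # local Unbound
exit 0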

Phase 4: Mesh Networking 📝

Secure your inter-node communication with Nebula.

  • Complete Nebula setup:
    • Generate CA certificate: Create a Certificate Authority (CA) to sign all other certificates within the Nebula network.
    • Create node certificates: Generate unique certificates for each Raspberry Pi node that will participate in the mesh.
    • Deploy lighthouse node: Configure one or more nodes to act as lighthouses, helping other nodes discover each other on the network.
    • Configure node-to-node connectivity: Set up the Nebula configuration files (config.yml) on each node, pointing them to the lighthouse(s) and defining their network roles.
    • Test encrypted overlay: Verify that nodes can ping and communicate with each other securely over the encrypted Nebula tunnel.
    • Document certificate management: Establish procedures for managing certificates, including renewal and revocation.

Phase 5: Observability 📝

Gain deep insights into your system's performance and health.

  • Deploy Prometheus: Set up Prometheus to collect metrics.
    • Create scrape configs: Define which targets (Pi-hole, Unbound, Node Exporter, AI-Watchdog) Prometheus should scrape and how often.
    • Configure retention: Set the duration for which Prometheus should store metrics data.
    • Set up service discovery: Configure Prometheus to automatically discover new targets as they are deployed.
  • Deploy Grafana: Install Grafana to visualize the collected metrics.
    • Create datasources: Connect Grafana to your Prometheus instance.
    • Import/create dashboards: Set up dashboards for monitoring:
      • DNS query dashboard: Visualizes query volume, latency, and cache hit rates.
      • Pi-hole statistics: Shows block rates, upstream queries, and client activity.
      • Unbound performance: Details DNSSEC validation success, cache usage, and recursive query times.
      • System resources: Monitors CPU, memory, disk I/O, and network usage on the Raspberry Pi hosts.
      • Service health overview: Provides a status summary of all critical services.
      • Network traffic: Monitors bandwidth usage.
    • Configure authentication: Secure your Grafana instance with user authentication.
  • Deploy Alertmanager: Configure Alertmanager for intelligent alerting.
    • Configure Signal webhook: Set up Alertmanager to send alerts to your Signal account or group.
    • Create alert rules: Define specific conditions that trigger alerts, such as:
      • Service down: Critical services are unresponsive.
      • Service recovered: Notification when a downed service is back online.
      • High query latency: DNS resolution times exceed acceptable limits.
      • High block rate change: Significant deviation in the number of blocked queries.
      • System resource alerts: CPU, RAM, or disk usage critical.
      • Certificate expiry: Alerts for upcoming expiration of SSL/TLS certificates.
  • Deploy Loki + Promtail: Set up centralized logging.
    • Configure log collection: Define which logs Promtail should collect from containers and the host.
    • Set retention policies: Configure how long logs are stored in Loki.
    • Create log dashboards: Visualize and search logs within Grafana.
  • Deploy exporters: Ensure all necessary exporters are running and configured correctly.
    • Fix Pi-hole v6 exporter: Address any issues with the Pi-hole exporter to ensure it provides accurate metrics.
    • Deploy Unbound exporter: Collect detailed Unbound statistics.
    • Verify Node Exporter: Confirm host metrics are being collected.
    • Configure custom exporters: Set up any other specific data collectors needed.
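
Tying the "create scrape configs" step above to something concrete, a prometheus.yml sketch might look like the following. The exporter ports are common defaults and are assumptions here; match them to your actual deployment:

# /opt/stacks/observability/prometheus/prometheus.yml - scrape sketch
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['<pi1_ip>:9100', '<pi2_ip>:9100']   # Node Exporter

  - job_name: pihole
    static_configs:
      - targets: ['<pihole_exporter_ip>:9617']        # Pi-hole exporter

  - job_name: unbound
    static_configs:
      - targets: ['<unbound_exporter_ip>:9167']       # Unbound exporter

  - job_name: ai-watchdog
    static_configs:
      - targets: ['ai-watchdog:8080']                 # watchdog /metrics endpoint (assumed port)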

Phase 6: AI-Watchdog 🤖

Implement the intelligent self-healing capabilities.

  • Develop AI-Watchdog application: Build the core logic for the watchdog.
    • Health check engine: Develop sophisticated health checks that go beyond basic connectivity.
    • Anomaly detection logic: Implement algorithms to spot unusual patterns in metrics and logs.
    • Self-healing actions: Define and code the automated responses to detected issues (e.g., container restarts, service reconfigurations).
    • Prometheus metrics endpoint: Expose the watchdog's own performance metrics.
    • Container restart logic: Use Docker's API to manage container lifecycles.
    • Service dependency graph: Model the relationships between services to ensure proper remediation sequencing.
  • Deploy AI-Watchdog container: Package the application and deploy it as a Docker container.
  • Configure remediation policies: Define thresholds and specific actions for different types of failures.
  • Test self-healing scenarios: Simulate various failure conditions to verify the watchdog's effectiveness:
    • DNS service crash: Ensure it restarts Pi-hole or Unbound.
    • Network partition: Test its resilience to connectivity issues.
    • Resource exhaustion: Verify it can detect and potentially alleviate resource constraints.
    • Configuration drift: Ensure it can detect unauthorized changes and revert them or alert.
  • Integrate with Prometheus alerts: Ensure that the watchdog's actions and status are reflected in your monitoring and alerting system.

Phase 7: Portainer & Management 📝

Simplify management with a user-friendly interface.

  • Deploy Portainer CE: Install Portainer as a Docker container, making it accessible via a web browser.
  • Configure stack management: Set up Portainer to manage your Docker Compose stacks efficiently.
  • Set up templates: Create reusable templates for common service deployments.
  • Configure role-based access: Define user roles and permissions to control access to different functionalities within Portainer.
  • Create backup procedures: Establish and document robust backup strategies for your configurations and persistent data, leveraging Portainer's capabilities or custom scripts.

Phase 8: Security Hardening 🔒

Ensure your DNS infrastructure is secure against threats.

  • Configure firewall rules: Implement strict firewall rules on the Raspberry Pi hosts to allow only necessary traffic.
  • Harden Docker daemon: Apply security best practices to the Docker daemon configuration, such as disabling unnecessary features and restricting privileges.
  • Secure Nebula mesh: Ensure your Nebula configuration is robust, using strong encryption and proper certificate management.
  • Implement certificate rotation: Establish a process for regularly rotating all certificates (Nebula, potentially SSL/TLS for services) to mitigate risks associated with compromised keys.
  • Enable audit logging: Configure system and Docker audit logging to track all actions performed on the system.
  • Configure fail2ban (optional): Install and configure fail2ban to protect against brute-force attacks on SSH and other services.
  • Regular security updates: Maintain a schedule for applying security patches and updates to the operating system, Docker, and all deployed applications.
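
As an illustration of the firewall step, a ufw-based host policy could start roughly like this (a sketch that assumes ufw is installed; ports follow the services described in this guide, with sources limited to the LAN):

# Default deny inbound, then allow only what the stack needs
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.8.0/24 to any port 22 proto tcp      # SSH from the LAN only
sudo ufw allow from 192.168.8.0/24 to any port 53                # DNS (TCP and UDP)
sudo ufw allow 4242/udp                                          # Nebula overlay
sudo ufw allow from 192.168.8.0/24 to any port 80,443 proto tcp  # web UIs; add 3000/9443 etc. as needed
sudo ufw enable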

Phase 9: Testing & Validation ✅

Thorough testing is essential to guarantee reliability and performance.

  • End-to-end DNS resolution tests: Perform comprehensive tests to ensure DNS resolution works correctly for all types of queries from various clients.
  • Failover testing (kill primary): Simulate the failure of the primary DNS server (e.g., by stopping its container or rebooting the Pi) and verify that the VIP is correctly transferred and that clients experience minimal interruption.
  • Load testing (query volume): Use tools to simulate high DNS query loads to assess the system's performance under stress and identify potential bottlenecks.
  • Network partition simulation: Test how the system behaves if network connectivity is temporarily lost between nodes or between clients and servers.
  • Self-healing verification: Trigger failure conditions deliberately and confirm that the AI-Watchdog correctly detects and resolves them.
  • Alert notification testing: Ensure that all defined alerts are triggered correctly and arrive via your configured notification channel (e.g., Signal).
  • Backup/restore procedures: Test your backup and restore processes to ensure data can be recovered effectively.
  • Documentation review: Validate that all documentation is accurate, complete, and easy to understand.
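
The failover test in particular is easy to script by hand; a rough sketch, with the VIP address assumed as in the keepalived example earlier:

# From a client: watch resolution against the VIP while the primary is killed
watch -n1 'dig +time=1 +tries=1 @192.168.8.250 example.com +short'

# On the primary node: simulate a failure
docker stop pihole1 unbound1          # or: sudo systemctl stop keepalived

# On the backup node: confirm it now holds the VIP
ip addr show eth0 | grep 192.168.8.250

# Restore the primary and confirm the VIP returns (with default preemption)
docker start pihole1 unbound1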

Phase 10: Documentation 📚

Comprehensive documentation is vital for future maintenance and knowledge transfer.

  • Architecture diagrams: Create visual representations of the entire system, including network topology, service dependencies, and data flows.
  • Installation guide: A detailed walkthrough of the entire setup process, suitable for replication.
  • Configuration reference: A comprehensive guide to all configuration files and their parameters.
  • Operational runbook: Step-by-step procedures for common operational tasks, such as restarting services, performing backups, and monitoring.
  • Troubleshooting guide: A knowledge base of common issues and their solutions.
  • Backup/restore procedures: Clear instructions on how to perform backups and restore the system from backups.
  • Security best practices: Guidelines for maintaining a secure DNS infrastructure.
  • Performance tuning guide: Tips and techniques for optimizing DNS resolution speed and efficiency.

πŸ“ Directory Structure: Organized for Efficiency

A well-organized directory structure is fundamental for managing a complex setup like this. We'll maintain a clear hierarchy under /opt/stacks/ to keep all configurations and data tidy and easily accessible.

/opt/stacks/
├── dns/
│   ├── docker-compose.yml
│   ├── pihole1/
│   │   ├── etc-pihole/
│   │   └── etc-dnsmasq.d/
│   ├── pihole2/
│   │   ├── etc-pihole/
│   │   └── etc-dnsmasq.d/
│   ├── unbound1/
│   │   └── unbound.conf
│   ├── unbound2/
│   │   └── unbound.conf
│   ├── coredns/          # Optional CoreDNS configuration
│   └── gravity-sync/     # Gravity Sync configuration if not containerized differently
├── observability/
│   ├── docker-compose.yml
│   ├── prometheus/
│   │   └── prometheus.yml
│   ├── grafana/
│   │   └── dashboards/     # Grafana dashboard JSON files
│   ├── alertmanager/
│   │   └── alertmanager.yml
│   ├── loki/
│   └── promtail/
├── nebula/
│   ├── docker-compose.yml
│   ├── ca.crt
│   ├── ca.key
│   ├── node1.crt
│   ├── node1.key
│   ├── node2.crt
│   ├── node2.key
│   └── config.yml      # Nebula configuration file
├── ai-watchdog/
│   ├── docker-compose.yml
│   ├── app.py
│   ├── requirements.txt
│   └── config/         # AI-Watchdog specific configurations
├── portainer/
│   └── docker-compose.yml
└── scripts/
    ├── dns-health      # Health check script
    ├── dns-check.sh    # Quick check script
    └── backup.sh       # Backup automation script

This structure separates concerns logically, with dedicated directories for DNS, observability, networking, AI, and management tools. Each subdirectory contains its respective docker-compose.yml file and any necessary configuration or persistent data volumes.

🔧 Key Configuration Files: The Heart of the Setup

While many configuration files are involved, the docker-compose.yml files and the core service configurations are paramount. Let's look at some examples:

Docker Compose Examples

DNS Stack (/opt/stacks/dns/docker-compose.yml)

This compose file defines how our Pi-hole and Unbound services are run within Docker. Note the use of macvlan networks and volume mounts for persistent data.

services:
  pihole1:
    image: pihole/pihole:latest
    container_name: pihole1
    networks:
      dns_net:
        ipv4_address: 192.168.8.53
    environment:
      - TZ=Europe/Athens
      - FTLCONF_webserver_api_password=<your_secure_password>
      - FTLCONF_dns_upstreams=192.168.8.240#5335  # Forward to Unbound instance 1 on its dedicated port; append ;1.1.1.1 if you want a public fallback
    volumes:
      - './pihole1/etc-pihole:/etc/pihole'
      - './pihole1/etc-dnsmasq.d:/etc/dnsmasq.d'
    restart: unless-stopped

  unbound1:
    image: mvance/unbound-rpi:latest
    container_name: unbound1
    networks:
      dns_net:
        ipv4_address: 192.168.8.240
    volumes:
      - './unbound1:/etc/unbound'
    restart: unless-stopped

# Add pihole2 and unbound2 similarly, adjusting IPs and configurations

networks:
  dns_net:
    external: true
    # Ensure this network is created beforehand: 
    # docker network create -d macvlan --subnet=192.168.8.0/24 --gateway=192.168.8.1 -o parent=eth0 dns_net

Unbound Configuration (/opt/stacks/dns/unbound1/unbound.conf)

This configuration enables DNSSEC validation and tunes cache sizes. No upstream forwarders are defined, because Unbound performs full recursion starting from the root servers itself.

server:
  # Basic settings
  interface: 0.0.0.0
  port: 5335 # Using 5335 to differentiate from standard 53, allowing Pi-hole to use it explicitly
  access-control: 192.168.8.0/24 allow # Allow queries from your LAN subnet
  do-ip4: yes
  do-ip6: no
  do-udp: yes
  do-tcp: yes

  # DNSSEC validation
  auto-trust-anchor-file: "/etc/unbound/root.key"
  harden-dnssec-stripped: yes
  harden-offline: yes
  harden-algo-downgrade: yes

  # Caching
  cache-min-ttl: 3600
  cache-max-ttl: 86400
  msg-cache-size: 64m
  rrset-cache-size: 64m

  # Logging (optional, can be verbose)
  # verbosity: 1

  # Privacy (optional)
  hide-identity: yes
  hide-version: yes
  qname-minimisation: yes

# Forwarding (usually not needed, since Unbound resolves recursively on its own)
# forward-zone:
#  name: "."
#  forward-tls-upstream: yes
#  forward-addr: 1.1.1.1@853#cloudflare-dns.com # Example for DNS-over-TLS forwarding

Nebula Configuration (/opt/stacks/nebula/config.yml)

This is a simplified example; a production configuration would tighten the firewall section and may add static routes, relays, and other options.

# PKI: the CA certificate plus this node's certificate and key
# (the node's overlay IP, e.g. 10.0.0.2/24, is embedded in the certificate when it is signed)
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/node1.crt
  key: /etc/nebula/node1.key

# Map lighthouse overlay IPs to their real, routable addresses
static_host_map:
  "10.0.0.1": ["<lighthouse_ip_or_dns>:4242"]

# This node is a regular node; on the lighthouse set am_lighthouse: true and leave hosts empty
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "10.0.0.1"

# Public listen address and port
listen:
  host: 0.0.0.0
  port: 4242

# Keep NAT mappings alive between nodes
punchy:
  punch: true

# Overlay tunnel device
tun:
  dev: nebula1

# Firewall for overlay traffic (wide open here; tighten for production)
firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      host: any

🚨 Alert Rules: Defining Critical Thresholds

Well-defined alert rules are crucial for timely intervention. Here are some examples categorized by severity:

Critical Alerts

These alerts indicate a severe issue that requires immediate attention.

  • DNS service unavailable: Triggered when a core DNS service (Pi-hole, Unbound) becomes unreachable via its health check.
  • VIP failover occurred: Notifies you when the floating IP address has switched nodes, indicating a potential issue with the primary node.
  • Certificate expiring (<7 days): Alerts you well in advance of any critical certificates (like Nebula node certificates) expiring, preventing outages.
  • Disk space critical (<10% free): Warns of critically low disk space, which can lead to service failures and data loss.
  • Memory usage critical (>90% sustained): Indicates the system is running out of RAM, potentially causing performance degradation or crashes.
  • Self-healing action failed: Alerts you if the AI-Watchdog attempts to fix an issue but fails to resolve it.

Warning Alerts

These alerts highlight potential problems or deviations from normal operation.

  • High query latency (>100ms p95): Notifies you if the 95th percentile of DNS query response times exceeds a predefined threshold, indicating potential performance issues.
  • Increased block rate (>20% change): Alerts if there's a significant spike or drop in the ad/tracker block rate, which could indicate a change in network traffic or a misconfiguration.
  • Service restart detected: Notifies you if a critical service container restarts unexpectedly, prompting investigation.
  • Configuration drift detected: Alerts if the AI-Watchdog or another monitoring mechanism detects unauthorized or unexpected changes to critical configurations.
  • Backup failed: Informs you if the scheduled backup process fails to complete successfully.
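
Translated into Prometheus rule syntax, two of the alerts above might be sketched as follows; the job and metric names depend on the exporters you deploy and are assumptions here:

groups:
  - name: dns-alerts
    rules:
      - alert: DNSServiceDown
        expr: up{job=~"pihole|unbound"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "DNS exporter target {{ $labels.instance }} is unreachable"

      - alert: HighQueryLatency
        expr: histogram_quantile(0.95, rate(dns_query_duration_seconds_bucket[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 DNS query latency above 100ms (assumed metric name)"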

📊 Success Metrics: Measuring Effectiveness

To gauge the success of our HA encrypted DNS server, we'll track several key performance indicators (KPIs):

  • Availability: Aiming for >99.9% uptime. This means the DNS service should be accessible almost all the time, with minimal planned or unplanned downtime.
  • Query latency: Target <50ms p95. The 95th percentile of DNS query response times should remain low, ensuring a responsive browsing experience.
  • Failover time: Target <5 seconds. The duration from the failure of the active node to the activation of the backup node should be minimal.
  • Self-healing success rate: Aiming for >95%. The AI-Watchdog should successfully resolve the vast majority of detected issues automatically.
  • Alert accuracy: Target <5% false positives. Alerts should be reliable, minimizing unnecessary noise.
  • Block effectiveness: Target >90% of common ad/tracking domains blocked. This ensures the primary function of Pi-hole is effectively achieved.

🔗 Resources: Diving Deeper

For further information and detailed documentation on the technologies used in this project, please refer to the following official resources:

  • Pi-hole Documentation: The official documentation for setting up and managing Pi-hole.
  • Unbound Documentation: Comprehensive guides on configuring and using the Unbound recursive resolver.
  • Nebula Documentation: Official documentation for Nebula, covering installation, configuration, and advanced networking.
  • Prometheus Alerting: Detailed information on Prometheus's alerting capabilities and Alertmanager integration.
  • Grafana Dashboards: A repository of community-created Grafana dashboards, useful for inspiration and pre-built visualizations.
  • keepalived Documentation: Official documentation for keepalived, essential for understanding VRRP and high availability configurations.

πŸ“ Notes: Project Status and Recent Changes

Current Status

  • ✅ Docker Engine 29 running (rootful): The containerization foundation is in place.
  • ✅ Directory structure created: /opt/stacks/ is organized and ready.
  • ✅ Networks configured (dns_net, observability_net): Core networking for services is established.
  • ✅ Nebula CA and certificates generated: The basis for secure mesh networking is ready.
  • ✅ Switched to mvance/unbound-rpi image: Utilizing an efficient Unbound image for Raspberry Pi.
  • 🔄 Pi-hole v6 exporter troubleshooting in progress: Finalizing metrics collection for Pi-hole.
  • 📝 Gravity Sync not yet installed: This crucial component for Pi-hole HA needs to be deployed.
  • 📝 keepalived configuration pending: Setting up the high-availability mechanism requires careful configuration.

Recent Fixes

  • Removed rootless Docker remnants, ensuring stable rootful operation.
  • Fixed /run/docker.sock handling, ensuring it's treated as a socket, not a directory.
  • Removed deprecated version: key from compose files for modern compatibility.
  • Switched from ghcr.io/madnuttah/unbound to the public mvance/unbound-rpi image for better availability and standardization.

Target Timeline: 4-6 weeks for full deployment and testing. This ambitious project aims to deliver a highly resilient, secure, and intelligently managed DNS infrastructure powered by the Raspberry Pi 5. By following these steps, you'll create a robust system that offers privacy, security, and unparalleled uptime.


For further insights into network security and DNS best practices, consider exploring resources from the Internet Society at internetsociety.org and the Electronic Frontier Foundation (EFF) at eff.org. These organizations provide valuable information on digital privacy, security, and the future of the internet.
