Monitoring Setup¶

This guide covers comprehensive monitoring setup for Nexus deployments, including metrics collection, alerting, logging, and observability best practices.

Overview¶

Monitoring is crucial for maintaining the health, performance, and reliability of Nexus deployments. This documentation provides guidance for setting up monitoring across different deployment scenarios using industry-standard tools.

Monitoring Stack Components¶

Core Components¶

Metrics Collection: Prometheus, InfluxDB, or cloud-native solutions
Visualization: Grafana, Kibana, or cloud dashboards
Alerting: Alertmanager, PagerDuty, or cloud alerting
Log Aggregation: ELK Stack, Fluentd, or cloud logging
Distributed Tracing: Jaeger, Zipkin, or cloud tracing
Uptime Monitoring: Pingdom, UptimeRobot, or synthetic monitoring

Prometheus Setup¶

Installation¶

Docker Deployment¶

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: nexus-prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'

  alertmanager:
    image: prom/alertmanager:latest
    container_name: nexus-alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager

  node-exporter:
    image: prom/node-exporter:latest
    container_name: nexus-node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus_data:
  alertmanager_data:
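
Once the containers are running, each service can be probed over HTTP. The sketch below is one way to do that, assuming the published ports are reachable on `localhost` and Node 18+ (for the global `fetch`). `ENDPOINTS` and `checkStack` are hypothetical names; the `/-/ready` paths are the readiness endpoints that Prometheus and Alertmanager expose.

```javascript
// Hypothetical helper: probe the services defined in the Compose file.
// Assumes the published ports are reachable on localhost and Node 18+.
const ENDPOINTS = {
  prometheus: 'http://localhost:9090/-/ready',
  alertmanager: 'http://localhost:9093/-/ready',
  'node-exporter': 'http://localhost:9100/metrics',
};

// Resolves to { serviceName: 'up' | 'down (<reason>)' } for every endpoint.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function checkStack(endpoints = ENDPOINTS, fetchImpl = fetch) {
  const results = {};
  for (const [name, url] of Object.entries(endpoints)) {
    try {
      const res = await fetchImpl(url);
      results[name] = res.ok ? 'up' : `down (HTTP ${res.status})`;
    } catch (err) {
      results[name] = `down (${err.message})`;
    }
  }
  return results;
}
```

Calling `checkStack().then(console.table)` from a script prints one row per service.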

Prometheus Configuration¶

Create prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'nexus-monitor'

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nexus'
    static_configs:
      - targets: ['nexus:8080']
    metrics_path: '/metrics'
    scrape_interval: 30s

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'nginx-exporter'
    static_configs:
      - targets: ['nginx-exporter:9113']

Alert Rules¶

Create rules/nexus-alerts.yml:

groups:
  - name: nexus.rules
    rules:
      - alert: NexusDown
        expr: up{job="nexus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Nexus instance is down"
          description: "Nexus instance {{ $labels.instance }} has been down for more than 1 minute."

      - alert: NexusHighCPU
        expr: (100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: NexusHighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90% for 5 minutes on {{ $labels.instance }}"

      - alert: NexusHighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage"
          description: "Disk usage is above 85% on {{ $labels.instance }}"

      - alert: NexusHighErrorRate
        expr: sum(rate(nexus_http_requests_total{status=~"5.."}[5m])) / sum(rate(nexus_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is above 5% for 5 minutes"

      - alert: NexusSlowResponse
        expr: histogram_quantile(0.95, sum by (le) (rate(nexus_http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times"
          description: "95th percentile response time is above 1 second"

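      # Requires the application to export nexus_db_connections_max in addition
      # to nexus_db_connections_active.
      - alert: DatabaseConnectionHigh
        expr: nexus_db_connections_active / nexus_db_connections_max > 0.8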
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection usage"
          description: "Database connection pool is 80% utilized"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis memory usage high"
          description: "Redis memory usage is above 90%"
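
The NexusHighErrorRate rule compares the rate of 5xx responses with the rate of all responses. Because the two sides carry different label sets, the ratio is only meaningful once both are aggregated. The following sketch walks through that arithmetic with made-up counter samples. `simpleRate` is a deliberate simplification: Prometheus's `rate()` also handles counter resets and extrapolation.

```javascript
// Illustration only: per-second rate between two counter samples.
// Prometheus's rate() also handles counter resets and extrapolation.
function simpleRate(first, last) {
  return (last.value - first.value) / (last.timestamp - first.timestamp);
}

// First and last samples of each series in a 5-minute (300 s) window.
// The numbers are invented for the example.
const series = [
  { labels: { route: '/api', status: '200' }, first: { timestamp: 0, value: 1000 }, last: { timestamp: 300, value: 3700 } },
  { labels: { route: '/api', status: '500' }, first: { timestamp: 0, value: 10 }, last: { timestamp: 300, value: 160 } },
  { labels: { route: '/login', status: '503' }, first: { timestamp: 0, value: 0 }, last: { timestamp: 300, value: 150 } },
];

// Same idea as:
//   sum(rate(nexus_http_requests_total{status=~"5.."}[5m]))
//     / sum(rate(nexus_http_requests_total[5m]))
function errorRatio(allSeries) {
  let errors = 0;
  let total = 0;
  for (const s of allSeries) {
    const perSecond = simpleRate(s.first, s.last);
    total += perSecond;
    if (/^5..$/.test(s.labels.status)) errors += perSecond;
  }
  return total === 0 ? 0 : errors / total;
}

const ratio = errorRatio(series);
console.log(`error ratio: ${(ratio * 100).toFixed(1)}%`, ratio > 0.05 ? '(would alert)' : '(ok)');
```

Here 1 of 10 requests per second fails, so the ratio is 10% and exceeds the 5% threshold.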

Grafana Setup¶

Installation¶

grafana:
  image: grafana/grafana:latest
  container_name: nexus-grafana
  restart: unless-stopped
  ports:
    - "3000:3000"
  environment:
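    # Supply the admin password through the environment; do not commit it.
    - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:?set an admin password}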
    - GF_USERS_ALLOW_SIGN_UP=false
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
    - ./grafana/dashboards:/var/lib/grafana/dashboards

volumes:
  grafana_data:

Dashboard Configuration¶

Create grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1

providers:
  - name: 'nexus-dashboards'
    orgId: 1
    folder: 'Nexus'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards

Create grafana/provisioning/datasources/datasource.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Nexus Dashboard¶

Create grafana/dashboards/nexus-overview.json:

{
  "id": null,
  "title": "Nexus Overview",
  "tags": ["nexus"],
  "timezone": "browser",
  "panels": [
    {
      "id": 1,
      "title": "Request Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(nexus_http_requests_total[5m]))",
          "legendFormat": "Requests/sec"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps"
        }
      }
    },
    {
      "id": 2,
      "title": "Response Time",
      "type": "stat",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum by (le) (rate(nexus_http_request_duration_seconds_bucket[5m])))",
          "legendFormat": "95th percentile"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s"
        }
      }
    },
    {
      "id": 3,
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(nexus_http_requests_total{status=~\"5..\"}[5m])) / sum(rate(nexus_http_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent"
        }
      }
    },
    {
      "id": 4,
      "title": "Active Users",
      "type": "stat",
      "targets": [
        {
          "expr": "nexus_active_users",
          "legendFormat": "Active Users"
        }
      ]
    }
  ],
  "time": {
    "from": "now-1h",
    "to": "now"
  },
  "refresh": "30s"
}

Application Metrics¶

Metrics Implementation¶

Add to your Nexus application:

const promClient = require('prom-client');

// Create metrics registry
const register = new promClient.Registry();

// Default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'nexus_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register]
});

const httpRequestDuration = new promClient.Histogram({
  name: 'nexus_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const activeUsers = new promClient.Gauge({
  name: 'nexus_active_users',
  help: 'Number of active users',
  registers: [register]
});

const dbConnections = new promClient.Gauge({
  name: 'nexus_db_connections_active',
  help: 'Number of active database connections',
  registers: [register]
});

// Middleware for metrics collection
function metricsMiddleware(req, res, next) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Label with the matched route pattern; raw paths would create unbounded label values
    const route = req.route?.path || 'unmatched';

    httpRequestsTotal
      .labels(req.method, route, String(res.statusCode))
      .inc();

    httpRequestDuration
      .labels(req.method, route)
      .observe(duration);
  });

  next();
}

// Metrics endpoint handler; mount it with app.get('/metrics', metricsHandler)
async function metricsHandler(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}

module.exports = {
  register,
  httpRequestsTotal,
  httpRequestDuration,
  activeUsers,
  dbConnections,
  metricsMiddleware,
  metricsHandler
};
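
To see how the `buckets` option shapes what Prometheus can report, the sketch below models cumulative buckets and the linear interpolation that `histogram_quantile()` applies to them. It is an illustration under simplifying assumptions, valid for 0 < q ≤ 1; `makeHistogram` and `estimateQuantile` are hypothetical names, not prom-client APIs.

```javascript
// Bucket upper bounds from the histogram definition, plus the implicit +Inf bucket.
const BOUNDS = [0.1, 0.5, 1, 2, 5, Infinity];

function makeHistogram(bounds = BOUNDS) {
  const counts = bounds.map(() => 0);
  return {
    bounds,
    counts,
    // Buckets are cumulative: an observation counts in every bucket whose
    // upper bound (le) is greater than or equal to the value.
    observe(value) {
      bounds.forEach((le, i) => {
        if (value <= le) counts[i] += 1;
      });
    },
  };
}

// Linear interpolation inside the bucket that contains the requested rank.
function estimateQuantile(q, { bounds, counts }) {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = counts.findIndex((c) => c >= rank);
  // Values beyond the largest finite bound cannot be resolved any further.
  if (bounds[i] === Infinity) return bounds[i - 1];
  const lower = i === 0 ? 0 : bounds[i - 1];
  const below = i === 0 ? 0 : counts[i - 1];
  return lower + (bounds[i] - lower) * ((rank - below) / (counts[i] - below));
}

const h = makeHistogram();
[0.05, 0.2, 0.3, 0.7, 3].forEach((seconds) => h.observe(seconds));
console.log(h.counts); // [ 1, 3, 4, 4, 5, 5 ]
console.log(estimateQuantile(0.95, h)); // roughly 4.25
```

The slowest request took 3 seconds, yet the estimate is about 4.25 because it is interpolated across the wide 2 to 5 second bucket. Bucket bounds placed near the alert threshold (1 second here) keep the estimate accurate where it matters.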

Alerting Setup¶

Alertmanager Configuration¶

Create alertmanager.yml:

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@nexus.example.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@nexus.example.com'
        headers:
          Subject: 'Nexus Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Nexus Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
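
The inhibit rule is easy to misread, so here is a simplified model of how it is evaluated. It is not Alertmanager's implementation, and the critical NexusHighDiskUsage alert in the sample data is hypothetical (the rules in this guide define only a warning tier for it).

```javascript
// Simplified model of the inhibit rule above.
const rule = {
  sourceMatch: { severity: 'critical' },
  targetMatch: { severity: 'warning' },
  equal: ['alertname', 'dev', 'instance'],
};

const matches = (labels, matcher) =>
  Object.entries(matcher).every(([name, value]) => labels[name] === value);

// A missing label is treated like an empty value, so labels absent from
// both alerts (such as `dev` here) still compare as equal.
const sameOn = (a, b, names) =>
  names.every((name) => (a[name] ?? '') === (b[name] ?? ''));

// True when some firing source alert suppresses the target alert.
function isInhibited(target, firingAlerts, r = rule) {
  if (!matches(target, r.targetMatch)) return false;
  return firingAlerts.some(
    (source) =>
      source !== target &&
      matches(source, r.sourceMatch) &&
      sameOn(source, target, r.equal)
  );
}

const firing = [
  { alertname: 'NexusDown', severity: 'critical', instance: 'nexus:8080' },
  // Hypothetical critical tier of the disk alert.
  { alertname: 'NexusHighDiskUsage', severity: 'critical', instance: 'node-exporter:9100' },
];
const diskWarning = { alertname: 'NexusHighDiskUsage', severity: 'warning', instance: 'node-exporter:9100' };
const cpuWarning = { alertname: 'NexusHighCPU', severity: 'warning', instance: 'node-exporter:9100' };

console.log(isInhibited(diskWarning, firing)); // true
console.log(isInhibited(cpuWarning, firing)); // false
```

Because `alertname` is listed in `equal`, a critical alert only silences a warning of the same name on the same instance. A critical NexusDown does not suppress an unrelated NexusHighCPU warning.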

PagerDuty Integration¶

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          summary: '{{ .CommonAnnotations.summary }}'
          description: '{{ .CommonAnnotations.description }}'
          severity: '{{ .CommonLabels.severity }}'

Log Management¶

ELK Stack Setup¶

Elasticsearch¶

elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
  container_name: nexus-elasticsearch
  environment:
    - discovery.type=single-node
    - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    - xpack.security.enabled=false
  ports:
    - "9200:9200"
  volumes:
    - elasticsearch_data:/usr/share/elasticsearch/data

volumes:
  elasticsearch_data:

Logstash¶

logstash:
  image: docker.elastic.co/logstash/logstash:8.5.0
  container_name: nexus-logstash
  ports:
    - "5044:5044"
  volumes:
    - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
    - ./logstash/config:/usr/share/logstash/config:ro
  depends_on:
    - elasticsearch

Create logstash/pipeline/nexus.conf:

input {
  beats {
    port => 5044
  }
}

filter {
  if [service] == "nexus" {
    json {
      source => "message"
    }

    date {
      match => [ "timestamp", "ISO8601" ]
    }

    mutate {
      add_tag => [ "nexus" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "nexus-%{+YYYY.MM.dd}"
  }
}
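
For reasoning about what the three filters inside the conditional do to an event, the following JavaScript model may help. It is a sketch, not Logstash code: `applyNexusFilters` is a hypothetical name, the conditional itself is left out, and only JSON objects are handled. The `_jsonparsefailure` and `_dateparsefailure` tags are the ones Logstash adds when the respective filter fails.

```javascript
// Simplified model of the json, date, and mutate filters in the pipeline.
function applyNexusFilters(event) {
  const out = { ...event, tags: [...(event.tags || [])] };

  // json { source => "message" }: merge the parsed object into the event root.
  try {
    const parsed = JSON.parse(out.message);
    if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
      throw new TypeError('message is not a JSON object');
    }
    Object.assign(out, parsed);
  } catch {
    out.tags.push('_jsonparsefailure');
  }

  // date { match => ["timestamp", "ISO8601"] }: set @timestamp from the log line.
  if (typeof out.timestamp === 'string') {
    const parsedDate = new Date(out.timestamp);
    if (Number.isNaN(parsedDate.getTime())) {
      out.tags.push('_dateparsefailure');
    } else {
      out['@timestamp'] = parsedDate.toISOString();
    }
  }

  // mutate { add_tag => ["nexus"] }
  out.tags.push('nexus');
  return out;
}

const line = JSON.stringify({
  level: 'info',
  message: 'HTTP Request',
  timestamp: '2024-01-01T00:00:00.000Z',
});
console.log(applyNexusFilters({ message: line }));
console.log(applyNexusFilters({ message: 'plain text line' }).tags);
```

Note that the parsed `message` field replaces the raw JSON string, so the indexed document carries the human-readable message.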

Kibana¶

kibana:
  image: docker.elastic.co/kibana/kibana:8.5.0
  container_name: nexus-kibana
  ports:
    - "5601:5601"
  environment:
    - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
  depends_on:
    - elasticsearch

Filebeat Configuration¶

Create filebeat.yml:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nexus/*.log
    fields:
      service: nexus
    fields_under_root: true

output.logstash:
  hosts: ["logstash:5044"]

processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded

Application Logging¶

Structured Logging¶

const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'nexus',
    environment: process.env.NODE_ENV
  },
  transports: [
    new winston.transports.File({
      filename: '/var/log/nexus/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: '/var/log/nexus/combined.log'
    })
  ]
});

if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: winston.format.simple()
  }));
}

module.exports = logger;

Request Logging Middleware¶

const logger = require('./logger');

function requestLogger(req, res, next) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;

    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      status: res.statusCode,
      duration,
      userAgent: req.get('User-Agent'),
      ip: req.ip,
      userId: req.user?.id
    });
  });

  next();
}

module.exports = requestLogger;
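
A middleware like this can be exercised without starting an HTTP server. The sketch below assumes a variant, `createRequestLogger`, that receives the logger as an argument so a test can capture what is written. That factory and the stand-in `req`/`res` objects are hypothetical, not part of the module above.

```javascript
const { EventEmitter } = require('events');

// Hypothetical variant of requestLogger with an injectable logger.
function createRequestLogger(logger) {
  return function requestLogger(req, res, next) {
    const start = Date.now();
    res.on('finish', () => {
      logger.info('HTTP Request', {
        method: req.method,
        url: req.url,
        status: res.statusCode,
        duration: Date.now() - start,
        userAgent: req.get('User-Agent'),
        ip: req.ip,
        userId: req.user?.id,
      });
    });
    next();
  };
}

// Minimal stand-ins for the request and response objects.
const entries = [];
const fakeLogger = { info: (message, meta) => entries.push({ message, ...meta }) };
const req = { method: 'GET', url: '/health', ip: '127.0.0.1', get: () => 'curl/8.0' };
const res = Object.assign(new EventEmitter(), { statusCode: 200 });

let nextCalled = false;
createRequestLogger(fakeLogger)(req, res, () => { nextCalled = true; });
res.emit('finish'); // the response object emits this once the response is sent

console.log(nextCalled, entries[0].status, entries[0].userId); // true 200 undefined
```

The entry is written only after `finish` fires, which is why the status code and duration reflect the completed response.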

Distributed Tracing¶

Jaeger Setup¶

jaeger:
  image: jaegertracing/all-in-one:latest
  container_name: nexus-jaeger
  ports:
    - "16686:16686"
    - "14268:14268"
  environment:
    - COLLECTOR_OTLP_ENABLED=true

Application Tracing¶

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'nexus',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
});

const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(jaegerExporter));
provider.register();

Health Checks¶

Application Health Endpoint¶

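const express = require('express');
const app = express();

// `db` (for example a Knex instance) and `redis` (a Redis client) are assumed
// to be initialized elsewhere in the application.

// Liveness: confirms the process is serving requests. The check values below
// are static placeholders; real dependency checks happen in /ready.
app.get('/health', (req, res) => {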
  const health = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    checks: {
      database: 'ok',
      redis: 'ok',
      external_apis: 'ok'
    }
  };

  res.json(health);
});

app.get('/ready', async (req, res) => {
  try {
    // Check database connection
    await db.raw('SELECT 1');

    // Check Redis connection
    await redis.ping();

    res.json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({
      status: 'not ready',
      error: error.message
    });
  }
});
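
A readiness probe that hangs on a slow dependency is of little use to an orchestrator. One possible refinement, sketched below, runs the checks in parallel with a per-check timeout and reports each result. `withTimeout`, `runChecks`, and the 2000 ms default are assumptions for illustration.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to a function returning a promise that
// rejects on failure, for example { redis: () => redis.ping() }.
async function runChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(Promise.resolve().then(checks[name]), timeoutMs))
  );
  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] =
      outcome.status === 'fulfilled'
        ? 'ok'
        : `failed: ${outcome.reason?.message ?? outcome.reason}`;
  });
  const ready = settled.every((outcome) => outcome.status === 'fulfilled');
  return { status: ready ? 'ready' : 'not ready', checks: results };
}

// Possible use inside the /ready handler:
//   const result = await runChecks({
//     database: () => db.raw('SELECT 1'),
//     redis: () => redis.ping(),
//   });
//   res.status(result.status === 'ready' ? 200 : 503).json(result);
```

Reporting each dependency separately makes a 503 response actionable without reading the application logs.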

Cloud Monitoring¶

AWS CloudWatch¶

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

function publishMetric(metricName, value, unit = 'Count') {
  const params = {
    Namespace: 'Nexus/Application',
    MetricData: [
      {
        MetricName: metricName,
        Value: value,
        Unit: unit,
        Timestamp: new Date()
      }
    ]
  };

  cloudwatch.putMetricData(params).promise()
    .catch(err => console.error('CloudWatch error:', err));
}

Google Cloud Monitoring¶

const monitoring = require('@google-cloud/monitoring');
const client = new monitoring.MetricServiceClient();

async function publishMetric(metricType, value) {
  const projectId = process.env.GOOGLE_CLOUD_PROJECT;
  const projectPath = client.projectPath(projectId);

  const dataPoint = {
    interval: {
      endTime: {
        seconds: Date.now() / 1000,
      },
    },
    value: {
      doubleValue: value,
    },
  };

  const timeSeries = {
    metric: {
      type: `custom.googleapis.com/nexus/${metricType}`,
    },
    resource: {
      type: 'global',
    },
    points: [dataPoint],
  };

  const request = {
    name: projectPath,
    timeSeries: [timeSeries],
  };

  await client.createTimeSeries(request);
}

Performance Monitoring¶

APM Integration¶

// Pick one APM agent and load it before any other module is required.

// New Relic
require('newrelic');

// DataDog
const tracer = require('dd-trace').init({
  service: 'nexus',
  env: process.env.NODE_ENV
});

// AppDynamics
require('appdynamics').profile({
  controllerHostName: 'controller.appdynamics.com',
  controllerPort: 443,
  controllerSslEnabled: true,
  accountName: 'your-account',
  accountAccessKey: 'your-access-key',
  applicationName: 'Nexus',
  tierName: 'Web',
  nodeName: process.env.HOSTNAME
});

Monitoring Best Practices¶

Metric Guidelines¶

Use appropriate metric types:
  Counters for cumulative values
  Gauges for point-in-time values
  Histograms for distributions
Label wisely:
  Keep cardinality low
  Use meaningful label names
  Avoid user-specific labels
Monitor what matters:
  Business metrics
  Application performance
  Infrastructure health
  User experience
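
As an example of keeping label cardinality low, raw request paths can be collapsed into a bounded set of route patterns before they are used as a label value. This is a minimal sketch; `normalizeRoute` and its two patterns (numeric IDs and UUIDs) are assumptions to adapt to the identifiers an application actually uses.

```javascript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace variable path segments with placeholders and drop the query string.
function normalizeRoute(path) {
  const [pathname] = path.split('?');
  const segments = pathname.split('/').map((segment) => {
    if (/^\d+$/.test(segment)) return ':id';
    if (UUID.test(segment)) return ':uuid';
    return segment;
  });
  return segments.join('/') || '/';
}

console.log(normalizeRoute('/users/42/orders/7?page=2')); // /users/:id/orders/:id
```

With this in place, a million distinct user URLs still produce a single `/users/:id` series per method and status.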

Alert Guidelines¶

Alert on symptoms, not causes
Keep alerts actionable
Avoid alert fatigue
Use appropriate severities
Include runbook links

Dashboard Guidelines¶

Focus on key metrics
Use consistent time ranges
Include context and annotations
Organize by audience
Keep it simple and readable

Troubleshooting¶

Common Issues¶

High Cardinality Metrics¶

# Count distinct metric names
curl -s http://prometheus:9090/api/v1/label/__name__/values | jq '.data | length'

# Find the metrics with the most series
promtool query instant http://prometheus:9090 'topk(10, count by (__name__)({__name__=~".+"}))'

Missing Metrics¶

# Check target status
curl -s http://prometheus:9090/api/v1/targets

# List only active targets
curl -s 'http://prometheus:9090/api/v1/targets?state=active'
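
The targets response is verbose, so a small filter helps when looking for the failing scrape. The sketch below assumes the response has been saved to a file or fetched as JSON. `summarizeTargets` is a hypothetical helper, and the sample payload is made up but follows the shape of the Prometheus HTTP API response.

```javascript
// List scrape targets that are not healthy, with the last scrape error.
function summarizeTargets(response) {
  const active = response?.data?.activeTargets ?? [];
  return active
    .filter((target) => target.health !== 'up')
    .map((target) => ({
      job: target.labels.job,
      instance: target.labels.instance,
      health: target.health,
      lastError: target.lastError || '(none reported)',
    }));
}

// Made-up sample in the shape returned by /api/v1/targets.
const sample = {
  status: 'success',
  data: {
    activeTargets: [
      { labels: { job: 'prometheus', instance: 'localhost:9090' }, health: 'up', lastError: '' },
      { labels: { job: 'nexus', instance: 'nexus:8080' }, health: 'down', lastError: 'connection refused' },
    ],
    droppedTargets: [],
  },
};

console.table(summarizeTargets(sample));
```

A `down` target with a connection error usually points at networking or a wrong port, while an HTTP error in `lastError` suggests a wrong `metrics_path`.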

Grafana Dashboard Issues¶

# Check Grafana logs
docker logs nexus-grafana

# Test data source connection
curl -H "Authorization: Bearer $API_KEY" \
  'http://grafana:3000/api/datasources/proxy/1/api/v1/query?query=up'
