The VAS Experts ePDG Monitoring system provides full operational control of the fast-epdg component, the VoWiFi (Voice over WiFi) gateway operating according to 3GPP TS 29.273 and TS 24.302. The gateway provides secure transmission of voice and packet traffic through untrusted Wi-Fi channels with IPSec / IKEv2 tunneling and integration with the EPC core through SWu, SWm, SWx, S2b, S6b interfaces.
The solution gives the mobile operator's operations teams a single monitoring platform, from the IPSec SA level (L3 security) up to VoWiFi subscriber-experience KPIs.
Metrics are served from the /metrics endpoint built into fast-epdg: no Java, no JMX, no external agents.
<mermaid> flowchart TB
subgraph DataPlane["Data Plane"]
IPSEC["IPSec ESP<br/>IKEv2 SA / Child SA<br/>Kernel xfrm"]
GTPU["GTP-U Tunneller<br/>S2b Data<br/>ePDG ↔ PGW"]
end
subgraph ControlPlane["Control Plane"]
IKE["IKEv2 SWu<br/>EAP-AKA' auth"]
DIAM["Diameter Client<br/>SWx/SWm/S6b"]
GTPC["GTPv2-C S2b<br/>to PGW/SMF"]
CTRL["ePDG Controller<br/>Attach/Detach FSM"]
end
subgraph Collection["Metrics Collection"]
PROMEXP["fast-epdg<br/>/metrics endpoint<br/>:9817"]
end
subgraph Storage["Storage"]
PROM["Prometheus<br/>TSDB<br/>15-day retention"]
end
subgraph Visualization["Visualization"]
GRAF["Grafana<br/>4 dashboards, 35+ panels"]
end
subgraph Alerting["Alerting"]
AM["Alertmanager<br/>Routing / Inhibition"]
EMAIL["Email SMTP"]
SNMPGW["SNMP Trap Sender<br/>Webhook → Trap gateway"]
NMS["External NMS<br/>SNMP v2c UDP/162"]
WH["Webhooks<br/>Telegram / PagerDuty"]
end
IKE --> PROMEXP
IPSEC --> PROMEXP
GTPC --> PROMEXP
GTPU --> PROMEXP
DIAM --> PROMEXP
CTRL --> PROMEXP
PROMEXP --> PROM
PROM --> GRAF
PROM --> AM
AM --> EMAIL
AM --> SNMPGW
SNMPGW --> NMS
AM --> WH
</mermaid>
| Level | Component | Technology |
|---|---|---|
| Collection | Built-in fast-epdg /metrics endpoint | Prometheus text format over HTTP |
| Storage | Prometheus TSDB | Local storage, 15-day retention by default |
| Visualization | Grafana + JSON provisioning | 4 dashboards loaded automatically |
| Alerting | Alertmanager + SNMP Trap Sender | PromQL rules → webhook → SNMP v2c trap |
<mermaid> flowchart LR
EXP["fast-epdg<br/>/metrics :9817"]
EXP --> CFG["Config<br/>2 metrics"]
EXP --> NET["Network<br/>1 metric"]
EXP --> PROTO["Protocols L5-L7<br/>15 metrics"]
EXP --> SVC["Service KPI<br/>4 metrics"]
EXP --> SESS["Session State<br/>4 metrics"]
EXP --> APP["Application<br/>3 metrics"]
EXP --> SYS["System<br/>4 metrics"]
PROTO --> IKEV2["IKEv2<br/>SWu: 3"]
PROTO --> GTPC["GTPv2-C<br/>S2b: 4"]
PROTO --> GTPU["GTP-U<br/>S2b data: 3"]
PROTO --> DIA["Diameter<br/>SWm/SWx/S6b: 5"]
</mermaid>
| Category | Number of metrics | Polling interval | Key indicators |
|---|---|---|---|
| Config | 2 | 10 sec | Configuration status, reload counter |
| Network | 1 | 10 sec | Node connection status (PGW/AAA/HSS) |
| IKEv2 (SWu) | 3 | 10 sec | Messages by type (IKE_SA_INIT, IKE_AUTH, CREATE_CHILD_SA), latency histogram, errors |
| GTPv2-C (S2b) | 4 | 10 sec | Messages (Create/Modify/Delete Session), latency, errors, retransmissions |
| GTP-U data plane | 3 | 10 sec | Packets/bytes, tunneling errors |
| Diameter (SWm/SWx/S6b) | 5 | 10 sec | Messages by command code (DER/DEA, MAR/MAA, AAR/AAA), latency, errors, watchdog, connection status |
| Service KPI | 4 | 10 sec | Attach success rate, duration histogram, service availability, uptime |
| Session State | 4 | 10 sec | IKE SAs, Child SAs, GTP sessions, total subscribers |
| Application | 3 | 10 sec | Thread count, memory, log messages by level |
| System | 4 | 10 sec | CPU utilization, memory, disk usage, open FDs |
| Total | 33 | | |
All metrics have the prefix epdg_ and are organized in a hierarchy:
epdg_
├── config_*    # Configuration
├── network_*   # Network layer
├── ikev2_*     # SWu (IKEv2/IPSec)
├── gtp_*       # S2b control plane, GTPv2-C
├── gtpu_*      # S2b data plane, GTP-U
├── diameter_*  # SWm/SWx/S6b
├── service_*   # Service KPIs (attach, availability, uptime)
├── session_*   # Session state (IKE SA, Child SA, GTP, subscribers)
├── app_*       # Application metrics (memory, threads, logs)
└── system_*    # System metrics (CPU, memory, disk, file descriptors)
All metrics are exported through a single /metrics endpoint in Prometheus text format. Names follow the Prometheus conventions: epdg_<group>_<name>[_unit]. Counters carry the suffix _total; histograms carry a unit suffix such as _seconds or _bytes.
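The naming scheme above can be checked mechanically. The following is a minimal Python sketch, not part of the product: the helper names `split_metric_name` and `check_suffix` are hypothetical, and the group list is simply copied from the hierarchy shown earlier.

```python
import re

# Groups copied from the epdg_ hierarchy in this document.
GROUPS = ("config", "network", "ikev2", "gtpu", "gtp", "diameter",
          "service", "session", "app", "system")

# epdg_<group>_<name>[_unit]; the unit, when present, is part of the tail.
NAME_RE = re.compile(r"^epdg_(%s)_([a-z0-9_]+)$" % "|".join(GROUPS))


def split_metric_name(name):
    """Return (group, tail) for a conforming name, or None otherwise."""
    match = NAME_RE.match(name)
    return (match.group(1), match.group(2)) if match else None


def check_suffix(name, metric_type):
    """Apply the suffix rules stated for Counter and Histogram metrics."""
    if metric_type == "counter":
        return name.endswith("_total")
    if metric_type == "histogram":
        return name.endswith(("_seconds", "_bytes"))
    return True  # the text states no suffix rule for gauges


if __name__ == "__main__":
    print(split_metric_name("epdg_gtpu_packets_total"))
    print(check_suffix("epdg_ikev2_request_duration_seconds", "histogram"))
```

A check like this could run in CI against a saved scrape to catch names that drift from the convention.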
| Name | Type | Purpose |
|---|---|---|
| epdg_config_status | Gauge | Component configuration status (0=error, 1=ok) |
| epdg_config_reload_total | Counter | Configuration reload counter (success/failure) |

| Name | Type | Purpose |
|---|---|---|
| epdg_network_connection_status | Gauge | TCP/UDP connection status to a node (0=down, 1=up); applies to PGW (S2b), AAA (SWm), HSS (SWx) |

| Name | Type | Purpose |
|---|---|---|
| epdg_ikev2_messages_total | Counter | IKEv2 message counter (IKE_SA_INIT / IKE_AUTH / CREATE_CHILD_SA / INFORMATIONAL) |
| epdg_ikev2_request_duration_seconds | Histogram | IKEv2 response time |
| epdg_ikev2_errors_total | Counter | IKEv2 errors (NO_PROPOSAL_CHOSEN, AUTHENTICATION_FAILED, INVALID_SYNTAX, etc.) |

| Name | Type | Purpose |
|---|---|---|
| epdg_gtp_messages_total | Counter | GTPv2-C messages (Create/Modify/Delete Session, Echo) |
| epdg_gtp_request_duration_seconds | Histogram | Request-to-response latency |
| epdg_gtp_errors_total | Counter | GTP-C errors by Cause code |
| epdg_gtp_retransmissions_total | Counter | Retransmitted GTP-C requests |

| Name | Type | Purpose |
|---|---|---|
| epdg_gtpu_packets_total | Counter | Packets through the GTP-U tunnel (uplink/downlink) |
| epdg_gtpu_bytes_total | Counter | Bytes through the GTP-U tunnel |
| epdg_gtpu_errors_total | Counter | Tunneling errors (TEID mismatch, decap fail) |

| Name | Type | Purpose |
|---|---|---|
| epdg_diameter_messages_total | Counter | DER/DEA (SWm), MAR/MAA (SWx), AAR/AAA (S6b), STR/STA |
| epdg_diameter_request_duration_seconds | Histogram | Diameter request-to-response latency |
| epdg_diameter_errors_total | Counter | Errors by Experimental-Result-Code |
| epdg_diameter_watchdog_status | Gauge | DWR/DWA watchdog status to node (0=timeout, 1=ok) |
| epdg_diameter_connection_status | Gauge | Diameter connection status to node (0=disconnected, 1=connected) |

| Name | Type | Purpose |
|---|---|---|
| epdg_service_attach_total | Counter | Attach attempts (success/failure) per APN |
| epdg_service_attach_duration_seconds | Histogram | Attach duration (IKE_SA_INIT → session ready) |
| epdg_service_availability | Gauge | Availability flag (0=down, 1=up) |
| epdg_service_uptime_seconds | Gauge | Service uptime |

| Name | Type | Purpose |
|---|---|---|
| epdg_session_ike_sa_total | Gauge | Active IKE SAs |
| epdg_session_child_sa_total | Gauge | Active Child SAs (IPSec tunnels) |
| epdg_session_gtp_sessions_total | Gauge | Active GTP-C sessions on S2b |
| epdg_session_subscribers_total | Gauge | Unique subscribers (connected UEs) |

| Name | Type | Purpose |
|---|---|---|
| epdg_app_threads_total | Gauge | Total number of worker threads |
| epdg_app_memory_bytes | Gauge | Process memory by type |
| epdg_app_log_messages_total | Counter | Log messages by level (debug/info/warn/error/fatal) |

| Name | Type | Purpose |
|---|---|---|
| epdg_system_cpu_usage_percent | Gauge | CPU utilization |
| epdg_system_memory_bytes | Gauge | System memory |
| epdg_system_disk_bytes | Gauge | Disk space |
| epdg_system_open_fds | Gauge | Open file descriptors |

| Type | Description |
|---|---|
| Counter | Monotonically increasing counter (messages, errors, reloads) |
| Gauge | Current value (active sessions, memory, status) |
| Histogram | Distribution of observed values across buckets (duration, lifetime) |
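Percentiles such as p95 are not stored in a Histogram directly; they are estimated from cumulative bucket counts. The Python sketch below illustrates that estimation with linear interpolation inside the matching bucket, in the spirit of PromQL's histogram_quantile(). The bucket boundaries and counts are invented sample data, not values exported by fast-epdg.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls into the +Inf bucket: report the highest
                # finite boundary, since no upper limit is known.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets for a *_duration_seconds histogram.
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.95, SAMPLE_BUCKETS))
```

Because the result is interpolated, its precision depends on how finely the buckets are spaced around the threshold of interest.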
<mermaid> flowchart LR
CORE["VAS Experts<br/>ePDG Monitoring"]
CORE --> P["Prometheus<br/>CNCF / OpenMetrics"]
CORE --> S["SNMP v2c<br/>EPDG-MIB"]
CORE --> G["Grafana<br/>JSON Provisioning"]
CORE --> W["Webhooks<br/>ChatOps"]
CORE --> AM["Alertmanager<br/>Routing"]
P --> P1["Cloud-native NMS<br/>Thanos / Cortex / Mimir"]
S --> S1["Legacy NMS<br/>HP OpenView, NetAct<br/>IBM Tivoli"]
G --> G1["NOC Wall Displays<br/>Drill-down Analytics"]
W --> W1["Telegram / Slack<br/>PagerDuty / OpsGenie"]
AM --> AM1["Smart routing<br/>Severity-based"]
</mermaid>
The native /metrics endpoint on port 9817 is built into fast-epdg. It serves the standard Prometheus text format v0.0.4 (compatible with OpenMetrics). Aggregation into the operator's central Prometheus is supported, as is remote_write for long-term storage in Thanos, Cortex, or Grafana Mimir.
The EPDG-MIB exposes 47 OIDs covering the Prometheus metrics, plus 14 trap notifications (raise/clear pairs according to RFC 3877 ALARM-MIB). Compatible with HP OpenView, IBM Tivoli NetCool, Nokia NetAct, Huawei U2000.
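To show what a consumer of this text format has to handle, here is a small Python sketch that parses sample lines into (name, labels, value) tuples. It is a simplified reader, not a full implementation of the exposition format, and the label names in the sample (`peer`, `interface`) are assumptions made for illustration; the actual label set of fast-epdg is not documented here.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format samples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and empty lines
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload; label names are hypothetical.
SAMPLE_TEXT = """\
# HELP epdg_service_availability Availability flag
# TYPE epdg_service_availability gauge
epdg_service_availability 1
epdg_network_connection_status{peer="pgw",interface="s2b"} 0
"""

if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE_TEXT):
        print(sample)
```

In practice the text would come from an HTTP GET against port 9817, for example with `urllib.request.urlopen`, and a production consumer would normally rely on Prometheus itself or an established client library.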
<mermaid> flowchart TB
IANA["IANA PEN<br/>enterprises<br/>.1.3.6.1.4.1"]
VAS["VAS Experts<br/>.1.3.6.1.4.1.43823<br/>(vas.expert)"]
EPDG["EPDG-MIB<br/>.43823.1"]
EPC["EPC Monitoring<br/>.43823.100"]
IANA --> VAS
VAS --> EPDG
VAS --> EPC
EPDG --> OBJ["epdgObjects<br/>.43823.1.1"]
EPDG --> NOTIF["epdgNotifications<br/>.43823.1.2<br/>14 trap types"]
EPDG --> CONF["epdgConformance<br/>.43823.1.3"]
OBJ --> SERVICE["service .1.1.1<br/>4 OID"]
OBJ --> IKE["ikev2 .1.1.2<br/>6 OID"]
OBJ --> GTP["gtp .1.1.3<br/>8 OID"]
OBJ --> DIAM["diameter .1.1.4<br/>7 OID"]
OBJ --> SESS["sessions .1.1.5<br/>8 OID"]
OBJ --> SYS["system .1.1.6<br/>8 OID"]
OBJ --> NET["network .1.1.7<br/>6 OID"]
NOTIF --> TRAPAGR["7 raise / 7 clear<br/>pairs"]
</mermaid>
Examples of SNMP requests:
# The entire ePDG tree
snmpwalk -v2c -c public <host> .1.3.6.1.4.1.43823.1
# Service availability (Gauge 0..1)
snmpget -v2c -c public <host> .1.3.6.1.4.1.43823.1.1.0
Four dashboards are supplied as JSON (35+ panels in total).
They are installed automatically through the Grafana provisioning API. The layout is responsive and designed for Network Operations Center (NOC) wall displays, with auto-refresh every 15 seconds.
Webhook interface for integration with any notification system: Telegram Bot, Slack, PagerDuty Events API v2, OpsGenie, Microsoft Teams. A separate SNMP Trap Sender service converts Alertmanager webhooks to SNMP v2c traps with Enterprise OID.
| Severity | Alert | Description | Response |
|---|---|---|---|
| Critical | ePDG_Service_Down, ePDG_High_Attach_Failure_Rate, ePDG_PGW_Unreachable, ePDG_AAA_Unreachable, ePDG_Diameter_Watchdog_Timeout | Component is unavailable, widespread connection failures, nodes are unavailable | Immediate escalation: Email + SNMP Trap + Webhook. Repeat every hour |
| Warning | ePDG_High_IKEv2_Latency, ePDG_High_GTP_Latency, ePDG_High_IKEv2_Error_Rate, ePDG_High_GTP_Error_Rate, ePDG_High_Memory_Usage, ePDG_High_CPU_Usage, ePDG_Low_Disk_Space, ePDG_High_Error_Log_Rate | Performance degradation, resource anomalies | Email. Resend every 4 hours. Suppressed if a “Critical” status is present on the same component |
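The routing and suppression behaviour in the table can be summarised in a few lines of Python. This is an illustrative model of the policy only, not Alertmanager configuration; the `Alert` tuple, the `route_alerts` helper, and the component values used in the example are hypothetical.

```python
from collections import namedtuple

Alert = namedtuple("Alert", "name severity component")

# Receivers and repeat intervals as listed in the table above.
ROUTES = {
    "critical": {"receivers": ["email", "snmp_trap", "webhook"], "repeat_hours": 1},
    "warning": {"receivers": ["email"], "repeat_hours": 4},
}


def route_alerts(firing):
    """Map each delivered alert to its route, applying inhibition.

    A warning is dropped when a critical alert is firing on the same
    component, mirroring the suppression rule in the table.
    """
    critical_components = {a.component for a in firing if a.severity == "critical"}
    routed = {}
    for alert in firing:
        if alert.severity == "warning" and alert.component in critical_components:
            continue  # inhibited by a critical alert
        routed[alert.name] = ROUTES[alert.severity]
    return routed


if __name__ == "__main__":
    firing = [
        Alert("ePDG_PGW_Unreachable", "critical", "gtp"),
        Alert("ePDG_High_GTP_Latency", "warning", "gtp"),
        Alert("ePDG_High_CPU_Usage", "warning", "system"),
    ]
    for name, route in route_alerts(firing).items():
        print(name, route["receivers"], route["repeat_hours"])
```

In this example the GTP latency warning is suppressed because a critical alert is already firing on the same component, while the CPU warning is still delivered by email.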
<mermaid> flowchart LR
AL["ePDG Alert Rules<br/>20+"]
AL --> CR["Critical<br/>5 rules"]
AL --> WR["Warning<br/>8 rules"]
AL --> INFO["Recording<br/>34 rules"]
CR --> C1["Service_Down<br/>availability == 0"]
CR --> C2["Attach_Failure_Rate<br/>> 10%"]
CR --> C3["PGW_Unreachable<br/>connection_status{s2b} == 0"]
CR --> C4["AAA_Unreachable<br/>connection_status{swm} == 0"]
CR --> C5["Diameter_Watchdog_Timeout<br/>watchdog_status == 0"]
WR --> W1["High_IKEv2_Latency<br/>p95 > 1.0 s"]
WR --> W2["High_GTP_Latency<br/>p95 > 0.5 s"]
WR --> W3["High_IKEv2_Error_Rate<br/>> 5%"]
WR --> W4["High_GTP_Error_Rate<br/>> 5%"]
WR --> W5["High_Memory_Usage<br/>> 80%"]
WR --> W6["High_CPU_Usage<br/>> 80%"]
WR --> W7["Low_Disk_Space<br/>< 10%"]
WR --> W8["High_Error_Log_Rate<br/>> 10/s"]
INFO --> I1["attach_success_rate<br/>preaggregated"]
INFO --> I2["p95_p99_latency<br/>preaggregated"]
INFO --> I3["throughput<br/>preaggregated"]
</mermaid>
<mermaid> sequenceDiagram
participant M as Metric (Prometheus)
participant R as Alert Rule (PromQL)
participant AM as Alertmanager
participant E as Email (SMTP)
participant SG as SNMP Trap Gateway
participant NMS as External NMS
participant W as Webhook (ChatOps)
M->>R: Value exceeds the threshold
R->>R: Wait (for: 1-10 min)
R->>AM: Alert FIRING
AM->>AM: Group by [alertname, component]
AM->>AM: Inhibition check (critical overrides warning)
alt severity = critical
AM->>E: Email [CRITICAL]
AM->>SG: Webhook → SNMP Trap
SG->>NMS: SNMP v2c Trap (OID .1.3.6.1.4.1.43823.1.2.X)
AM->>W: Webhook (Telegram / PagerDuty)
else severity = warning
AM->>E: Email [WARNING]
end
Note over M,R: The metric returns to normal
R->>AM: Alert RESOLVED
AM->>SG: clear-trap (paired notification)
AM->>E: Email [RESOLVED]
</mermaid>
Alerts are grouped by alertname + component with a 30-second window. The for clause on each rule prevents false positives.
| Dashboard | Panels | Purpose |
|---|---|---|
| ePDG Overview | 10 | Service availability, connection success rate, number of active sessions, SWu/SWm/S2b status, interface bandwidth |
| IKEv2 Details | 10 | Messages per second by type, request duration histogram, 95th-percentile latency, errors by type, IKE SA life cycle |
| GTP Details | 8 | GTPv2-C messages to the PGW, retransmissions, errors by cause code, GTP-U bearer traffic (uplink/downlink) |
| Diameter Details | 7 | Message counts per application (SWm/SWx/S6b), request duration, watchdog timer state, distribution of result codes, timeline of connection states |
<mermaid> flowchart TB
NOC["NOC Dashboard Layer"]
NOC --> OVER["ePDG Overview<br/>KPI Summary"]
NOC --> IKE["IKEv2 Details<br/>Drill-down"]
NOC --> GTP["GTP Details<br/>Drill-down"]
NOC --> DIA["Diameter Details<br/>Drill-down"]
OVER -->|Click attach KPI| IKE
OVER -->|Click session count| GTP
OVER -->|Click peer status| DIA
</mermaid>
ePDG monitoring is fully integrated into overall packet core monitoring:
<mermaid> flowchart TB
subgraph Common["Unified Monitoring Stack"]
PROM["Prometheus"]
GRAF["Grafana"]
AM["Alertmanager"]
end
subgraph Sources["Sources of EPC metrics"]
DPI["FastDPI<br/>:9110"]
SMF["SMF /metrics<br/>:9090"]
PCEF["fast-pcef /metrics<br/>:9090"]
PCRF["FastPCRF"]
EPDG["fast-epdg<br/>:9817"]
end
DPI --> PROM
SMF --> PROM
PCEF --> PROM
PCRF --> PROM
EPDG --> PROM
PROM --> GRAF
PROM --> AM
</mermaid>
The NOC operator sees all EPC components (DPI, SMF, PCEF, FastPCRF, ePDG) in a single Grafana interface, with a single alarm system and notification routing through one Alertmanager.
<mermaid> graph LR
L1["L1 Physical<br/>NIC counters via system"]
L2["L2 Data Link<br/>MAC, VLAN"]
L3["L3 Network<br/>IP, IPSec ESP, GTP-U"]
L4["L4 Transport<br/>TCP/UDP/SCTP"]
L5["L5 Session<br/>GTPv2-C, IKEv2"]
L6["L6 Presentation<br/>IKEv2/IPSec encryption, EAP-AKA'"]
L7["L7 Application<br/>Diameter, service bearer ops"]
Operations["Operations<br/>KPI, SLA, Capacity"]
CX["CX Level<br/>Subscriber Experience"]
L1 --> L2 --> L3 --> L4 --> L5 --> L6 --> L7 --> Operations --> CX
style L1 fill:#e74c3c,color:#fff
style L2 fill:#e67e22,color:#fff
style L3 fill:#f39c12,color:#fff
style L4 fill:#2ecc71,color:#fff
style L5 fill:#1abc9c,color:#fff
style L6 fill:#3498db,color:#fff
style L7 fill:#9b59b6,color:#fff
style Operations fill:#34495e,color:#fff
style CX fill:#2c3e50,color:#fff
</mermaid>
OSI model:
| Level | Metrics | Examples |
|---|---|---|
| L1/L2 Physical / Data Link | - | Covered by a separate node_exporter or equivalent at the OS level (not included in the ePDG metrics list) |
| L3 Network / IPSec tunnels | 3 | epdg_gtpu_packets_total, epdg_gtpu_bytes_total, epdg_gtpu_errors_total — GTP-U data plane |
| L4 Transport | 1 | epdg_network_connection_status — TCP connections to nodes (PGW/AAA/HSS) |
| L5 Session | 3 | epdg_session_ike_sa_total, epdg_session_child_sa_total, epdg_session_gtp_sessions_total |
| L6 Presentation/Security | 3 | epdg_ikev2_messages_total, epdg_ikev2_request_duration_seconds, epdg_ikev2_errors_total — IKEv2/IPSec encryption and EAP-AKA authentication |
| L7 Application | 9 | epdg_diameter_* (SWm/SWx/S6b, 5 metrics), epdg_gtp_* (GTPv2-C, 4 metrics) |
Operator level:
| Level | Metrics | Examples |
|---|---|---|
| Operations | 11 | epdg_service_availability, epdg_service_uptime_seconds, epdg_app_* (3), epdg_system_* (4), epdg_config_* (2) |
| Customer Experience | 3 | epdg_service_attach_duration_seconds p95, epdg_service_attach_total (success rate), epdg_ikev2_request_duration_seconds p99 |
| QoE indicator | Source metrics | Interpretation |
|---|---|---|
| VoWiFi connection time | epdg_service_attach_duration_seconds p95 | > 3 seconds: subscriber notices a delay when switching to WiFi |
| Continuity of service | epdg_session_ike_sa_total delta | Mass drop of > 50 IKE SAs = availability issue |
| Authentication success | ePDG_High_Attach_Failure_Rate alert rate | > 5% = HSS/AAA node problem |
| Bearer setup delay | epdg_gtp_request_duration_seconds{msg=create-session} p99 | > 500 ms: delayed availability of the voice channel |
| GTP-U tunnel | epdg_gtpu_errors_total rate / epdg_gtpu_packets_total | > 0.1% = degradation of voice quality |
| IKEv2-reliability | epdg_ikev2_errors_total by type | NO_PROPOSAL_CHOSEN / AUTHENTICATION_FAILED — problems with certs / UE |
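The thresholds in this table can be applied to values already computed from the metrics (percentiles, counter deltas). The Python sketch below shows one way to do that; the function name, its parameters, and the finding labels are hypothetical, and the inputs are assumed to be measured over the same time window.

```python
def ratio(numerator, denominator):
    """Safe ratio of two counter deltas; 0.0 when there was no traffic."""
    return numerator / denominator if denominator else 0.0


def qoe_findings(attach_p95_s, ike_sa_drop, create_session_p99_s,
                 gtpu_errors, gtpu_packets):
    """Return the QoE problems indicated by the thresholds in the table."""
    findings = []
    if attach_p95_s > 3.0:
        findings.append("slow VoWiFi attach")
    if ike_sa_drop > 50:
        findings.append("IKE SA mass drop")
    if create_session_p99_s > 0.5:
        findings.append("slow bearer setup")
    if ratio(gtpu_errors, gtpu_packets) > 0.001:  # 0.1%
        findings.append("GTP-U degradation")
    return findings


if __name__ == "__main__":
    # Example inputs are invented for illustration.
    print(qoe_findings(attach_p95_s=3.5, ike_sa_drop=0,
                       create_session_p99_s=0.2,
                       gtpu_errors=5, gtpu_packets=10000))
```

In a real deployment these comparisons would normally live in PromQL alert or recording rules; the sketch only makes the arithmetic behind the table explicit.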
| Standard | Area | Application |
|---|---|---|
| 3GPP TS 29.273 | SWx/S6b/SWm | Methodology for counting Diameter messages and result codes |
| 3GPP TS 24.302 | SWu (IKEv2) | Definition of IKEv2 message types and error codes |
| 3GPP TS 33.402 | 3GPP security for non-3GPP access | EAP-AKA'/IKEv2 security parameters |
| 3GPP TS 23.402 | Non-3GPP access architecture | Interface structure (SWu/SWm/SWx/S6b/S2b) |
| 3GPP TS 32.421 | Performance Measurement | KPI collection methodology |
| 3GPP TS 32.409 | Performance measurement charging | Counter structure |
| IETF RFC 7296 | IKEv2 | Message types, error notifications, SA states |
| IETF RFC 6733 | Diameter | Command codes, Result-Codes |
| IETF RFC 4187 | EAP-AKA | Authentication via SIM |
| IETF RFC 3877 | ALARM MIB | Enterprise MIB structure for alarms |
| IETF RFC 3418 | SNMPv2 MIB | SNMP v2c compatibility |
| Prometheus Exposition Format | Metrics (v0.0.4) | Metric export format |
| OpenMetrics | CNCF Standard | Forward compatibility |
<mermaid> flowchart TB
subgraph Host1["ePDG Server"]
EPDG["fast-epdg<br/>(VoWiFi gateway)"]
PLUGIN["/metrics endpoint<br/>:9817"]
EPDG -.-> PLUGIN
end
subgraph Host2["Monitoring server"]
PROM["Prometheus"]
GRAF["Grafana"]
AM["Alertmanager"]
SNMPTRAP["SNMP Trap Sender<br/>(webhook gateway)"]
PROM --> GRAF
PROM --> AM
AM --> SNMPTRAP
end
subgraph Host3["External systems"]
NMS["Operator NMS<br/>(HP OpenView /<br/>NetAct / Tivoli)"]
CHAT["ChatOps<br/>(Telegram / PagerDuty)"]
end
PLUGIN -->|HTTP :9817/metrics| PROM
SNMPTRAP -->|UDP 162| NMS
AM -->|Webhook| CHAT
</mermaid>
| Parameter | Value |
|---|---|
| Metrics footprint | Integrated (~2 MB memory overhead) |
| External dependencies | None: self-contained fast-epdg package (RPM) |
| Management | systemd unit fast-epdg.service |
| Configuration | The monitoring section in fast-epdg.conf |
| Update | Configuration reload without service interruption |
| OS | Linux (RHEL/CentOS 8+, Ubuntu 22.04+) |
| Port | 9817 TCP (listening on 0.0.0.0, configurable) |
| Deployment time | < 5 minutes (enable the plugin in the config file + restart) |
The monitoring section in fast-epdg.conf:
monitoring {
  enabled = yes
  listen_port = 9817
  listen_address = 0.0.0.0
  update_interval = 10
  metrics {
    ikev2 = yes
    gtp = yes
    diameter = yes
    service = yes
    session = yes
    app = yes
    system = yes
  }
}
Each group of metrics can be independently turned on/off without recompilation.
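For tooling that needs to know which groups are enabled, the block shown above can be read with a very small parser. The Python sketch below assumes the grammar is exactly what the example shows (`name {`, `key = value`, `}`) and treats lines starting with `#` as comments, which is an assumption; the real fast-epdg.conf syntax may have more features.

```python
def parse_conf(text):
    """Parse `name {` / `key = value` / `}` lines into nested dicts."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # comment syntax is assumed
            continue
        if line.endswith("{"):
            block = {}
            stack[-1][line[:-1].strip()] = block
            stack.append(block)
        elif line == "}":
            if len(stack) > 1:
                stack.pop()
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            stack[-1][key] = value
    return root


def enabled_groups(conf):
    """List metric groups set to `yes` in monitoring.metrics."""
    metrics = conf.get("monitoring", {}).get("metrics", {})
    return sorted(name for name, value in metrics.items() if value == "yes")


SAMPLE_CONF = """
monitoring {
  enabled = yes
  listen_port = 9817
  listen_address = 0.0.0.0
  update_interval = 10
  metrics {
    ikev2 = yes
    gtp = yes
    diameter = no
  }
}
"""

if __name__ == "__main__":
    conf = parse_conf(SAMPLE_CONF)
    print(conf["monitoring"]["listen_port"], enabled_groups(conf))
```

The sample deliberately sets one group to `no` to show the effect of switching a group off.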