Observability - TheCodeWalkers

Observability is a key requirement for monitoring and debugging microservice architectures. With the rise of distributed systems and the increasing complexity of modern software applications, observability has become an essential component of any production system.

Observability in microservices involves the ability to gather and analyze data from each individual microservice to provide a comprehensive view of the entire system. This includes the ability to monitor and track performance metrics, identify errors and exceptions, and trace requests as they move through the system.

Here are the three main components of observability:

Logging

Logging is a critical component of observability. Each microservice should generate logs that record important events, such as requests received, responses sent, errors encountered, and other relevant information. These logs should be centralized and easily accessible so that debugging and analysis are straightforward.

Each microservice should support the following log levels, listed in increasing order of severity:
- TRACE → DEBUG → INFO → WARN → ERROR → FATAL
Each log line should contain the following items:
- Date and Time: Millisecond precision and easily sortable.
- Log Level: TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.
- Process ID.
- A --- separator to mark the start of the actual log message.
- Thread name: Enclosed in square brackets (may be truncated for console output).
- Logger name: This is usually the source class name (often abbreviated).
- The log message.
The current log level should be configurable for each microservice.
Each microservice should also have a configurable log file size limit. Once the limit is reached, logging should roll over to a new file.
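
The log line layout, configurable level, and size-based rollover described above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed setup: the function name `build_logger`, the column widths, and the default size limit are assumptions made for the example. Python has no TRACE level and calls the top level CRITICAL, so the mapping below is also an assumption.

```python
import logging
from logging.handlers import RotatingFileHandler

# Layout: timestamp (ms precision), level, process ID, '---' separator,
# [thread name], logger name, then the message.
LOG_FORMAT = (
    "%(asctime)s.%(msecs)03d %(levelname)5s %(process)d --- "
    "[%(threadName)15s] %(name)-40s : %(message)s"
)

# Assumed mapping from the level names in the list above to Python's levels.
LEVELS = {
    "TRACE": logging.DEBUG,      # Python has no TRACE; DEBUG is the closest
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
}


def build_logger(name, log_file, level_name="INFO",
                 max_bytes=10 * 1024 * 1024, backup_count=5):
    """Create a logger that writes formatted lines and rolls over by size."""
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS.get(level_name.upper(), logging.INFO))

    # When the file would exceed max_bytes, it is renamed to <log_file>.1
    # and a fresh file is started; at most backup_count old files are kept.
    handler = RotatingFileHandler(
        log_file, maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8"
    )
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt="%Y-%m-%d %H:%M:%S"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A service could read the level from its own configuration, for example an environment variable such as `LOG_LEVEL` (a hypothetical name), and pass it as `level_name`. Note that Python prints WARNING and CRITICAL where the list above says WARN and FATAL.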

Traces

Tracing involves tracking requests as they move through the system, allowing developers to identify bottlenecks and diagnose problems. Tools such as Jaeger, Zipkin, and Spring Cloud Sleuth can be used to provide distributed tracing capabilities.

Each trace record should include at least the following information:
- Timestamp
- EventType
- TraceID
- SpanID
- ParentID
- ServiceID
- Duration
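
To make the fields above concrete, here is a minimal sketch of a trace record and of how a child span could inherit its TraceID and ParentID from the caller. The `Span` class and the helpers `start_span` and `finish_span` are hypothetical names for this illustration; real tracers such as Jaeger or Zipkin define their own data models and ID formats.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    timestamp: float            # start time, seconds since the Unix epoch
    event_type: str             # e.g. "http.request" (illustrative value)
    trace_id: str               # shared by every span of one request
    span_id: str                # unique to this unit of work
    parent_id: Optional[str]    # None for the root span
    service_id: str             # which microservice produced the span
    duration_ms: float = 0.0    # filled in when the span finishes


def start_span(service_id, event_type, parent=None):
    """Open a span; a child reuses the parent's trace ID."""
    return Span(
        timestamp=time.time(),
        event_type=event_type,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service_id=service_id,
    )


def finish_span(span):
    """Record the elapsed time in milliseconds and return the span."""
    span.duration_ms = (time.time() - span.timestamp) * 1000.0
    return span
```

Because every span of a request carries the same TraceID, a backend can reassemble the full call tree by following the ParentID links.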

Metrics

Metrics provide a quantitative measure of system performance and health. Each microservice should expose metrics such as CPU usage, memory usage, network latency, and other relevant performance indicators. These metrics should be collected and analyzed in a centralized location, such as a monitoring dashboard.

The following metrics could be collected from the microservices:
- Health
- Memory usage
- CPU usage
- REST API metrics
- Process and thread metrics
- User and session metrics
- Garbage collection metrics
- Notification service consumer and event metrics
- Cached object metrics
- Database metrics

All of these metrics can be stored in any time series database.
My personal preference is InfluxDB.

Automated alerts

Automated alerts can notify developers when issues arise, allowing them to quickly identify and resolve problems. Alerting systems can be configured to trigger based on specific metrics, such as error rates or response times.

Popular alerting and on-call tools include:
- PagerDuty
- Grafana OnCall
- OpsGenie
- Splunk On-Call (formerly VictorOps)
- Alertmanager
- BigPanda
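
The threshold-based alerting described above can be sketched in a few lines. This is only an illustration of the idea: `AlertRule`, `evaluate`, and `error_rate` are hypothetical names, and the tools listed here each have their own rule formats, routing, and escalation features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str         # human-readable alert name
    metric: str       # key to look up in the metrics sample
    threshold: float  # the alert fires when the value exceeds this


def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def evaluate(rules, sample):
    """Return the names of rules whose metric is present and above threshold."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append(rule.name)
    return fired
```

For example, with a rule on `error_rate` above 0.05 and another on `p95_latency_ms` above 500, a sample of 8 errors in 100 requests at 320 ms would fire only the error-rate alert.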

Observability Solutions available in the market

There are many solutions available in the market that can help organizations with logging, metrics, and traces for their microservices architecture. Here are some popular options:

Grafana Labs (PLG)
- Logs – Loki, Promtail
- Traces – Tempo
- Metrics – Mimir, Prometheus, Graphite
- Visualize – Grafana
ELK – Elasticsearch, Logstash, Kibana
EFK – Elasticsearch, Fluentd, Kibana
Splunk – A complete commercial logging solution.
Graylog – Another complete logging solution and an open-source alternative to Splunk.

Loki
Loki represents logs by a set of label pairs, much as Prometheus represents metrics. When Loki is deployed alongside Prometheus, logs from Promtail usually carry the same labels as your application's metrics, because both use the same service discovery mechanisms. Having logs and metrics with the same labels lets users switch context seamlessly between metrics and logs, which helps with root cause analysis.
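
To show what "a set of label pairs" means in practice, the sketch below builds a log stream in the JSON shape that Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts: one label set per stream, plus a list of `[timestamp, line]` pairs with the timestamp as a nanosecond string. The function name `build_loki_payload` and the label names are assumptions for the example, and nothing is sent over the network here.

```python
import json
import time


def build_loki_payload(labels, lines, now_ns=None):
    """Wrap log lines in a single Loki stream identified by `labels`."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [<unix epoch in nanoseconds, as a string>, <log line>].
    # The index is added only to keep the example timestamps distinct.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


# The same label set a metrics scrape might attach to this application
# (hypothetical values).
shared_labels = {"job": "orders", "namespace": "shop", "pod": "orders-0"}

payload = build_loki_payload(
    shared_labels,
    ["order 42 accepted", "payment gateway timeout"],
    now_ns=1700000000000000000,
)
body = json.dumps(payload)  # what would be sent as the request body
```

Because the stream is identified only by its labels, a query for the same label set in Prometheus and in Loki returns the metrics and the logs of the same workload.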

Promtail
Promtail’s use case is specifically tailored to Loki. Its main mode of operation is to discover log files stored on disk and forward them, together with a set of labels, to Loki. Promtail can do service discovery for Kubernetes pods running on the same node as Promtail, act as a container sidecar or a Docker logging driver, read logs from specified folders, and tail the systemd journal.