Observability


Observability is a key requirement for monitoring and debugging microservice architectures. With the rise of distributed systems and the increasing complexity of modern software applications, observability has become an essential component of any production system.

Observability in microservices involves the ability to gather and analyze data from each individual microservice to provide a comprehensive view of the entire system. This includes the ability to monitor and track performance metrics, identify errors and exceptions, and trace requests as they move through the system.

The three main components of observability are logs, traces, and metrics:

  • Each microservice should support the following log levels, ordered from least to most severe.
    • TRACE → DEBUG → INFO → WARN → ERROR → FATAL
  • Each log line should contain the following items.
    • Date and Time: Millisecond precision and easily sortable.
    • Log Level: ERROR, WARN, INFO, DEBUG, TRACE, or FATAL.
    • Process ID.
    • --- separator to distinguish the start of actual log messages.
    • Thread name: Enclosed in square brackets (may be truncated for console output).
    • Logger name: This is usually the source class name (often abbreviated).
    • The log message.
  • The current log level should be configurable at the microservice level.
  • Each microservice should have a configurable log file size limit. Once the limit is reached, logging should roll over to a new file.
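
As a rough illustration of the logging requirements above, the sketch below uses Python's standard `logging` module. The function name `build_logger`, the column widths, and the default size limit are assumptions made for this example, not part of any particular framework. Note that Python's standard library has no TRACE level, and its FATAL is an alias for CRITICAL.

```python
import logging
from logging.handlers import RotatingFileHandler

# Assumed layout, following the list above:
# timestamp (ms precision), level, process ID, '---', [thread], logger name, message.
LOG_FORMAT = (
    "%(asctime)s.%(msecs)03d %(levelname)5s %(process)d --- "
    "[%(threadName)15.15s] %(name)-30.30s : %(message)s"
)
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # sorts correctly as plain text


def build_logger(name, log_file, level="INFO", max_bytes=10 * 1024 * 1024, backups=5):
    """Create a logger with a configurable level and size-based rollover.

    `build_logger` and its defaults are hypothetical choices for this sketch.
    """
    logger = logging.getLogger(name)
    logger.setLevel(getattr(logging, level.upper()))
    logger.propagate = False  # keep records out of the root logger's handlers
    # Once the file reaches max_bytes it is renamed to <log_file>.1 and a new file is started.
    handler = RotatingFileHandler(log_file, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger.addHandler(handler)
    return logger
```

In a real service the `level` argument would typically come from that service's own configuration, for example an environment variable, so that it can be changed per microservice.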

  • Each trace should carry at least the following information.
    • Timestamp
    • EventTypeTraces
    • TraceID
    • SpanID
    • ParentID
    • ServiceID
    • Duration
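
The trace fields listed above can be sketched as a small data structure. This is only an illustration: the names `Span`, `start_span`, and `finish_span` are hypothetical, `event_type` is one reading of the "EventTypeTraces" item, and the ID lengths (16-byte trace ID, 8-byte span ID) are an assumption borrowed from the W3C Trace Context sizes.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # None for the root span
    service_id: str
    event_type: str
    timestamp: float          # start time, seconds since the epoch
    duration_ms: float = 0.0


def start_span(service_id, event_type, parent=None):
    """Start a root span, or a child span that inherits the parent's trace ID."""
    return Span(
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        service_id=service_id,
        event_type=event_type,
        timestamp=time.time(),
    )


def finish_span(span):
    """Record how long the span took, in milliseconds."""
    span.duration_ms = (time.time() - span.timestamp) * 1000.0
    return span
```

A request entering a gateway would start a root span, and each downstream service would call `start_span(..., parent=incoming_span)`, so that all spans share one TraceID and the ParentID links reconstruct the call tree.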

  • The following metrics can be collected from each microservice:
    • Health
    • Memory usage
    • CPU usage
    • REST API metrics
    • Process and thread metrics
    • User and session metrics
    • Garbage collection metrics
    • Consumer and event metrics for any notification services
    • Cached-object metrics
    • Database metrics

All of these metrics can be stored in any time-series database.

Personal preference: InfluxDB.
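
As a minimal sketch of what collecting a few of these metrics and preparing them for a time-series database might look like, the example below gathers some process-level values with the Python standard library and renders them in InfluxDB line protocol. The function names, the measurement name, and the tag names are assumptions for illustration. The sketch handles only numeric and boolean fields and does not escape special characters in tags.

```python
import gc
import threading
import time


def collect_process_metrics():
    """Gather a few runtime metrics using only the standard library."""
    return {
        "threads": threading.active_count(),
        "gc_collections": sum(s["collections"] for s in gc.get_stats()),
        "cpu_seconds": time.process_time(),
    }


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as: measurement,tag=value field=value timestamp."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    head = measurement + ("," + tag_part if tag_part else "")
    field_items = []
    for key, value in sorted(fields.items()):
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")  # integers carry an 'i' suffix
        else:
            field_items.append(f"{key}={value}")   # floats are written as-is
    return f"{head} {','.join(field_items)} {timestamp_ns}"


if __name__ == "__main__":
    point = to_line_protocol(
        "process",                               # hypothetical measurement name
        {"service": "orders", "host": "node-1"},  # hypothetical tags
        collect_process_metrics(),
        time.time_ns(),
    )
    print(point)
```

Tagging each point with the service and host keeps the series separable per microservice when they all write to the same database.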

Automated alerts

Automated alerts can notify developers when issues arise, allowing them to quickly identify and resolve problems. Alerting systems can be configured to trigger based on specific metrics, such as error rates or response times.

  • PagerDuty
  • Grafana OnCall
  • OpsGenie
  • Splunk On-Call (formerly VictorOps)
  • Alertmanager
  • BigPanda
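
The idea of triggering on a specific metric, as described above, can be sketched as a threshold rule evaluated over a sliding window. This is a simplified illustration and not how any of the listed products work internally. The names `AlertRule` and `AlertEvaluator` and the choice of a windowed average are assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float  # fire when the windowed average exceeds this value
    window: int       # number of most recent samples to average


class AlertEvaluator:
    def __init__(self, rule):
        self.rule = rule
        self.samples = deque(maxlen=rule.window)
        self.firing = False

    def observe(self, value):
        """Record a sample; return 'firing' or 'resolved' on a state change, else None."""
        self.samples.append(value)
        if len(self.samples) < self.rule.window:
            return None  # not enough data to judge yet
        average = sum(self.samples) / len(self.samples)
        breached = average > self.rule.threshold
        if breached and not self.firing:
            self.firing = True
            return "firing"
        if not breached and self.firing:
            self.firing = False
            return "resolved"
        return None
```

Reporting only state changes, rather than every breached sample, avoids paging developers repeatedly for the same ongoing incident. Averaging over a window keeps a single slow request or isolated error from triggering an alert.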

Observability Solutions available in the market

There are many solutions available in the market that can help organizations with logging, metrics, and traces for their microservices architecture. Here are some popular options:

  • Grafana Labs (PLG)
    • Logs – Loki, Promtail
    • Traces – Tempo
    • Metrics – Mimir, Prometheus, Graphite
    • Visualize – Grafana
  • ELK – Elasticsearch, Logstash, Kibana
  • EFK – Elasticsearch, Fluentd, Kibana
  • Splunk – A complete commercial logging solution.
  • Graylog – A complete logging solution and an open-source alternative to Splunk.

Loki

Loki represents logs by a set of label pairs, much as Prometheus represents metrics. When Loki is deployed alongside Prometheus, logs shipped by Promtail usually carry the same labels as your application's metrics, because both use the same service discovery mechanisms. Sharing labels between logs and metrics lets users switch context between the two seamlessly, which helps with root cause analysis.
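
To make the shared-label idea concrete, the sketch below derives both a Prometheus-style metric selector and a Loki-style log selector from one label set, and builds a JSON body in the shape accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). The helper names, the label values, and the metric name are hypothetical. The sketch only builds strings and sends nothing over the network.

```python
import json
import time


def _label_pairs(labels):
    return ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))


def prometheus_selector(metric, labels):
    """Selector for a metric carrying the given labels."""
    return f"{metric}{{{_label_pairs(labels)}}}"


def logql_selector(labels):
    """Selector for the log stream carrying the same labels."""
    return f"{{{_label_pairs(labels)}}}"


def loki_push_payload(labels, lines, now_ns=None):
    """Build a push body: one stream (label set) with [timestamp_ns, line] pairs."""
    base = time.time_ns() if now_ns is None else now_ns
    values = [[str(base + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


if __name__ == "__main__":
    labels = {"app": "orders", "namespace": "prod"}  # hypothetical labels
    print(prometheus_selector("http_requests_total", labels))
    print(logql_selector(labels))
    print(loki_push_payload(labels, ["order created", "payment failed"]))
```

Because both selectors are generated from the same label set, a dashboard panel showing a metric spike can link straight to the matching log stream.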

Promtail

Promtail is tailored specifically to Loki. Its main mode of operation is to discover log files stored on disk and forward them to Loki, each associated with a set of labels. Promtail can do service discovery for Kubernetes pods running on the same node as Promtail, act as a container sidecar or a Docker logging driver, read logs from specified folders, and tail the systemd journal.
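
The core loop of a log shipper of this kind can be approximated as "read whatever was appended since the last position, then attach labels". The sketch below is a toy illustration of that idea only, not Promtail's implementation. The function names are hypothetical, and log rotation and file truncation are deliberately ignored.

```python
def read_new_lines(path, offset):
    """Return complete lines appended to `path` since byte `offset`, plus the new offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Only consume up to the last newline, so a half-written line is picked up next time.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def label_lines(lines, labels):
    """Pair every log line with the label set of the file it came from."""
    return [{"labels": dict(labels), "line": line} for line in lines]
```

Persisting the returned offset between runs is what lets a shipper resume after a restart without resending or skipping lines.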

Alternatives to Logstash

Fluent Bit (common in Kubernetes environments), Fluentd, Filebeat, Logagent, Rsyslog, syslog-ng, and Apache Flume.
