A Gentle Introduction to Logstash

Centralize, transform & stash your data

CC Chang
Nov 8, 2022

Table of Contents

  • Overview
  • Why Logstash? Issues to solve
  • Case Study: Normalizing Movie Logs from a Kafka Stream
  • Strengths & Limitations

Overview

Logstash is an open-source data collection engine with real-time pipelining capabilities. It can dynamically unify data from disparate sources and normalize the data into destinations of your choice, letting you cleanse and democratize all your data for diverse downstream analytics and visualization use cases.

A Logstash pipeline contains three stages: Input, Filter, and Output.

Input Stage: Logstash uses input plugins to “pull” in data of all shapes, sizes, and sources. For example, we can install the file plugin to let Logstash read log files continuously, or the Kafka plugin to read Kafka streaming data. Logstash also provides a multitude of other input plugins, such as stdin, UDP, TCP, HTTP endpoints, JDBC, Log4j, AWS CloudWatch, RabbitMQ, Twitter feeds, etc.
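
As a minimal sketch (the file path, broker address, and topic name below are just placeholders), an input block that tails a log file and reads a Kafka topic at the same time could look like this:

input {
  # Continuously tail an application log file (placeholder path)
  file {
    path => "/var/log/myapp/app.log"
    start_position => "beginning"
  }
  # Consume events from a Kafka topic (placeholder broker and topic)
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["app-events"]
  }
}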

Filter Stage: Logstash leverages filter plugins to extract the necessary information from the input stream and transform it into a more meaningful, common format. Logstash provides different types of filter plugins for different processing needs; for example, there are plugins for parsing XML, JSON, CSV, and unstructured data, processing API responses, geocoding IP addresses, or working with relational data.
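
For instance, a filter block might combine grok (to pull fields out of an unstructured line) with date (to turn the extracted timestamp into the event’s @timestamp); the pattern and field names here are purely illustrative:

filter {
  # Extract fields from a comma-separated, web-access-style line (illustrative pattern)
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp},%{IPORHOST:client_ip},%{WORD:method} %{URIPATH:path}" }
  }
  # Use the log's own timestamp as the event timestamp
  date {
    match => ["timestamp", "ISO8601"]
  }
}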

Output Stage: Similar to the input stage, Logstash “pushes” the collected and cleansed data to various downstream destinations with the corresponding output plugins, including AWS S3, Elasticsearch, HDFS, CSV files, etc. In addition, Logstash lets you define your own routing logic and send the data to multiple destinations.
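
For example (the Elasticsearch host and index name are placeholders), an output block could index events into Elasticsearch while also echoing them to the console for debugging:

output {
  # Index events into Elasticsearch (placeholder host and index pattern)
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "movie-logs-%{+YYYY.MM.dd}"
  }
  # Print each event to the console while developing the pipeline
  stdout {
    codec => rubydebug
  }
}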

Logstash’s pluggable framework features over 200 plugins, which makes Logstash one of the most versatile solutions for gathering data. Mix, match, and orchestrate different inputs, filters, and outputs to work in pipeline harmony.

Why Logstash? Issues to solve

Nowadays, data comes from a wide variety of sources in different formats (e.g. structured, semi-structured, unstructured), which makes it hard to use effectively. For example, analytics engines can’t process data in arbitrary formats. Also, when datasets are formatted differently, it’s nearly impossible for data scientists or business analysts to compare them and see how they impact one another.

To manage data of different types coming in from different systems, we need a tool that gathers data from multiple sources and transforms it into a common format to generate insights in real time. This way, we can interact with data from different systems simultaneously, and downstream analytics engines, data scientists, and business analysts can make the most of the data.

Logstash is the tool that solves the problem above. It can dynamically unify data from disparate sources and normalize the data into destinations of your choice. Moreover, what distinguishes Logstash from most other data collection services (which simply act as an aggregator or message queue) is its ability to apply filters to the input data and process it.

Case Study: Normalizing Movie Logs from a Kafka Stream

In our movie streaming scenario, three types of semi-structured logs are mixed together in a single Kafka stream (Kafka topic). Our goal is to ingest the mixed logs from this stream, extract the necessary information from each type of log, and separate them into three different structured CSV files.

The following are the three types of logs mixed in the Kafka stream (hypothetical sample lines follow the list):

1. Watch log: the user watches a movie; the movie is split into 1-minute mpg files that are requested by the user as long as they are watching the movie.
   Format: <time>,<userid>,GET /data/m/<movieid>/<minute>.mpg

2. Rate log: the user rates a movie with 1 to 5 stars.
   Format: <time>,<userid>,GET /rate/<movieid>=<rating>

3. Recommendation result log: the user considers watching a movie, and a list of recommendations is requested.
   Format: <time>,<userid>,recommendation request <server>, status <200 for success>, result: <recommendations>, <responsetime> ms
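
To make the formats concrete, here are hypothetical sample lines of each type (the timestamps, user IDs, movie IDs, and values are made up purely for illustration):

2022-11-08T10:15:00,123456,GET /data/m/the+matrix+1999/42.mpg
2022-11-08T10:16:30,123456,GET /rate/the+matrix+1999=5
2022-11-08T10:14:55,123456,recommendation request movie-rec-node:8082, status 200, result: the+matrix+1999, inception+2010, interstellar+2014, 120 ms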

After setting up Kafka and Logstash, we write our pipeline logic in the “logstash.conf” file, which includes three main parts: Input, Filter, and Output.

1. Input Stage: Use the Kafka input plugin to pull streaming movie logs from the topic “movielog5” at the Kafka endpoint “localhost:9092”.
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["movielog5"]
  }
}

2. Filter Stage: Use the grok filter plugin to match patterns and extract the necessary information from each log, then use the mutate filter to tag each event with its log type for the output stage.

filter {
  grok {
    break_on_match => true
    match => [
      "message", "%{TIMESTAMP_ISO8601:time},%{NUMBER:user_id},%{WORD:method} /data/m/%{DATA:movie_id}/%{NUMBER:watch_length}.mpg",
      "message", "%{TIMESTAMP_ISO8601:time},%{NUMBER:user_id},%{WORD:method} /rate/%{DATA:movie_id}=%{NUMBER:rate}",
      "message", "%{TIMESTAMP_ISO8601:time},%{NUMBER:user_id},recommendation request %{HOSTNAME:hostname}:%{NUMBER:port}, status %{NUMBER:http_status_code}, result: %{DATA:recommendation_result}, %{NUMBER:response_time} ms"
    ]
  }
  # Classify each log type by its unique key, and add a metadata field to route it in the output stage.
  if [watch_length] {
    mutate {
      add_field => { "[@metadata][type]" => "watch" }
    }
  } else if [rate] {
    mutate {
      add_field => { "[@metadata][type]" => "rate" }
    }
  } else if [recommendation_result] {
    mutate {
      add_field => { "[@metadata][type]" => "recommendation" }
    }
  } else {
    mutate {
      add_field => { "[@metadata][type]" => "unknown" }
    }
  }
}
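
One optional refinement, not part of the original pipeline, would be a date filter placed inside the filter block right after the grok stanza. Since the grok pattern already treats the time field as ISO8601, this would promote it to the event’s @timestamp, so downstream stores order events by log time rather than ingestion time:

# Optional addition inside the filter block, after the grok stanza
date {
  match => ["time", "ISO8601"]
}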

3. Output Stage: Use the csv output plugin to stash logs into different CSV files based on their type, storing only the fields (keys) extracted by grok during the filter stage; anything that matches no pattern goes to an error log via the file output plugin.

output {
  if [@metadata][type] == "watch" {
    csv {
      fields => ["time", "user_id", "movie_id", "watch_length"]
      path => "/home/chichenc/kafka/watch.csv"
    }
  } else if [@metadata][type] == "rate" {
    csv {
      fields => ["time", "user_id", "movie_id", "rate"]
      path => "/home/chichenc/kafka/rate.csv"
    }
  } else if [@metadata][type] == "recommendation" {
    csv {
      fields => ["time", "user_id", "hostname", "http_status_code", "response_time", "recommendation_result"]
      path => "/home/chichenc/kafka/recommendation.csv"
    }
  } else {
    file {
      codec => line { format => "%{message}" }
      path => "/home/chichenc/kafka/error.log"
    }
  }
}

After completing “logstash.conf”, we can run the following command to start the Logstash pipeline on Linux.

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f logstash.conf
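
Before running it for real, the same binary can be asked to validate the configuration syntax and exit, which catches typos in “logstash.conf” early:

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f logstash.conf --config.test_and_exit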

Sample Output

1. watch.csv: <time>,<user_id>,<movie_id>,<watch_length>
2. rate.csv: <time>,<user_id>,<movie_id>,<rate>
3. recommendation.csv: <time>,<user_id>,<hostname>,<http_status_code>,<response_time>,<recommendation_result>
4. error.log: <message that doesn't match any of the given patterns>

To summarize, Logstash is a server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite “stash.” In our movie streaming scenario, Logstash continuously pulls logs from a Kafka stream, captures the important information in each semi-structured log entry, and routes it to the appropriate CSV file.

Strengths & Limitations

Strengths

  1. User-friendly: Logstash can be set up in minutes, and it has clear documentation and straightforward configuration for various use cases.
  2. High flexibility: There are over 200 plugins, so Logstash can ingest data from a variety of sources, parse different data formats, and route data wherever you want.
  3. High performance: Given enough computing resources and memory, Logstash performs in-memory computations, so processing speed is fairly high. The user can also configure multiple workers for the Filter/Output stages to increase parallelism (a minimal tuning sketch follows this list).
  4. High observability: Logstash gives users visibility into the behavior and performance of the pipeline with a Monitoring Dashboard. The user can observe the event received rate, event latency, event emitted rate, CPU utilization, system load, etc. This makes troubleshooting more effective and efficient.
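
As a minimal tuning sketch for point 3, the worker count and batch size can be set in “logstash.yml” (the values below are arbitrary examples, not recommendations):

# logstash.yml -- example values only
pipeline.workers: 4
pipeline.batch.size: 250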

Limitations

  1. Data loss: Logstash buffers incoming data in memory by default, so if Logstash is terminated, in-flight messages are lost. Logstash does provide a persistent queue that buffers data on disk (see the sketch after this list), but data can still be lost if an abnormal shutdown occurs before the checkpoint file has been committed.
  2. Low horizontal scalability: Logstash does not support clustering in the sense of members coordinating with one another to scale pipelines up or down as demand changes. If the amount of input data to a Logstash pipeline increases drastically, that pipeline becomes the performance bottleneck, since Logstash will not rebalance the data to another pipeline automatically. Instead, the user has to create and manage multiple pipelines and put other load-balancing services in front of them.
  3. Low availability: Similar to the scalability problem, Logstash doesn’t provide a built-in high-availability solution. Logstash itself is a single running instance, which is vulnerable to server failure. Worse, the pipeline can get stuck when the volume of input data is huge, which makes Logstash unavailable. To solve this, the user has to integrate other services with Logstash into a more elaborate system architecture, which in turn becomes harder to manage.
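
As a partial mitigation for the data-loss concern in point 1, the persistent queue can be enabled in “logstash.yml”; the size below is illustrative:

# logstash.yml -- buffer events on disk instead of only in memory
queue.type: persisted
queue.max_bytes: 1gb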
