Introduction
Apache Kafka’s traditional storage model is effective for high-performance data access, but businesses operating large-scale deployments face considerable hurdles. As data volumes continue to expand, the need for more sophisticated storage solutions has become increasingly apparent. This blog examines Kafka tiered storage, a long-awaited feature that shipped in early access with Apache Kafka 3.6 in late 2023 and addresses the fundamental challenges of scalability and cost-effectiveness in enterprise Kafka deployments.
TL;DR
Kafka tiered storage is a feature that addresses scalability and cost challenges in large-scale Kafka deployments. It works by:
- Storing recent, frequently accessed data on local disks (hot tier)
- Moving older, less accessed data to cheaper remote storage, like S3 (cold tier)
Key benefits include:
- Improved scalability without proportionally increasing broker count
- Significant cost reduction for long-term data storage
Tiered storage is handy for scenarios requiring long-term data retention, such as financial services, e-commerce, and log analytics. Implementation involves configuring brokers, setting up remote storage, and defining data migration policies between tiers.
Apache Kafka has become the heart of modern data architectures, serving as a distributed event streaming platform for high-throughput, fault-tolerant data pipelines. It plays a crucial role in handling large-scale data processing and real-time analytics across various industries.
Traditional Kafka Data Storage
Traditionally, Kafka stored all data on brokers’ local disks. This approach ensured high performance and low latency for data access, which are critical features of Kafka’s architecture.
Challenges of the Traditional Approach:
As organizations increasingly relied on Kafka for their data needs, several challenges emerged with the traditional storage model:
- Scalability Issues: Growing data volumes required continual addition of brokers to increase storage capacity.
- Data Retention Trade-offs: Difficult decisions between keeping historical data and managing practical storage limitations.
- Cost Inefficiency: Storing all data on high-performance local disks was expensive, especially for infrequently accessed data.
- Operational Overhead: Managing large clusters with extensive local storage required significant effort and expertise.
As organizations increasingly rely on Kafka for their data needs, the demand for more efficient and cost-effective storage solutions has grown. This is where Kafka tiered storage comes into play, offering an intuitive approach to data management within Kafka ecosystems.
Kafka Tiered Storage
Kafka Tiered Storage was first proposed in KIP-405 in December 2018, and, after numerous iterations and enhancements, it shipped in early access with Apache Kafka 3.6 in late 2023. The Kafka community developed this feature to offer users a more flexible and cost-effective storage solution.
Storage Tiers:
Kafka tiered storage introduces a two-tiered approach to data storage:
- Hot Tier: This tier uses local disks on the Kafka brokers to store recently and frequently accessed data. It provides the high performance that Kafka is known for.
- Cold Tier: This tier uses remote object storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage) for older, less frequently accessed data.
Use Cases for Hot Tier vs. Cold Tier:
- Hot Tier: Ideal for recent data that requires high-performance access, such as real-time analytics or current transaction processing.
- Cold Tier: Suitable for historical data, long-term analytics, compliance requirements, and scenarios where immediate access is not critical.
Data lifecycle:
The data lifecycle in Kafka Tiered Storage is as follows:
- New data is written to the hot tier, just as in traditional Kafka.
- Periodically, data moves from the hot tier to the cold tier based on configurable policies (e.g., age of data or segment size).
- Depending on its storage location, the system retrieves data from either the hot or cold tier when consumers request it.
- The brokers maintain metadata about the location of data segments, enabling them to route read requests to the appropriate tier.
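The lifecycle above can be sketched in a few lines of Python. This is an illustrative model, not Kafka's actual internals: the `Segment` class and `tier_for_offset` function are hypothetical names standing in for the segment metadata a broker keeps and the routing decision it makes when serving a fetch.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int   # first offset in the segment
    end_offset: int    # last offset in the segment
    tier: str          # "hot" (local disk) or "cold" (remote storage)

# Broker-side metadata: which tier currently holds each segment.
segments = [
    Segment(0,    999,  "cold"),   # oldest data, already offloaded
    Segment(1000, 1999, "cold"),
    Segment(2000, 2999, "hot"),    # recent data, still on local disk
]

def tier_for_offset(offset: int) -> str:
    """Return the tier a consumer fetch for `offset` would be served from."""
    for seg in segments:
        if seg.base_offset <= offset <= seg.end_offset:
            return seg.tier
    raise KeyError(f"offset {offset} not found in any segment")
```

A fetch for offset 500 would be served from remote storage, while a fetch for offset 2500 would be served from local disk; the consumer API looks identical in both cases.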
Implementing Kafka Tiered Storage
Let’s see some code snippets to understand how to configure and use Kafka tiered storage.
Configuring Tiered Storage
To enable tiered storage, you modify the broker configuration. Note that Apache Kafka does not ship a built-in S3 backend: the cold tier is accessed through a pluggable RemoteStorageManager interface, so the S3-specific settings (bucket, region, credentials) are supplied by whichever plugin you install. Here’s a sketch using the broker properties introduced by KIP-405 (Kafka 3.6+), with the plugin class left as a placeholder.
# Enable tiered storage (Kafka 3.6+)
remote.log.storage.system.enable=true
# Plug in a RemoteStorageManager implementation for the cold tier
# (Kafka has no built-in S3 backend; supply your plugin's class and path)
remote.log.storage.manager.class.name=YOUR_REMOTE_STORAGE_MANAGER_CLASS
remote.log.storage.manager.class.path=/path/to/plugin/libs/*
# S3 settings such as bucket name, region, and IAM role or access keys
# go in the plugin's own prefixed properties; see its documentation
# Broker-wide defaults for how much data stays on local disk
# (24 hours / 100 GiB)
log.local.retention.ms=86400000
log.local.retention.bytes=107374182400
In this configuration:
- We enable tiered storage cluster-wide and point Kafka at a RemoteStorageManager plugin, which performs the actual S3 reads and writes.
- Bucket, region, and authentication (an IAM role or, alternatively, access keys) are configured through the plugin’s own properties rather than core Kafka settings.
- We define broker-wide local retention defaults: roughly the most recent 24 hours (or 100 GiB) of data is kept on local disk, and older segments are served from the cold tier.
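The tiering policy above boils down to a simple either/or condition. The sketch below expresses it as a hypothetical `should_offload` check (not a Kafka internal): a segment is a candidate for removal from local disk once it is older than the local retention time or once local usage exceeds the local retention size.

```python
# Mirror of the local retention settings in the broker config above.
LOCAL_RETENTION_MS = 86_400_000          # 24 hours
LOCAL_RETENTION_BYTES = 100 * 1024**3    # 100 GiB

def should_offload(segment_age_ms: int, local_bytes_used: int) -> bool:
    """True once a segment no longer needs to stay on local disk."""
    return (segment_age_ms > LOCAL_RETENTION_MS
            or local_bytes_used > LOCAL_RETENTION_BYTES)
```

Either trigger alone is enough: a day-old segment is offloaded even on a half-empty disk, and a nearly full disk sheds segments even if they are only hours old.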
Creating a Topic with Tiered Storage
When creating a new topic, you can enable tiered storage and set specific retention policies.
kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic TOPIC_NAME \
  --partitions 3 \
  --replication-factor 3 \
  --config remote.storage.enable=true \
  --config retention.ms=2592000000 \
  --config local.retention.ms=86400000
This command creates a topic with tiered storage enabled, setting a total retention of 30 days, with the most recent 24 hours kept in the hot tier.
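The millisecond values in the command are easy to get wrong, so it's worth sanity-checking the arithmetic:

```python
DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day

retention_ms = 30 * DAY_MS        # total retention: 30 days
local_retention_ms = 1 * DAY_MS   # hot-tier retention: 24 hours

# These match the values passed to kafka-topics.sh above:
# retention.ms=2592000000 and local.retention.ms=86400000
```

Deriving the values this way (rather than typing raw digits) avoids off-by-a-zero mistakes in retention settings.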
Considerations and Best Practices
When implementing tiered storage, keep these things in mind:
- Topic Configuration: Always explicitly configure tiered storage settings for each topic rather than relying solely on broker-wide defaults.
- Latency-Aware Consumers: Design your consumers to tolerate the higher latency of reads served from the cold tier.
- Testing: Thoroughly test your tiered storage setup, including failover scenarios and recovery processes.
- Security: Implement appropriate access controls and use encryption both at rest and in transit to properly secure your remote storage.
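On the latency point: one simple pattern is to retry fetches with a growing timeout rather than failing fast, since a read that lands on the cold tier can take noticeably longer than a hot-tier read. This is a generic sketch, not a real client API: `fetch` stands in for whatever poll/fetch call your Kafka client library exposes.

```python
import time

def fetch_with_backoff(fetch, offset, attempts=4, base_timeout_s=1.0):
    """Retry `fetch(offset, timeout=...)`, doubling the timeout each attempt.

    Cold-tier reads may need more time than hot-tier reads, so instead of
    treating the first timeout as an error, we back off and allow longer.
    """
    timeout = base_timeout_s
    for attempt in range(attempts):
        try:
            return fetch(offset, timeout=timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise                  # out of attempts: surface the error
            time.sleep(timeout)        # simple backoff before retrying
            timeout *= 2               # allow more time for a cold read
```

The same idea applies whatever client you use: treat a slow first response from historical offsets as expected behavior, not a failure.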
Conclusion
Kafka tiered storage represents a significant evolution in the Kafka ecosystem, addressing key challenges of scalability, cost-efficiency, and data retention. By leveraging both local and remote storage, organizations can build more flexible, scalable, and cost-effective event-streaming architectures. As data volumes continue to grow exponentially, tiered storage is poised to become an essential feature for many Kafka deployments, enabling businesses to extract more value from their data while optimizing their infrastructure costs.