8 posts tagged with "tech"

A Deep Dive into a Production OOMKilled Alert: The Full Story from JVM Memory to a Code Vulnerability

· 11 min read

"In the complex world of distributed systems, every seemingly harmless alert can be a thread leading to the core of a problem."

Introduction: A New Challenge

A few months ago, driven by a passion for backend engineering, I joined my company's chatbot team to work on a Java project called bot-gateway. Interestingly, the service had been unstable for months, with Kubernetes Pods restarting intermittently every day; because of a busy feature-development schedule, though, the issue kept getting postponed. As a newcomer to Java, after two months of shipping business features I finally got the chance to join an "Engineering Excellence Sprint," and, fueled by curiosity and a love for solving technical puzzles, I volunteered to take a deep dive into this problem.

The investigation was challenging, but the final outcome was incredibly satisfying. I can say that this journey took me through everything from business code to the JVM memory model, from K8s pod management to live production monitoring. I'm writing down the entire process here to share it with you all.

Part I: The Abrupt Pod Restarts

It all began on a seemingly calm night when our alerting system was suddenly triggered:

[FIRING:1] Container has been restarted. Reason: OOMKilled

This alert came from one of our core services, bot-gateway.

[Figure: OOMKilled alert in Slack (eks-prod-help-center-Channel-Just-Eat-Takeaway-com-Slack-09-02-2025_04_59_PM.png)]

The term OOMKilled spells trouble for any SRE or developer. After some research, I understood it meant a container was ruthlessly terminated by its host (a Kubernetes node) for running out of memory. What was more frustrating was how sudden and violent it was. The application layer had no time to leave any meaningful "last words"—no application logs, and only a cryptic message in the container logs:

Task ... ran out of memory
... deleted with exit code 137

exit code 137 signifies that the process was terminated by a SIGKILL signal.

[Figure: Datadog Event Management, All Events view showing the restarts (Event-Management-All-Events-Datadog-09-02-2025_04_12_PM.png)]

The service was restarting again and again, impacting user experience and putting immense pressure on the team. And so, my deep-dive investigation began.

Part II: Laying the Theoretical Groundwork

I wasn't deeply familiar with the JVM before this, so this was a perfect opportunity to solidify my foundational knowledge.

Distinguishing between these two core concepts was crucial:

OOMKilled vs java.lang.OutOfMemoryError

  • OutOfMemoryError (OOM Error): This is the JVM's "internal conflict." The JVM realizes its own Heap Memory is exhausted and proactively throws an exception. It's a relatively "graceful" way to fail.
  • OOMKilled: This is the container's "external conflict." The container's total memory footprint (Heap + Non-Heap + Native Memory, etc.) exceeds its configured limit (limits.memory). To protect the stability of the entire node, the Linux kernel's OOM killer acts like an unforgiving city warden and forcibly evicts the rogue process with SIGKILL; Kubernetes then reports the container as OOMKilled.

A Brief on the JVM Memory Model

  • Heap Memory: Stores all object instances created with new. It's divided into a Young Generation and an Old Generation. Objects are born in the Young Gen and are promoted to the Old Gen if they survive multiple Garbage Collection (GC) cycles. A continuously growing Old Gen usually indicates a memory leak.
  • Non-Heap Memory: In Java 8+, this primarily refers to Metaspace, which stores class definitions, methods, and other metadata. A continuously growing Metaspace often signals a ClassLoader Leak.
  • (TODO: Link to a separate JVM memory blog; a short sketch of inspecting these regions at runtime follows below.)
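To make these regions concrete, the standard java.lang.management API exposes them at runtime; this is also where dashboards like the one in Part III ultimately get their numbers, via JMX. The following is a minimal, illustrative sketch (the class name and output format are mine, not from the project) that prints heap, non-heap, and per-pool usage:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Illustrative sketch: print heap vs. non-heap usage plus the individual
// memory pools (e.g. Old Gen, Metaspace) described above.
public class MemoryRegions {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        System.out.printf("Heap:     used=%d MB, max=%d MB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);
        System.out.printf("Non-heap: used=%d MB%n", nonHeap.getUsed() >> 20);

        // Pool names depend on the garbage collector, e.g. "G1 Old Gen", "Metaspace".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.printf("%-25s used=%d MB%n",
                    pool.getName(), pool.getUsage().getUsed() >> 20);
        }
    }
}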

Of course, at the beginning I had no idea whether this was a container configuration issue (e.g., the app simply needed more than the 1GB of memory it was allocated) or a bug in our application code. I had no choice but to wade in and find out.

Part III: The Detective Work

After consulting with various AI teachers (ChatGPT, Gemini, DeepSeek), I devised a plan: analyze monitoring metrics, inspect a memory snapshot (Heap Dump), and trace error logs.

1. The Abnormal ECG on the Metrics Dashboard

First, I opened our JVM Metrics dashboard. The chart before the fix was shocking:

[Figure: JVM Metrics dashboard in Datadog, before the fix (JVM-Metrics-Datadog-before.png)]

The most glaring anomaly was the GC Old Gen Size chart. It clearly showed the memory usage of the Old Generation on a relentless, upward climb that never decreased. This strongly suggested that a large number of objects were being improperly held long-term and couldn't be collected by GC. The heap usage consistently stayed around 500MB, which seemed odd for a stateless gateway service that keeps most of its data in a Redis cache.

2. The "Crime Scene" Inside the Heap Dump

To unmask these "deadbeat" objects, we needed a heap dump. At the same time, to keep the container runtime from killing the pod outright and instead give the application a chance to throw a gentler OutOfMemoryError, I tried adding various JVM flags, a bit of JVM tuning:

<jvmFlags>
  <jvmFlag>-server</jvmFlag>
  <jvmFlag>-XX:MinRAMPercentage=40</jvmFlag>
  <jvmFlag>-XX:MaxRAMPercentage=60</jvmFlag>
  <jvmFlag>-XX:MaxDirectMemorySize=192m</jvmFlag>
  <jvmFlag>-Xss512k</jvmFlag>
  <jvmFlag>-XX:NativeMemoryTracking=summary</jvmFlag>
  <jvmFlag>-XX:+UnlockDiagnosticVMOptions</jvmFlag>
  <jvmFlag>-XX:+HeapDumpOnOutOfMemoryError</jvmFlag>
  <jvmFlag>-XX:HeapDumpPath=/dumps</jvmFlag>
  <jvmFlag>-Dio.netty.leakDetection.level=PARANOID</jvmFlag>
  <jvmFlag>-javaagent:/library/dd-java-agent.jar</jvmFlag>
  <jvmFlag>-Ddd.jmxfetch.enabled=true</jvmFlag>
  <jvmFlag>-Ddd.jmxfetch.statsd.enabled=true</jvmFlag>
</jvmFlags>

The purpose of these flags included:

  • Limiting the JVM memory range so it would throw an OutOfMemoryError exception when exhausted, preserving the crime scene's stack trace.
  • Setting the thread stack size to 512k (since we don't have overly complex logic).
  • Capping Direct Memory at 192M.
  • Enabling Native Memory Tracking (summary mode) so native usage could be baselined and diffed with jcmd VM.native_memory.
  • Configuring HeapDumpOnOutOfMemoryError to save the heap dump to a K8s container volume.
  • Enabling the Datadog agent's jmxfetch.

We captured a heap snapshot (.hprof file) via the Spring Actuator /heapdump endpoint. A little side story here: I used to have exec access to K8s pods, which made it easy to shell in, run JDK commands, and download dumps. However, for security reasons, the SRE team had recently revoked this permission, leaving kubectl debug as the only option. So, I had to pick a pod that looked like it was on the verge of crashing, use kubectl to forward its port 8080 to my local machine, and then hit localhost:8080/heapdump to download the file.
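As an aside, on HotSpot JVMs the Actuator /heapdump endpoint is essentially a thin wrapper over the HotSpotDiagnosticMXBean. If you ever need a dump without the endpoint, or want to write one straight to the mounted /dumps volume, a minimal sketch (the class name and file name are illustrative, not part of our codebase) looks like this:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Illustrative sketch: trigger a heap dump from inside the application,
// using the same MXBean the Actuator /heapdump endpoint relies on.
public class HeapDumper {
    public static void main(String[] args) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true dumps only reachable objects, which keeps the .hprof smaller.
        diagnostics.dumpHeap("/dumps/manual-dump.hprof", true);
    }
}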

Opening it with the Memory Analyzer Tool (MAT) was revealing:

  1. Suspect A: Giant byte[] Arrays & Netty. The Dominator Tree view showed several abnormally large byte[] arrays. Tracing their references via Path to GC Roots, we found they all pointed back to reactor.netty's memory pool components (PoolChunk). This led us to initially suspect a Netty buffer leak, prompting the addition of the -Dio.netty.leakDetection.level=PARANOID JVM flag to catch any unreleased buffer allocations in the logs (a short sketch of this switch follows after this list).

    [Figure: MAT Dominator Tree view of the heap dump (heapdump_hprof_dominator_tree.png)]

    [Figure: Path to GC Roots for one of the large byte[] arrays (heapdump_byte_array_path_to_GC.png)]

  2. Suspect B: The Peculiar DatadogClassLoader. The Leak Suspects report pointed to another problem: a large number of java.util.zip.ZipFile$Source instances (open JAR file handles) and java.lang.Class objects were being held by a classloader named DatadogClassLoader. This not only explained why our Non-Heap memory was slowly growing but also uncovered a chronic resource leak caused by our monitoring agent.
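Coming back to Suspect A for a moment: the -Dio.netty.leakDetection.level=PARANOID property makes Netty track buffer allocations and log a stack trace whenever a ByteBuf is garbage-collected without release() having been called. The same switch can also be flipped in code at startup; the one-liner below is purely illustrative (we used the JVM flag), shown only to make the mechanism concrete:

import io.netty.util.ResourceLeakDetector;

// Illustrative sketch: programmatic equivalent of
// -Dio.netty.leakDetection.level=PARANOID. PARANOID tracks every buffer
// allocation, so it is a debugging aid, not a steady-state production setting.
public class LeakDetectionConfig {
    public static void enableParanoidLeakDetection() {
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}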

3. The Decisive Error Log (The Smoking Gun)

With the JVM flags configured, I started monitoring. A day and night passed with no specific logs, but I noticed the pod restart frequency had significantly decreased. Just as I was feeling lost, I tried to dump another heap profile via the Spring Actuator endpoint. As luck would have it, the pod must have been truly on the edge this time. The endpoint returned a 500 error, with the reason being java.lang.OutOfMemoryError: Java heap space.

[Figure: the /heapdump endpoint returning a 500 with java.lang.OutOfMemoryError (heapdump-OOMError.png)]

This was interesting. So it could throw an OutOfMemoryError after all. I immediately broadened my log search for "java.lang.OutOfMemoryError" and, to my surprise, found several instances buried in a pile of error logs. Aside from the one triggered by the /heapdump endpoint, all the others came from another piece of logic:

An internal JVM OOM Error:

java.lang.OutOfMemoryError: Java heap space
at java.desktop/java.awt.image.DataBufferByte.<init>(DataBufferByte.java:93)
...
at javax.imageio.ImageIO.read(ImageIO.java:1466)
...
at com.justeattakeaway.botgateway.service.evidence.validators.impl.ImageValidator.readImage(ImageValidator.java:75)

The case was cracked! This stack trace was a beam of light that illuminated the entire problem. The error was happening in our ImageValidator service! I quickly reviewed the code. We have a feature that allows users to upload images of food-related issues for customer service. During the upload, the backend performs validation (e.g., image dimensions, file size, format), which relies on ImageValidator. Internally, it was calling ImageIO.read(), a method that loads the entire, uncompressed pixel data of the image into heap memory.

This meant we were holding the complete user-uploaded image in memory, which is incredibly resource-intensive over time. These large objects couldn't be GC'd and became "tenants for life" in the Old Gen. An even more severe security risk was that a malicious user could upload an "image bomb"—a file that is small in size but has an extremely high resolution—to instantly exhaust all memory. For example, a 20,000x20,000 pixel image requires about 1.6 GB of heap space! This was a critical code vulnerability.
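To make the failure mode concrete, here is a schematic of the risky pattern; this is not the actual ImageValidator code, and the 4096-pixel threshold is only an illustrative number. The point is that the dimension check can only run after ImageIO.read() has already materialized the full pixel buffer, so a tiny but extremely high-resolution file defeats it:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Schematic of the risky pattern: the whole image is decoded into heap
// memory before any dimension check can happen.
public class NaiveImageCheck {

    static boolean isAcceptable(InputStream upload) throws IOException {
        // ImageIO.read() materializes roughly width * height * bytesPerPixel on the heap.
        BufferedImage image = ImageIO.read(upload);
        return image != null
                && image.getWidth() <= 4096
                && image.getHeight() <= 4096; // the check arrives too late
    }

    public static void main(String[] args) {
        // Back-of-the-envelope cost of an "image bomb":
        long width = 20_000, height = 20_000, bytesPerPixel = 4; // e.g. 32-bit ARGB
        long decodedBytes = width * height * bytesPerPixel;      // ~1.6 GB
        System.out.printf("Decoded size: %.1f GB%n", decodedBytes / 1e9);
    }
}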

Part IV: The Fix

We now had a complete picture of the problem:

  1. Chronic Illness: A ClassLoader leak from the Datadog Agent was slowly raising the Non-Heap memory baseline.
  2. Complication: A potential buffer leak in Netty was adding pressure to the Heap.
  3. The Trigger: The dangerous image handling logic in ImageValidator was the final straw that broke the camel's back.

Our solution had to be a targeted, three-pronged attack:

  1. The Critical Patch (The Cure): Refactor ImageValidator. We abandoned the direct call to ImageIO.read() and switched to the safer ImageReader API to read the image's dimensions (metadata) before fully decoding it. If the dimensions exceeded a preset safety threshold, the image was rejected immediately (see the sketch after this list).
  2. Process Re-engineering (Strengthening the Foundation): We refactored the entire file-handling flow from a byte[]-based approach to an InputStream-based, streaming approach, reading data in 8KB chunks. If the size exceeded the limit, the upload was rejected at the source, avoiding the problem of loading large files into memory all at once.
  3. Long-term Governance (Eradicating the Disease): We identified our Datadog Java Tracer version (1.39.0) and planned an upgrade. Upgrading to the latest version would resolve the known ClassLoader leak and other minor logging errors simultaneously.
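Here is a minimal sketch of points 1 and 2 above, assuming illustrative limits; the class name and the MAX_* constants are placeholders rather than our production values. The key idea is that ImageIO.getImageReaders() lets you read width and height from the image header without decoding a single pixel, and the streaming check aborts as soon as the upload crosses the size cap:

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Illustrative sketch of the refactored checks; the limits below are
// placeholders, not the real configuration.
public class SafeImageChecks {

    static final int MAX_WIDTH = 4096;
    static final int MAX_HEIGHT = 4096;
    static final long MAX_BYTES = 5L * 1024 * 1024; // 5 MB upload cap

    // 1) Read only the header metadata and reject oversized dimensions
    //    before any pixel data is decoded into the heap.
    static boolean hasSafeDimensions(InputStream upload) throws IOException {
        try (ImageInputStream iis = ImageIO.createImageInputStream(upload)) {
            if (iis == null) {
                return false; // the stream could not be wrapped
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return false; // not a recognizable image format
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis, true);
                return reader.getWidth(0) <= MAX_WIDTH
                        && reader.getHeight(0) <= MAX_HEIGHT;
            } finally {
                reader.dispose();
            }
        }
    }

    // 2) Stream the body in 8 KB chunks and abort as soon as the cumulative
    //    size crosses the limit, instead of buffering the whole byte[].
    static boolean isWithinSizeLimit(InputStream upload) throws IOException {
        byte[] chunk = new byte[8 * 1024];
        long total = 0;
        int read;
        while ((read = upload.read(chunk)) != -1) {
            total += read;
            if (total > MAX_BYTES) {
                return false; // reject without reading the rest
            }
        }
        return true;
    }
}

Ordering matters here: running the size and dimension checks before any full decode means an oversized or over-dimensioned file is rejected after reading at most a few kilobytes of header and body.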

Part V: Calm After the Storm

After deploying the fix, the results were immediate and dramatic. The JVM Metrics dashboard showed an unprecedented level of stability:

[Figure: JVM Metrics dashboard in Datadog, after the fix (JVM-Metrics-Datadog-09-02-2025_04_04_PM.png)]

  • Heap Usage no longer had fatal spikes and settled into a healthy, periodic pattern, dropping from 500MB to a stable 150MB.
  • Old Gen Size stopped its relentless growth, stabilizing at a reasonable 100MB (down from 400MB).
  • New Gen Size dropped from 90MB to 30MB.
  • The OOMKilled alerts fell silent.

Final Chapter: Lessons and Reflections

This thrilling investigation (during which my carefully configured JVM flags were rolled back multiple times due to other production incidents) left us with invaluable lessons:

  1. Everything is Connected: Complex production issues are often the result of multiple, seemingly unrelated factors (application vulnerabilities, dependency leaks, framework usage) compounding each other.
  2. Theory Guides Practice: A deep understanding of foundational knowledge like the JVM memory model is the key to correctly interpreting monitoring data and memory snapshots.
  3. Tools are a Detective's Eyes: Proficiency with monitoring systems (Metrics), memory analyzers (MAT), and built-in diagnostic tools is fundamental to efficient problem-solving.
  4. Defensive Programming: Never trust user input. Implementing strict, memory-safe validation for untrusted data (like uploaded files) is a lifeline for system robustness.

This experience not only solved a tricky production issue but also deepened my understanding of the JVM, containerization, and distributed systems. Every production problem is a precious learning opportunity. It's through these challenges that we grow into better engineers—better Builders, Solvers, and Collaborators.

Uber's High-Performance Web App Optimization in Practice

· 8 min read

Original article - Building m.uber: ENGINEERING A HIGH-PERFORMANCE WEB APP FOR THE GLOBAL MARKET

Performance matters on mobile.

Another hands-on piece about performance optimization.

The m.uber team did some performance optimization work on m.uber, their ultra-lightweight web app.

The scope is comprehensive, covering everything from code to bundling to deployment to caching.

TL;DR

Performance Tools

Starting from Ten React Mini Design Patterns

· 15 min read

An article I had been reading for a long time, 10 React Mini Patterns; while building the Creator project, I finally gave it a close read.

Combined with my own experience from developing that project, I took some notes.

The Creator project is a multi-platform (Web + Mobile) React SPA with a fair amount of form filling and some complex interactive components.

I rolled my own very simple Store based on Node's EventEmitter and learned a lot in the process; those details deserve their own write-up later. When the product team added a "pin to top" feature, the EventEmitter logic, with its two-way data flow, became too messy, so I bit the bullet and spent the time upgrading to a Redux + Immutable.js + Normalizr stack, which turned out to be much less of a headache.

The original author's point: if you write React every day, you'll notice that the techniques you keep reaching for to implement features are always the same handful. Broadly speaking, these are design patterns of everyday development. Here we call them Mini Patterns.

Fun with Codemod & AST

· 14 min read

TL;DR

  • Facebook built Codemod, an AST-based tool, to tackle migrations of large codebases
  • On top of Codemod, they built jscodeshift and React-codemod, tools dedicated to JavaScript code migrations
  • Understanding the principles behind these tools helps you grow from a mere "API consumer" into an engineer-like "creator"
  • Demo Time! Let's write a codemod
  • Some valuable references

SEO in Mobile Environments

· 7 min read

It really has been a very long time since I tended to this blog; the previous post was published two years ago. After graduating and starting work, I joined Douban, and recently I did some research on SEO in mobile environments (mainly mobile browsers and WeChat), so I'm sharing it here.

Enjoy.

Thoughts on Alibaba's Campus Recruitment Written Test

· 6 min read

Last night, at the last possible moment, I took Alibaba's online written test for this year's interns. Thinking back to my earlier, disastrous referral interview: after idling away an entire winter break, I got an unexpected interview call and couldn't even answer the basic questions I used to know reasonably well in one smooth go; I knew right then it was a disaster.

Learning from that pain, as a fourth-year veteran about to graduate (in age only...), I kept studying and thinking, undeterred, through a dark stretch of being looked down on, gossiped about, and surrounded by uncertainty; I almost moved myself.

The written test was just one hour and 13 questions. It opened with single-choice and multiple-choice questions covering the AMD module convention, closures, the asynchronous behavior of setTimeout, front-end security, and a few small questions I found quite interesting. They tested fundamentals; not too hard, but they demanded care, and somehow I still worked through them at a leisurely pace until half the time was gone. Then came six big free-response questions, essentially programming, touching CSS3, vanilla JS, event handling, Ajax, and so on. I gradually panicked, my code-writing hands all but froze, and my composure clearly needs work.

So, with roughly half of the big questions unfinished, I had to submit. Only after submitting did inspiration strike and the solutions come rushing back; I simply lack coding experience. To keep this from happening again, I'm writing up the questions, my thought process, and the solutions here.

Written After the Baidu Phone Interview

· 7 min read

Desperado.

road to the sky

On a groggy, half-asleep morning, I suddenly got the call, and somewhere between stunned and simply dazed, the Baidu interview was over just like that.


"Let's start with a quick self-introduction!" A self-introduction? God, isn't that all in my resume? Wait, what did I even write? "Uh, my name is XXX, I'm a fourth-year student in Electronic Information Engineering at UESTC... uh, I like the internet and internet development... I've been teaching myself lately... um." "Is that it?" "That's it." At that moment all I could think was: it's over. My mind went blank.

Oh My Ghostium

· 8 min read

A while back I installed WordPress on a BandwagonHost VPS in Arizona, on the other side of the ocean, and then never touched it again; it has been about half a year.
Looking back, the first reason is surely my own laziness and restlessness: I could never settle down to write. The second is that the blog's pages were never pretty enough to make me want to write the moment I opened them.
But then they say "why you should (start right now and) write a blog", so I followed in the footsteps of the experts: even without any impressive techniques of my own to share, reading what the experts write and jotting down my own takeaways is progress too. And so began the bottomless abyss of tinkering...
While searching Google for a platform, I came across the independent blog of Luo Lei, a fellow F2E, and was instantly drawn in by its big banner design. After browsing for a while I scrolled to the bottom of the page and caught a single line: this blog is powered by the cool and stylish GHOST.
After comparing Ghost, Jekyll, Octo, Hexo, and the rest, I decided Ghost suited me best:

  1. It's a blogging platform built on Node.js.
  2. Just a blogging platform. Simple and clean, with a responsive design.
  3. Free, with full customization support.

DigitalOcean thoughtfully offers a one-click Ghost application image; after installing it, just open http://your.domain/2368 and you'll see the first-login screen. Set the blog title, username, and password, and you can happily start exploring Ghost.
A few days later I couldn't help clicking through a lot of Ghost-based blogs and noticed that everyone's looked the same; the default Casper theme had already worn out its welcome.
So, with barely a few posts written, it was back into the bottomless abyss of tinkering.