
A Deep Dive into a Production OOMKilled Alert: The Full Story from JVM Memory to a Code Vulnerability

· 11 min read

"In the complex world of distributed systems, every seemingly harmless alert can be a thread leading to the core of a problem."

Introduction: A New Challenge

A few months ago, driven by a passion for backend engineering, I joined my company's chatbot team to work on a Java project called bot-gateway. As it happened, the service had been unstable for months, with Kubernetes Pods restarting intermittently every day, but because of a busy feature development schedule the issue kept getting postponed. Then, as a newcomer to Java with just two months of shipping business features behind me, I finally got the chance to join an "Engineering Excellence Sprint." Fueled by curiosity and a love for solving technical puzzles, I volunteered to take a deep dive into the problem.

The investigation was challenging, but the final outcome was incredibly satisfying. I can say that this journey took me through everything from business code to the JVM memory model, from K8s pod management to live production monitoring. I'm writing down the entire process here to share it with you all.

Part I: The Abrupt Pod Restarts

It all began on a seemingly calm night when our alerting system was suddenly triggered:

[FIRING:1] Container has been restarted. Reason: OOMKilled

This alert came from one of our core services, bot-gateway.

eks-prod-help-center-Channel-Just-Eat-Takeaway-com-Slack-09-02-2025_04_59_PM.png

The term OOMKilled spells trouble for any SRE or developer. After some research, I understood it meant a container was ruthlessly terminated by its host (a Kubernetes node) for running out of memory. What was more frustrating was how sudden and violent it was. The application layer had no time to leave any meaningful "last words"—no application logs, and only a cryptic message in the container logs:

Task ... ran out of memory
... deleted with exit code 137

Exit code 137 signifies that the process was terminated by a SIGKILL signal (137 = 128 + 9, the signal number of SIGKILL).

Event-Management-All-Events-Datadog-09-02-2025_04_12_PM.png

The service was restarting again and again, impacting user experience and putting immense pressure on the team. And so, my deep-dive investigation began.

Part II: Laying the Theoretical Groundwork

I wasn't deeply familiar with the JVM before this, so this was a perfect opportunity to solidify my foundational knowledge.

Distinguishing between these two core concepts was crucial:

OOMKilled vs java.lang.OutOfMemoryError

  • OutOfMemoryError (OOM Error): This is the JVM's "internal conflict." The JVM realizes its own Heap Memory is exhausted and proactively throws an exception. It's a relatively "graceful" way to fail.
  • OOMKilled: This is the container's "external conflict." The total memory footprint of the container (Heap + Non-Heap + Native Memory, etc.) exceeds its configured limit (limits.memory), and to protect the stability of the entire node, the kernel's OOM killer acts like an unforgiving city warden, forcibly evicting the rogue process (see the small sketch below).
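
To make the first failure mode concrete, here is a minimal, illustrative Java snippet (not from bot-gateway; the class name and the suggested -Xmx value are my own) that exhausts the heap and triggers the JVM's "internal" OutOfMemoryError:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: run with a small heap (e.g. -Xmx64m) to see the JVM's "internal conflict".
// An OOMKilled event looks different: there is no exception and no stack trace, because the
// kernel simply sends SIGKILL once the whole container crosses limits.memory.
public class HeapExhaustion {
    public static void main(String[] args) {
        List<byte[]> hoard = new ArrayList<>();
        while (true) {
            // Keep every 1 MB block reachable so the GC cannot reclaim anything;
            // eventually this throws java.lang.OutOfMemoryError: Java heap space.
            hoard.add(new byte[1024 * 1024]);
        }
    }
}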

A Brief on the JVM Memory Model

  • Heap Memory: Stores all object instances created with new. It's divided into a Young Generation and an Old Generation. Objects are born in the Young Gen and are promoted to the Old Gen if they survive multiple Garbage Collection (GC) cycles. A continuously growing Old Gen usually indicates a memory leak.
  • Non-Heap Memory: In Java 8+, this primarily refers to Metaspace, which stores class definitions, methods, and other metadata. A continuously growing Metaspace often signals a ClassLoader Leak.
  • (TODO: Link to a separate JVM memory blog)

Of course, at the beginning, I had no idea whether this was a container configuration issue (e.g., the app simply needed more than the 1GB of memory it was allocated) or a bug in our application code. I had no choice but to wade in and feel my way forward.

Part III: The Detective Work

After consulting with various AI teachers (ChatGPT, Gemini, DeepSeek), I devised a plan: analyze monitoring metrics, inspect a memory snapshot (Heap Dump), and trace error logs.

1. The Abnormal ECG on the Metrics Dashboard

First, I opened our JVM Metrics dashboard. The chart before the fix was shocking:

JVM-Metrics-Datadog-before.png

The most glaring anomaly was the GC Old Gen Size chart. It clearly showed the memory usage of the Old Generation on a relentless, upward climb that never decreased. This strongly suggested that a large number of objects were being improperly held long-term and couldn't be collected by GC. The heap usage consistently stayed around 500MB, which seemed odd for a stateless gateway service that keeps most of its data in a Redis cache.
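
As a cross-check on the dashboard (purely a sketch, not something the original investigation describes doing), the same pool sizes can be read from inside the JVM with the standard MemoryPoolMXBean API; the class name is made up, and the pool names depend on the garbage collector in use:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints the current usage of every heap and non-heap memory pool the JVM exposes.
public class MemoryPoolProbe {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            // Pool names vary by collector: "G1 Old Gen", "PS Old Gen", "Metaspace", etc.
            System.out.printf("%-20s used=%,d bytes max=%,d bytes%n",
                    pool.getName(), usage.getUsed(), usage.getMax());
        }
    }
}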

2. The "Crime Scene" Inside the Heap Dump

To unmask these "deadbeat" objects, we needed a heap dump. At the same time, to keep the process from being killed outright at the container level and instead have the JVM throw a more graceful OutOfMemoryError, I tried adding various JVM flags, a bit of JVM tuning:

<jvmFlags>
  <jvmFlag>-server</jvmFlag>
  <jvmFlag>-XX:MinRAMPercentage=40</jvmFlag>
  <jvmFlag>-XX:MaxRAMPercentage=60</jvmFlag>
  <jvmFlag>-XX:MaxDirectMemorySize=192m</jvmFlag>
  <jvmFlag>-Xss512k</jvmFlag>
  <jvmFlag>-XX:NativeMemoryTracking=summary</jvmFlag>
  <jvmFlag>-XX:+UnlockDiagnosticVMOptions</jvmFlag>
  <jvmFlag>-XX:+HeapDumpOnOutOfMemoryError</jvmFlag>
  <jvmFlag>-XX:HeapDumpPath=/dumps</jvmFlag>
  <jvmFlag>-Dio.netty.leakDetection.level=PARANOID</jvmFlag>
  <jvmFlag>-javaagent:/library/dd-java-agent.jar</jvmFlag>
  <jvmFlag>-Ddd.jmxfetch.enabled=true</jvmFlag>
  <jvmFlag>-Ddd.jmxfetch.statsd.enabled=true</jvmFlag>
</jvmFlags>

The purpose of these flags included:

  • Limiting the JVM memory range so it would throw an OutOfMemoryError exception when exhausted, preserving the crime scene's stack trace.
  • Setting the thread stack size to 512k (since we don't have overly complex logic).
  • Capping Direct Memory at 192M.
  • Enabling Native Memory Tracking to analyze it with jcmd diff.
  • Configuring HeapDumpOnOutOfMemoryError to save the heap dump to a K8s container volume.
  • Enabling the Datadog agent's jmxfetch.

We captured a heap snapshot (.hprof file) via the Spring Actuator /heapdump endpoint. A little side story here: I used to have exec access to K8s pods, which made it easy to shell in, run JDK commands, and download dumps. However, for security reasons, the SRE team had recently revoked this permission, leaving kubectl debug as the only option. So, I had to pick a pod that looked like it was on the verge of crashing, use kubectl to forward its port 8080 to my local machine, and then hit localhost:8080/heapdump to download the file.

Opening it with the Memory Analyzer Tool (MAT) was revealing:

  1. Suspect A: Giant byte[] Arrays & Netty. The Dominator Tree view showed several abnormally large byte[] arrays. Tracing their references via Path to GC Roots, we found they all pointed back to reactor.netty's memory pool components (PoolChunk). This led us to initially suspect a Netty buffer leak, prompting the addition of the -Dio.netty.leakDetection.level=PARANOID JVM flag to catch any unreleased allocations in the logs (see the sketch after this list).

    heapdump_hprof_dominator_tree.png

    heapdump_byte_array_path_to_GC.png

  2. Suspect B: The Peculiar DatadogClassLoader. The Leak Suspects report pointed to another problem: a large number of java.util.zip.ZipFile$Source instances (open JAR file handles) and java.lang.Class objects were being held by a classloader named DatadogClassLoader. This not only explained why our Non-Heap memory was slowly growing but also uncovered a chronic resource leak caused by our monitoring agent.
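
Returning to Suspect A for a moment: the post doesn't show where the Netty buffers were actually being retained, but this hedged sketch (class name and loop invented, not bot-gateway code) shows the kind of mistake that -Dio.netty.leakDetection.level=PARANOID, or its programmatic equivalent, is designed to surface:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

// Hypothetical sketch: allocate pooled buffers and never release them.
public class NettyLeakSketch {
    public static void main(String[] args) {
        // Programmatic equivalent of -Dio.netty.leakDetection.level=PARANOID: track every buffer.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        for (int i = 0; i < 10_000; i++) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(8 * 1024);
            buf.writeLong(i);
            // Bug on purpose: buf.release() is never called, so the pooled chunk stays reserved
            // until the wrapper is garbage-collected and Netty logs a LEAK warning.
        }
        System.gc(); // nudge the GC so the leak detector has a chance to report
    }
}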

3. The Decisive Error Log (The Smoking Gun)

With the JVM flags configured, I started monitoring. A day and night passed with no specific logs, but I noticed the pod restart frequency had significantly decreased. Just as I was feeling lost, I tried to dump another heap profile via the Spring Actuator endpoint. As luck would have it, the pod must have been truly on the edge this time. The endpoint returned a 500 error, with the reason being java.lang.OutOfMemoryError: Java heap space.

heapdump-OOMError.png

This was interesting. So it could throw an OutOfMemoryError after all. I immediately broadened my log search for "java.lang.OutOfMemoryError" and, to my surprise, found several instances buried in a pile of error logs. Aside from the one triggered by the /heapdump endpoint, all the others came from another piece of logic:

An internal JVM OOM Error:

java.lang.OutOfMemoryError: Java heap space
at java.desktop/java.awt.image.DataBufferByte.<init>(DataBufferByte.java:93)
...
at javax.imageio.ImageIO.read(ImageIO.java:1466)
...
at com.justeattakeaway.botgateway.service.evidence.validators.impl.ImageValidator.readImage(ImageValidator.java:75)

The case was cracked! This stack trace was a beam of light that illuminated the entire problem. The error was happening in our ImageValidator service! I quickly reviewed the code. We have a feature that allows users to upload images of food-related issues for customer service. During the upload, the backend performs validation (e.g., image dimensions, file size, format), which relies on ImageValidator. Internally, it was calling ImageIO.read(), a method that loads the entire, uncompressed pixel data of the image into heap memory.

This meant we were holding the complete user-uploaded image in memory, which is incredibly resource-intensive over time. These large objects couldn't be collected quickly enough and became "tenants for life" in the Old Gen. An even more severe security risk was that a malicious user could upload an "image bomb"—a file that is small in size but has an extremely high resolution—to instantly exhaust all memory. For example, at 4 bytes per pixel, a 20,000x20,000 pixel image requires about 1.6 GB of heap space! This was a critical code vulnerability.
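
The original readImage method isn't reproduced in the post, but the risky shape is roughly the following; this is a hypothetical sketch (class and method names invented) of the pattern the stack trace points to:

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import javax.imageio.ImageIO;

// Hypothetical sketch of the vulnerable pattern: the full pixel buffer is decoded onto the heap
// before any dimension or size check can run.
public class UnsafeImageRead {
    static BufferedImage readImage(InputStream upload) throws IOException {
        // Allocates roughly width * height * 3-4 bytes on the heap;
        // a 20,000 x 20,000 px upload needs on the order of 1.6 GB.
        return ImageIO.read(upload);
    }
}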

Part IV: The Fix

We now had a complete picture of the problem:

  1. Chronic Illness: A ClassLoader leak from the Datadog Agent was slowly raising the Non-Heap memory baseline.
  2. Complication: A potential buffer leak in Netty was adding pressure to the Heap.
  3. The Trigger: The dangerous image handling logic in ImageValidator was the final straw that broke the camel's back.

Our solution had to be a targeted, three-pronged attack:

  1. The Critical Patch (The Cure): Refactor ImageValidator. We abandoned the direct call to ImageIO.read() and switched to the safer ImageReader API to read the image's dimensions (metadata) before fully decoding it. If the dimensions exceeded a preset safety threshold, the image was rejected immediately.
  2. Process Re-engineering (Strengthening the Foundation): We refactored the entire file-handling flow from a byte[]-based approach to an InputStream-based, streaming approach, reading data in 8KB chunks. If the size exceeded the limit, the image was rejected at the source, avoiding the problem of loading large files into memory all at once. (A sketch of both fixes follows this list.)
  3. Long-term Governance (Eradicating the Disease): We identified our Datadog Java Tracer version (1.39.0) and planned an upgrade. Upgrading to the latest version would resolve the known ClassLoader leak and other minor logging errors simultaneously.
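
Here is a minimal sketch of the first two fixes; the thresholds, class, and method names are my own assumptions for illustration. Dimensions are checked from the image header before any pixels are decoded, and the upload is copied in 8 KB chunks with an early abort once a byte limit is crossed.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class SafeImageChecks {
    static final int MAX_WIDTH = 4_096;              // hypothetical thresholds
    static final int MAX_HEIGHT = 4_096;
    static final long MAX_BYTES = 5L * 1024 * 1024;

    // Fix 1: read only the header metadata and reject oversized dimensions before decoding pixels.
    static void validateDimensions(ImageInputStream iis) throws IOException {
        Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
        if (!readers.hasNext()) throw new IOException("Unsupported image format");
        ImageReader reader = readers.next();
        try {
            reader.setInput(iis, true, true);        // seekForwardOnly, ignoreMetadata
            int width = reader.getWidth(0);
            int height = reader.getHeight(0);
            if (width > MAX_WIDTH || height > MAX_HEIGHT) {
                throw new IOException("Image dimensions exceed the safety threshold");
            }
        } finally {
            reader.dispose();
        }
    }

    // Fix 2: stream the upload in 8 KB chunks and abort as soon as the byte limit is crossed.
    static long copyWithLimit(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[8 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > MAX_BYTES) throw new IOException("Upload exceeds the size limit");
            out.write(chunk, 0, n);
        }
        return total;
    }
}

The key design choice is that both checks fail fast on attacker-controlled input before any large allocation can happen.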

Part V: Calm After the Storm

After deploying the fix, the results were immediate and dramatic. The JVM Metrics dashboard showed an unprecedented level of stability:

JVM-Metrics-Datadog-09-02-2025_04_04_PM.png

  • Heap Usage no longer had fatal spikes and settled into a healthy, periodic pattern, dropping from 500MB to a stable 150MB.
  • Old Gen Size stopped its relentless growth, stabilizing at a reasonable 100MB (down from 400MB).
  • New Gen Size dropped from 90MB to 30MB.
  • The OOMKilled alerts fell silent.

Final Chapter: Lessons and Reflections

This thrilling investigation (during which my carefully configured JVM flags were rolled back multiple times due to other production incidents) left us with invaluable lessons:

  1. Everything is Connected: Complex production issues are often the result of multiple, seemingly unrelated factors (application vulnerabilities, dependency leaks, framework usage) compounding each other.
  2. Theory Guides Practice: A deep understanding of foundational knowledge like the JVM memory model is the key to correctly interpreting monitoring data and memory snapshots.
  3. Tools are a Detective's Eyes: Proficiency with monitoring systems (Metrics), memory analyzers (MAT), and built-in diagnostic tools is fundamental to efficient problem-solving.
  4. Defensive Programming: Never trust user input. Implementing strict, memory-safe validation for untrusted data (like uploaded files) is a lifeline for system robustness.

This experience not only solved a tricky production issue but also deepened my understanding of the JVM, containerization, and distributed systems. Every production problem is a precious learning opportunity. It's through these challenges that we grow into better engineers—better Builders, Solvers, and Collaborators.

Restarting Life Abroad at 30: A Quest for Choice and Freedom

· 3 min read

Hanging out in a Dutch town

Time has flown by, and it's already been nine months since I arrived in the Netherlands for work at the end of April 2023. I'm still eagerly awaiting my appointment with the GP and trying to decipher the Dutch signs on the streets. But finally, I can pause to reflect on the early part of 2024.

From the moment I decided to try working overseas, I began preparing in every aspect. Back then, ChatGPT wasn't as widely used, so I used Google Translate to complete my English resume. With courage, I sent applications to various companies and stumbled through interviews in broken English. Thankfully, my previous job supported WFH, allowing me to keep practicing, applying for jobs, preparing for interviews, and discussing life with my Cambly tutor in my free time. I also followed Xiaohongshu bloggers like Huanzhuzhu Run, even changing my name to something similar, drawing inspiration from their daily updates. Looking back now, it hardly feels like such a lengthy and arduous preparation process. After solving 100 LeetCode problems, practicing spoken English for over two months, and receiving countless rejection letters, I was fortunate enough to receive a few job offers from Japan and the Netherlands. After some thought, I chose the Netherlands.

So, at the end of April, I quickly resigned, packed up eight years of belongings, sold nearly two hundred books, and bid farewell to my parents and old friends before boarding the direct flight from Beijing to Amsterdam. I was so exhausted that I couldn't even think about what was to come next. Upon landing, I was stunned for a moment by the vast grasslands and blue skies. After resting for four or five days, I started my new job. One moment I was a busy worker among the high-rises and traffic of Beijing; the next, I found myself in a picturesque, sparsely populated Dutch town. I had never studied abroad, so this was my first time fully immersed in an environment of English and even a mix of other languages. The freshness, culture shock, jet lag, and completely incomprehensible Dutch overwhelmed me. Smiling and telling colleagues "Oh, it tastes not bad" while eating cold sandwiches, feeling the curious glances on the street: my first month remains a blur of fragments. Suddenly becoming the "foreigner" in others' eyes and tearing up over hot food made me truly realize how much an iron-clad Chinese stomach and language skills matter for integration and belonging.

Looking back, my decision was partly driven by a longing for the outside world and partly by a fear of certain oppressive aspects of life. During the pandemic, I discussed goals and dreams with my language tutor. I told her that I always wanted the right to "choose at any time," which is something I've been pursuing all along.

Welcome

· One min read

Hello, I started writing again.

This time, I used Docusaurus, and it works pretty well.

This is my personal website; I hope to keep pondering and summarizing here, since I believe it's a good habit.

Goal

So I want to accomplish the following:

  • organize my knowledge and keep track of what I've learned lately;
  • experiment with new productivity strategies, stay motivated, and avoid burnout;
  • improve my English (as a non-native English speaker);
  • become more self-disciplined so that I have more time for the activities listed above.

Uber's High-Performance Web App Optimization in Practice

· 8 min read

Original article - Building m.uber: ENGINEERING A HIGH-PERFORMANCE WEB APP FOR THE GLOBAL MARKET

Performance matters on mobile.

Yet another piece on performance optimization in practice.

The m.uber team did some performance optimization work on m.uber, their super-lightweight web app.

The scope is comprehensive, covering everything from code to bundling to deployment to caching.

TL;DR

Performance Tools

Starting from Ten React Mini Design Patterns

· 15 min read

An article I had been meaning to read closely for a long time: "10 React Mini Patterns." While working on the Creator project, I finally finished a careful read of it.

Combined with my own project experience during development, I took some notes.

The Creator project is a multi-platform (Web + Mobile) React SPA, with some form filling and complex interactive components.

I rolled my own very simple store based on Node's EventEmitter and learned a lot in the process; those details deserve a separate write-up. Later, the product team added a "pin to top" feature, and the EventEmitter logic, which amounted to two-way data communication, became too messy, so I bit the bullet and spent the time to upgrade to a Redux + Immutable.js + Normalizr stack, which turned out to be much less stressful.

The original author's point: if you write React every day, you'll notice that you keep reaching for the same handful of techniques to implement requirements. Broadly speaking, these are design patterns in development. Here we call them Mini Patterns.

Bug makers or bug fixers

· 6 min read

There are only two kinds of people in the world: bug makers or bug fixers.

— Pixian Douban

Maybe Mondays just aren't good days to ship.

In the afternoon, while working on a feature, I suddenly got a report about a small bug I had supposedly fixed before but clearly hadn't.

Then came a production bug report from a feature that had just gone live, followed immediately by two, three more bugs from the same feature... a whole pile of them, blowing up at once.

In the end, the PM said, "There are too many bugs this time; let's roll everything back for now."

It suddenly struck me that, after working for well over a year, almost two, I still don't write code with care.

As a programmer, having someone point out bugs in your code is uncomfortable. Beyond the embarrassment, sometimes you genuinely feel ashamed; you look down on yourself a little and feel that others look down on you too.

I only hope I can remember this feeling every single time I write code.

I've run into all kinds of strange production bugs before, and most of them came down to not being careful while writing and not being attentive while testing.

Lately, though, I've kept feeling rushed to finish tasks while coding, yet the result was neither fast nor high quality: not good and not quick, with a lot of time lost to debugging instead.

A bug maker through and through.

Honestly, if I reflect carefully, I know the reason.

Maybe We're All a Little Sick

· 9 min read

Anxiety, depression?

Yesterday I happened to see a time-lapse video of UESTC ("Chengdian") in my WeChat Moments and curiously clicked in. As the background music started: the main building in the morning fog, the ginkgo avenue in the morning light, the dorm buildings under a blue sky, the Pinxue Building at dusk, the library as night fell, the brightly lit gymnasium, the new teaching building where I never got the chance to take a class, and the rippling East and West Lakes...
Given Chengdu's weather, the creator must have poured a lot of effort into it. Post-processing aside, it was seven thousand five hundred photos, frame by frame, every single one perfectly exposed; I felt I had never seen Chengdian look so beautiful. Come to think of it, in the nearly two years since graduating and leaving campus, I have never really missed my university.
Whenever I mention it to anyone, I can only describe it with one word: dark.
During that time, I went through the confusion and gloom of starting university, a long-distance relationship that fizzled out and the isolation that followed, the anxiety and depression born of the gap between dreams and reality, the strained relationships in which I felt ignored, abandoned, and misunderstood, and sleep that never recovered after I entered university. At the worst of it, I drifted through the days like a zombie, unable to study or concentrate, had to take sleeping pills to fall asleep at night, and felt a paranoid sense of persecution walking down the street. At the same time, I kept wondering why the people around me could be so carefree and get what they wanted so easily.

Fun with Codemod & AST

· 14 min read

TL;DR

  • To solve the problem of migrating large codebases, Facebook built a tool called Codemod based on the AST
  • On top of Codemod, they built jscodeshift and react-codemod, tools dedicated to JavaScript code migration
  • Understanding the principles behind these tools helps you go from being a mere "API user" to an engineer-like "creator"
  • Demo Time! Let's write a codemod
  • Some valuable references

Douban, Beijing, Work, and 2016: the Year I Finally Made Peace with Myself

· 23 min read

In the depth of winter, I finally learned that within me there lay an invincible summer. — Camus

Douban

At the end of 2015, I wrote this in my notebook:

Sometimes I think that being able to join Douban really counts as a kind of luck. This company, then and now, is largely made up of people from Fudan and HUST, two schools I'm very fond of, so at times I identify strongly with the similar outlooks on life, the world, and values here. Even if I occasionally feel a little anxious about my current abilities and efficiency, in any case, as long as I don't slip backwards, I will keep moving forward.

In the summer of 2015, I joined Douban.
Yes, that internet company famous for being artsy.
I still remember the end of 2014, when I was in my senior year, an awkward time, facing the predicament of job hunting and choosing a company.
Back then, I liked taking photos and listening to music, strumming the guitar, and reading fun miscellaneous books, and I would get excited at the sight of charming design, so at that rigid, lifeless, GPA-is-king engineering school I considered myself something of an artsy youth. But I had also always loved hardware, consumer electronics, and taking things apart and putting them back together; I enjoyed the process of creating things with my own hands, liked the Web, and could write some code, and at one point becoming a Geek was one of my life's ideals. So I didn't want to waste that foundation, and figured I could still do work with some technical substance: go do technical work at an artsy company, and make a living without wrecking my values. That was my thinking. Luckily, four months after I made that decision, I received an offer from Douban.

Following in the Footsteps of Those Who Move Forward

· 4 min read

Everyone, at different stages of life, needs a shot of motivation now and then.

Since I started working, I've genuinely felt lazy to the bone, lacking self-control and self-discipline in many things.

I just remember a quote attributed to Steve Jobs:

Where does freedom come from? From confidence, and confidence comes from self-discipline! First learn to restrain yourself and run your life on a strict schedule; only through that discipline can you steadily forge confidence. Confidence is the ability to control things. If you can't even control something as basic as your own time, what confidence is there to speak of?

Whether it really came from Jobs I haven't verified, but when I first saw the quote, it genuinely hit home.

The personal blog I've always wanted to get going has kept being pushed back, whether it's the actual article content, the surface-level UI / theme / pages, or the lower-level work of network optimization / VPS configuration / automated operations.

I've actually been following some interesting peers all along, so here is a list of a few personal websites I visit when I'm short on inspiration.