Revolutionizing AI Memory Management: DeepSeek’s Engram
DeepSeek, in collaboration with Peking University, has introduced a new training method called Engram, designed to decouple memory storage from computation. The approach could significantly improve the efficiency of large AI models.
Traditional large language models rely on high-bandwidth memory (HBM) for both knowledge retrieval and basic computation, creating a bottleneck in performance and cost. Demand for memory to feed large AI models has squeezed the broader market, and this HBM bottleneck is widely cited as a key reason DRAM prices reportedly rose roughly fivefold in just ten weeks.
Validation and Technical Approach
The researchers behind Engram have identified that existing models waste sequential depth on trivial operations, which could otherwise support higher-level reasoning. Engram allows models to efficiently “look up” essential information without overloading GPU memory, freeing capacity for more complex reasoning tasks. The system was tested on a 27-billion-parameter model and showed measurable improvements across standard industry benchmarks.
By performing knowledge retrieval through hashed N-grams, Engram provides static memory access independent of the current context. The retrieved information is then adjusted using a context-aware gating mechanism to align with the model’s hidden state. This design allows models to handle long context inputs more efficiently and supports system-level prefetching with minimal performance overhead.
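As a rough illustration of that retrieval path, the sketch below hashes the most recent token IDs into a bucket of a fixed-size table and gates the retrieved vector with the current hidden state. The table size, embedding width, hash function, and sigmoid gate are all assumptions made for illustration; the article does not describe Engram's exact interface.

```python
import hashlib
import numpy as np

# Hypothetical sizes -- Engram's real configuration is not given in the article.
TABLE_SIZE = 1 << 20   # number of hash buckets in the static memory table
EMBED_DIM = 256        # width of each stored memory vector

rng = np.random.default_rng(0)
memory_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM)).astype(np.float32)
gate_weights = (rng.standard_normal((EMBED_DIM, EMBED_DIM)) * 0.01).astype(np.float32)

def ngram_bucket(tokens, n=2):
    """Hash the last n token IDs to a fixed bucket -- static and context-independent."""
    key = ",".join(str(t) for t in tokens[-n:]).encode()
    return int(hashlib.sha1(key).hexdigest(), 16) % TABLE_SIZE

def engram_lookup(tokens, hidden_state):
    """Retrieve a static memory vector, then modulate it with a context-aware gate."""
    retrieved = memory_table[ngram_bucket(tokens)]                 # deterministic lookup
    gate = 1.0 / (1.0 + np.exp(-(hidden_state @ gate_weights)))    # sigmoid gate from the hidden state
    return hidden_state + gate * retrieved                         # gated injection into the residual stream

# Toy usage: a short token sequence and a hidden state of matching width.
tokens = [101, 2054, 2003]
hidden = rng.standard_normal(EMBED_DIM).astype(np.float32)
print(engram_lookup(tokens, hidden).shape)  # (256,)
```

Because the bucket depends only on the preceding tokens, the lookup can be computed ahead of time, which is what makes system-level prefetching cheap.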
Minimizing High-Speed Memory Requirements
Engram minimizes the amount of high-speed memory required by using lookups for static information, making memory usage more efficient. This approach complements other hardware-efficient solutions, such as Phison’s AI inference accelerators, which offer a cost-effective way to expand total memory using SSDs. Combined, these approaches allow AI systems to optimize fast-memory usage while affordably increasing overall memory capacity.
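A minimal sketch of that tiered-memory idea, assuming the static table can live in a memory-mapped file on SSD-backed storage while only the rows needed for the current batch are copied into fast memory. The file name, table size, and helper function are hypothetical.

```python
import numpy as np

EMBED_DIM = 256
TABLE_ROWS = 100_000
PATH = "engram_table.dat"  # hypothetical backing file on SSD-backed storage

# One-time creation of the backing store (in practice this would hold the trained table).
np.memmap(PATH, dtype=np.float32, mode="w+", shape=(TABLE_ROWS, EMBED_DIM)).flush()

# Open the full table read-only without loading it into RAM.
table = np.memmap(PATH, dtype=np.float32, mode="r", shape=(TABLE_ROWS, EMBED_DIM))

def fetch_rows(bucket_ids):
    """Copy only the requested buckets from slow storage into a contiguous fast-memory buffer."""
    return np.ascontiguousarray(table[np.asarray(bucket_ids)])

batch = fetch_rows([42, 9_001, 77_345])
print(batch.shape)  # (3, 256)
```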
The method separates static pattern storage from dynamic computation, enhancing the Transformer backbone without increasing FLOPs or the total parameter budget. DeepSeek formalized a U-shaped expansion rule to optimize the allocation of parameters between the MoE conditional computation module and the Engram memory module. Tests show that reallocating around 20–25% of the sparse parameter budget to Engram yields better performance than pure MoE models, with stable gains across scales.
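The arithmetic behind that reallocation is simple; the helper below splits a fixed sparse-parameter budget between MoE experts and an Engram table at a chosen fraction. The function name and the example budget are illustrative, not figures from the paper.

```python
def split_sparse_budget(total_sparse_params, engram_fraction=0.25):
    """Split a fixed sparse-parameter budget between MoE experts and the Engram table.

    The article reports that routing roughly 20-25% of the sparse budget to Engram
    outperformed a pure-MoE baseline; the split rule here is only illustrative.
    """
    engram_params = int(total_sparse_params * engram_fraction)
    moe_params = total_sparse_params - engram_params
    return moe_params, engram_params

# Example: a hypothetical 20B sparse-parameter budget with 25% routed to Engram memory.
moe, engram = split_sparse_budget(20_000_000_000, engram_fraction=0.25)
print(f"MoE experts: {moe:,} params, Engram table: {engram:,} params")
```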
Scalability and Efficiency
Engram’s deterministic retrieval mechanism allows memory capacity to scale linearly across multiple GPUs while supporting asynchronous prefetching during inference. It offloads static knowledge reconstruction from lower layers, freeing attention mechanisms to focus on global context. Hierarchical caching of frequently used embeddings enhances efficiency, and the module works with existing GPU and system memory architectures, potentially avoiding costly HBM upgrades.
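One plausible way to combine deterministic retrieval with asynchronous prefetching and hierarchical caching is a small LRU cache of hot embeddings in fast memory, warmed in a background thread ahead of use. Everything below (the class, the capacity, the slow_fetch stand-in for reading the slow tier) is an assumption to illustrate the idea, not DeepSeek's implementation.

```python
import threading
from collections import OrderedDict
import numpy as np

EMBED_DIM = 256

def slow_fetch(bucket_id):
    """Stand-in for a read from host- or SSD-resident memory (the slow tier).

    Deterministic per bucket, mirroring Engram's deterministic retrieval."""
    return np.random.default_rng(bucket_id).standard_normal(EMBED_DIM).astype(np.float32)

class PrefetchingCache:
    """Small LRU cache of hot embeddings with background prefetch of predicted buckets."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.lock = threading.Lock()

    def get(self, bucket_id):
        with self.lock:
            if bucket_id in self.cache:
                self.cache.move_to_end(bucket_id)      # mark as recently used
                return self.cache[bucket_id]
        value = slow_fetch(bucket_id)                  # cache miss: read the slow tier
        self._put(bucket_id, value)
        return value

    def _put(self, bucket_id, value):
        with self.lock:
            self.cache[bucket_id] = value
            self.cache.move_to_end(bucket_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)         # evict the least recently used entry

    def prefetch(self, bucket_ids):
        """Because lookups are deterministic, upcoming buckets can be fetched ahead of time."""
        def worker():
            for b in bucket_ids:
                with self.lock:
                    if b in self.cache:
                        continue
                self._put(b, slow_fetch(b))
        threading.Thread(target=worker, daemon=True).start()

cache = PrefetchingCache()
cache.prefetch([7, 8, 9])   # warm the cache in the background
vec = cache.get(7)          # served from cache if prefetch finished, otherwise fetched on demand
print(vec.shape)            # (256,)
```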
This technique may relieve pressure on expensive memory hardware, particularly in China, where domestic access to HBM lags behind supply from leaders such as Samsung, SK Hynix, and Micron. Early validation of Engram suggests models can expand parameter scale and reasoning capacity while managing memory demands more efficiently. This approach may help ease memory constraints across AI infrastructure, potentially reducing sharp DDR5 DRAM price swings.