Researchers from DeepSeek have proposed a counterintuitive approach to improving large language models by feeding them images of text instead of pure text as input tokens. Their study, “DeepSeek-OCR: Contexts Optical Compression,” suggests this method could achieve 9-10x text compression while maintaining 96% precision, potentially allowing AI systems to handle much larger context windows within existing memory constraints.
Why this matters: Current LLMs are tightly constrained by token capacity; most frontier models handle roughly 200,000 to 500,000 tokens at once, forcing them to “forget” earlier parts of long conversations or documents. If the compression holds up in practice, a 400,000-token context window could effectively carry 4 million tokens’ worth of text, enabling far longer and more complex AI interactions.
How tokenization works: Modern AI systems convert text into numbers called tokens before processing, then convert the numerical results back into readable text (a minimal code sketch follows the list below).
- When users hit token limits, AI systems start discarding earlier conversation history, making complex problem-solving nearly impossible.
- Early ChatGPT versions were limited to about 4,000 tokens, on the order of 3,000 English words, since a token averages roughly three-quarters of a word.
- OpenAI’s newest models handle about 400,000 tokens, while Claude ranges from 200,000 to 500,000 tokens depending on the model.
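To make the text-to-numbers round trip concrete, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer; it is a generic encoder chosen for availability, not the tokenizer DeepSeek uses.

```python
# A minimal sketch of the text-to-tokens round trip, using OpenAI's
# open-source tiktoken library (pip install tiktoken). This is a generic
# tokenizer shown for illustration, not DeepSeek's own.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

text = "Optical compression renders text as images before encoding."
tokens = enc.encode(text)      # text -> list of integer token IDs
print(tokens)                  # e.g. [30098, ...] (IDs vary by encoding)
print(len(tokens), "tokens for", len(text.split()), "words")

print(enc.decode(tokens))      # model output is decoded the same way,
                               # round-tripping to the original string
```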
The visual approach: Instead of processing text directly, the new method first renders the text as an image, then encodes that image into vision tokens (a rough sketch follows the list below).
- The process reverses traditional optical character recognition (OCR), going from text-to-image rather than image-to-text.
- A passage encoded as vision tokens can carry the same information in substantially fewer tokens than its plain-text tokenization.
- The method leverages existing vision-language model technology, avoiding the need to build entirely new systems.
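Here is a rough sketch of that reversal, assuming a hypothetical 1024x1024-pixel page and 64-pixel patches (neither value comes from the paper): text is rasterized with Pillow, and each patch of the bitmap stands in for one vision token.

```python
# A rough sketch of the text-to-image step, with illustrative page and
# patch sizes not taken from the DeepSeek-OCR paper. Requires Pillow
# (pip install pillow).
from PIL import Image, ImageDraw, ImageFont

def render_text_as_image(text: str, size: int = 1024) -> Image.Image:
    """Rasterize text onto a white page: the reverse of OCR."""
    page = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(page)
    draw.multiline_text((20, 20), text, fill="black", font=ImageFont.load_default())
    return page

sample = "A page of document text rendered as pixels rather than tokens.\n" * 40
page = render_text_as_image(sample)

# A ViT-style encoder consumes the page as a grid of patches; each patch
# becomes one vision token no matter how many characters it covers.
PATCH = 64  # hypothetical patch size in pixels
vision_tokens = (page.width // PATCH) * (page.height // PATCH)
print(vision_tokens, "vision tokens per page")  # 16 * 16 = 256

# If the page held ~2,500 text tokens of content, 256 vision tokens
# would amount to roughly the 10x compression the paper describes.
```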
Key performance metrics: The DeepSeek research reports a clear trade-off between compression ratio and decoding precision (worked arithmetic follows the list).
- 96% precision at 9-10x compression ratio.
- 90% precision at 10-12x compression ratio.
- 60% precision at 20x compression ratio.
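The arithmetic behind these figures is straightforward. The sketch below multiplies an assumed 400,000-token budget by the reported ratios; the baseline is illustrative, while the ratios and precision values are those cited above.

```python
# Back-of-the-envelope math for the figures above. The 400,000-token
# budget is an illustrative baseline; the ratios and precision values
# are the ones reported in the article.
def effective_window(token_budget: int, compression_ratio: float) -> int:
    """Text tokens representable when the budget is spent on vision tokens."""
    return int(token_budget * compression_ratio)

for ratio, precision in [(10, 0.96), (12, 0.90), (20, 0.60)]:
    window = effective_window(400_000, ratio)
    print(f"{ratio}x -> ~{window:,} effective text tokens at {precision:.0%} precision")

# 10x -> ~4,000,000 effective text tokens at 96% precision
# 12x -> ~4,800,000 effective text tokens at 90% precision
# 20x -> ~8,000,000 effective text tokens at 60% precision
```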
What experts are saying: AI luminary Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, expressed enthusiasm for the approach’s broader implications.
- “The more interesting part for me is whether pixels are better inputs to LLMs than text,” Karpathy posted on X.
- “Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images.”
Additional advantages: The image-based approach offers benefits beyond compression for multilingual applications.
- Languages written in logographic scripts, such as Chinese, are especially well-suited to image-based tokenization.
- The method can leverage existing computer vision advances rather than requiring entirely new architectures.
- It provides a natural bridge between vision and language processing capabilities.
The trade-offs: Like most technological advances, the approach involves balancing compression gains against precision losses.
- While 96% precision at 10x compression seems promising, the 60% precision at 20x compression may be too low for many applications.
- The practical value depends heavily on specific use cases and tolerance for reduced accuracy.
- More research is needed to understand the full implications and optimal implementation strategies.