FlexOlmo architecture lets data owners remove content from trained AI models

The Allen Institute for AI (Ai2) has developed FlexOlmo, a new large language model architecture that allows data owners to remove their contributions from an AI model even after training is complete. The breakthrough challenges the current industry practice, in which training data becomes permanently embedded in models, potentially reshaping how AI companies access and use training data while giving content creators unprecedented control over their intellectual property.

How it works: FlexOlmo uses a “mixture of experts” architecture that divides training into independent, modular components that can be combined or removed later (a minimal code sketch follows the list below).

  • Data owners first copy a publicly shared “anchor” model, then train a second model using their own data and combine it with the anchor before contributing to the final model.
  • The data itself never has to be handed over to the model builder, maintaining privacy and control.
  • “Conventionally, your data is either in or out,” says Ali Farhadi, CEO of Ai2. “Once I train on that data, you lose control. And you have no way out, unless you force me to go through another multi-million-dollar round of training.”
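
To make that workflow concrete, here is a minimal PyTorch sketch of the contribution step described above. Everything in it is a hypothetical illustration, not Ai2’s code: a toy Expert module stands in for a sub-model, and train_private_expert shows how an owner could start from the shared anchor, train locally, and hand over only weights.

```python
# Hypothetical sketch of the FlexOlmo-style contribution flow described
# above (not Ai2's actual code). A data owner copies the public "anchor"
# model, trains an expert from it on data that never leaves their hands,
# and contributes only the resulting weights.
import copy
import torch
import torch.nn as nn

class Expert(nn.Module):
    """Toy feed-forward block standing in for one mixture-of-experts expert."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return self.net(x)

def train_private_expert(anchor: Expert, private_batches, lr: float = 1e-3):
    """Train a copy of the shared anchor entirely on the owner's side."""
    expert = copy.deepcopy(anchor)                 # start from the anchor
    opt = torch.optim.Adam(expert.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for x, y in private_batches:                   # local data, local loop
        opt.zero_grad()
        loss_fn(expert(x), y).backward()
        opt.step()
    return expert.state_dict()                     # only weights leave

# Toy usage: random tensors stand in for an owner's private corpus.
dim = 16
anchor = Expert(dim)
batches = [(torch.randn(8, dim), torch.randn(8, dim)) for _ in range(10)]
contributed_weights = train_private_expert(anchor, batches)
```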

Key innovation: The core advance is a new scheme for representing the values in a model so that independently trained sub-models can be merged and later extracted (see the sketch after this list).

  • “The training is completely asynchronous,” explains Sewon Min, a research scientist at Ai2 who led the technical work. “Data owners do not have to coordinate, and the training can be done completely independently.”
  • A magazine publisher could contribute text from its archive but later remove that sub-model if there’s a legal dispute or objection to how the model is being used.
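
A rough sketch of what that removal could look like, assuming the merged model keeps each contributor’s expert as a separate named module. The uniform-average routing below is a placeholder for illustration, not FlexOlmo’s actual router, and all names are hypothetical.

```python
# Hypothetical sketch (not FlexOlmo's implementation): because each
# contributor's expert stays a separate module inside the merged model,
# honoring an opt-out means deleting an entry, not retraining the model.
import torch
import torch.nn as nn

class MergedMoE(nn.Module):
    def __init__(self, experts: dict):
        super().__init__()
        self.experts = nn.ModuleDict(experts)

    def forward(self, x):
        # Placeholder routing: average whichever experts are still present.
        outs = [expert(x) for expert in self.experts.values()]
        return torch.stack(outs).mean(dim=0)

    def remove_expert(self, owner_id: str):
        """Drop one contributor's sub-model, e.g. after a legal dispute."""
        del self.experts[owner_id]

# Usage: merge three contributions, then remove one owner's expert.
dim = 16
model = MergedMoE({
    "anchor": nn.Linear(dim, dim),
    "publisher_a": nn.Linear(dim, dim),
    "publisher_b": nn.Linear(dim, dim),
})
x = torch.randn(2, dim)
out_before = model(x)
model.remove_expert("publisher_a")   # no multi-million-dollar retraining
out_after = model(x)                 # remaining experts are untouched
```

Because the experts’ parameters never mix, deleting one leaves the remaining contributors’ weights unchanged, which is what makes opting out cheap compared with retraining a monolithic model.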

Performance results: Testing showed FlexOlmo outperformed individual models and scored 10% better than other model-merging approaches.

  • Researchers created a 37-billion-parameter model using their Flexmix dataset, drawn from proprietary sources including books and websites.
  • The model is about one-tenth the size of Meta’s largest open-source model but demonstrated superior performance across all tasks.

Industry implications: The approach could transform how AI companies access sensitive data while addressing growing legal challenges around training data ownership.

  • Some publishers are suing large AI companies while others are cutting deals to grant access to their content.
  • In June, Meta won a major copyright case when a federal judge ruled the company didn’t violate law by training on text from books by 13 authors.
  • WIRED parent company Condé Nast has a deal in place with OpenAI.

What experts are saying: Stanford AI researcher Percy Liang sees the approach as a promising departure from current practices.

  • “Providing more modular control over data—especially without retraining—is a refreshing direction that challenges the status quo of thinking of language models as monolithic black boxes,” he says.
  • “Openness of the development process—how the model was built, what experiments were run, how decisions were made—is something that’s missing.”

Privacy considerations: While the approach offers more control, researchers warn that data reconstruction from the final model may still be possible.

  • Techniques like differential privacy might be required to provide mathematical guarantees of data protection (a generic illustration follows this list).
  • The method could enable AI firms to access sensitive private data in a more controlled way since the data doesn’t need to be disclosed during model building.
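
For context on the first point, differential privacy typically works by clipping each example’s gradient and adding calibrated noise during training. Below is a generic DP-SGD-style fragment; the clip norm and noise multiplier are illustrative values, not parameters from the FlexOlmo work.

```python
# Generic DP-SGD-style safeguard of the kind the researchers mention;
# values are illustrative, not taken from the FlexOlmo paper.
import torch

def privatize_grad(grad: torch.Tensor,
                   clip_norm: float = 1.0,
                   noise_mult: float = 1.1) -> torch.Tensor:
    """Clip a per-example gradient, then add calibrated Gaussian noise."""
    scale = min(1.0, clip_norm / (grad.norm().item() + 1e-12))
    clipped = grad * scale                          # bound sensitivity
    noise = torch.randn_like(grad) * noise_mult * clip_norm
    return clipped + noise

noisy = privatize_grad(torch.randn(16))
```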

Future possibilities: Min suggests this could enable new types of collaborative open models where different data owners can co-develop without sacrificing privacy or control.

  • “I really think the data is the bottleneck in building the state of the art models,” she says.
  • “This could be a way to have better shared models where different data owners can codevelop, and they don’t have to sacrifice their data privacy or control.”