Anthropic will pay you $15,000 if you can hack its AI safety system

Anthropic has set out to test the robustness of its AI safety measures by offering a $15,000 reward to anyone who can successfully jailbreak its new Constitutional Classifiers system.

The challenge details: Anthropic has invited researchers to attempt bypassing its latest AI safety system, Constitutional Classifiers, which uses one AI model to monitor and enforce another's adherence to a defined set of principles.

  • The challenge requires researchers to successfully jailbreak the system on 8 out of 10 restricted queries
  • A previous round saw 183 red-teamers spend over 3,000 hours attempting to bypass the system, with no successful complete jailbreaks
  • The competition runs until February 10, offering participants a chance to win the $15,000 reward

System effectiveness: Early testing results demonstrate significant improvements in preventing unauthorized access and harmful outputs compared to traditional safeguards.

  • Operating on its own, Claude blocked only 14% of attempted attacks
  • With Constitutional Classifiers enabled, the blocking rate rose to over 95% of attempted attacks
  • The system is guided by a “constitution” of principles that the AI model must follow

Technical implementation: Constitutional Classifiers represents an evolution in AI safety architecture, though some practical challenges remain.

  • The system employs Constitutional AI, where one AI model monitors and guides another (a rough sketch of this pattern follows the list below)
  • Current implementation faces high computational costs
  • Anthropic acknowledges that while highly effective, the system may not prevent every possible jailbreak attempt
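To make the monitoring idea concrete, here is a minimal, hypothetical sketch of the general pattern the article describes: one model call acts as an input/output classifier that judges content against a short "constitution," wrapped around a second call that actually answers. It uses the public Anthropic Python SDK, but the model name, prompts, and constitution text are illustrative placeholders; this is not Anthropic's actual Constitutional Classifiers implementation.

```python
# Illustrative sketch only: a classifier call screens the user's prompt and the
# model's reply against a short "constitution" before anything is returned.
# Model name, prompts, and constitution text are placeholders, not Anthropic's
# real implementation.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-latest"  # placeholder model identifier

CONSTITUTION = (
    "Refuse requests that seek instructions for weapons, malware, or other "
    "serious harm. Allow ordinary questions, including sensitive but benign ones."
)

def complies(text: str) -> bool:
    """Ask a classifier call whether `text` complies with the constitution."""
    verdict = client.messages.create(
        model=MODEL,
        max_tokens=5,
        system=(
            f"You are a safety classifier. Constitution:\n{CONSTITUTION}\n"
            "Answer only ALLOW or BLOCK."
        ),
        messages=[{"role": "user", "content": text}],
    )
    return verdict.content[0].text.strip().upper().startswith("ALLOW")

def guarded_answer(prompt: str) -> str:
    """Screen the prompt, generate a reply, then screen the reply as well."""
    if not complies(prompt):
        return "Request blocked by input classifier."
    reply = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = reply.content[0].text
    return answer if complies(answer) else "Response blocked by output classifier."

print(guarded_answer("How do chemical safety data sheets work?"))
```

Note that each guarded request in a layered setup like this requires extra model calls, which is consistent with the computational-cost concern the article raises.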

Broader security implications: The initiative highlights the growing focus on AI safety and the importance of robust testing methodologies.

  • The open challenge approach encourages collaborative security testing
  • Results help identify potential vulnerabilities before they can be exploited
  • The high success rate in blocking unauthorized access demonstrates progress in AI safety measures

Future considerations: While Constitutional Classifiers shows promise in enhancing AI safety, several key challenges and opportunities lie ahead in its development and implementation.

  • Anthropic continues working to reduce the system’s computational requirements
  • New jailbreaking techniques may emerge as AI technology evolves
  • The balance between security and computational efficiency remains a crucial area for improvement