The OpenGPT-X research project has published a large AI language model called Teuken-7B. It was trained from scratch on all 24 official languages of the European Union and comprises seven billion parameters. Teuken-7B marks an important milestone for science and business in Europe. It offers researchers and companies an open-source alternative to commercial models, enabling more transparent and customizable AI solutions. The multilingual language model was developed under the leadership of the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS. The Institute of Computer Science especially congratulates Prof. Dr. Stefan Wrobel, IAIS Institute Director and Professor of Department III at the Institute of Computer Science, on this outstanding success!
Multilingual training and efficiency
Teuken-7B is currently one of the few AI language models that have been developed multilingually right from the start. It contains around 50 percent non-English pre-training data and has proven to be stable and reliable in its performance across several languages. This offers added value, particularly for international companies with multilingual communication needs, products, and services. “Our model has demonstrated its capabilities across a wide range of languages, and we hope that as many people as possible will adapt and develop the model for their own work and applications. In this way, we want to contribute, both within the scientific community and together with companies from different industries, to the growing demand for transparent and customizable generative AI solutions,” says Prof. Dr. Stefan Wrobel.
A multilingual tokenizer developed specifically in the OpenGPT-X project reduces training costs compared to other tokenizers, such as those of Llama3 or Mistral. This can increase efficiency, especially for European languages with long words and in the operation of multilingual AI applications.
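To illustrate why tokenizer design matters for languages with long words, the following minimal Python sketch compares how many tokens two toy vocabularies need for the same German phrase. It is not the OpenGPT-X tokenizer: the greedy longest-match segmentation, the function names, and both vocabularies are hypothetical and chosen only for illustration. The measure used, average tokens per word, is often called "fertility"; fewer tokens per word means less text to process during training and inference.

```python
def greedy_tokenize(word, vocab):
    """Split a word by greedy longest-match; unmatched characters become single tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.lower().split()
    total_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return total_tokens / len(words)


# Hypothetical toy vocabularies, not taken from any real tokenizer.
english_centric_vocab = {"data", "protect", "ion", "the", "law"}
multilingual_vocab = english_centric_vocab | {"das", "daten", "schutz", "gesetz"}

sample = "das Datenschutzgesetz"

if __name__ == "__main__":
    print("English-centric fertility:", fertility(sample, english_centric_vocab))
    print("Multilingual fertility:   ", fertility(sample, multilingual_vocab))
```

In this toy setting, the vocabulary that contains German subwords splits the compound into three pieces, while the English-centric vocabulary falls back to single characters. Real subword tokenizers are trained on data rather than hand-written, so the actual savings depend on the language and text.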
Open-source model with a European perspective
As a freely available open-source model, Teuken-7B offers science and companies in Europe an alternative that originates from public research. It supports the development of individual AI solutions without black-box components, which is particularly important for safety-critical applications in areas such as the automotive industry, robotics, medicine, and finance. The model also enables sensitive company and research data to be used securely, in compliance with European data protection and security regulations. Last but not least, a European language model strengthens the digital sovereignty, competitiveness, and resilience of Germany and Europe.
The OpenGPT-X project was funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) with around 14 million euros. The ten project partners are Fraunhofer IAIS, Fraunhofer IIS, Forschungszentrum Jülich, KI Bundesverband, TU Dresden, DFKI, IONOS, Aleph Alpha, ControlExpert, and WDR.
Teuken-7B is now accessible via the Gaia-X infrastructure and can be downloaded free of charge on Hugging Face.