Dutch news publishers join forces with TNO to build GPT-NL, a Dutch AI language model trained on legally obtained data

The Hague-based TNO, an independent organisation for applied research, is collaborating with members of NDP Nieuwsmedia, who represent the vast majority of Dutch news publishers, to further develop GPT-NL, the first large-scale Dutch AI language model trained exclusively on legally obtained data.

As part of the collaboration, the members of NDP Nieuwsmedia are making a substantial portion of their archives, containing articles from over 30 national and regional news titles, available to train the model.

The addition of the dataset is expected to double the amount of high-quality Dutch data used for training.

The news agency ANP is also joining the initiative. It’s the first time worldwide that private news publishers are collaborating with an organisation developing an AI model.

‘We’re proud of this collaboration. NDP Nieuwsmedia members are not only providing high-quality data, but also sending a strong message: AI can be developed responsibly, with respect for copyright and public values,’ says Selmar Smit, Manager of Science & Technology at TNO and founder of GPT-NL.

Why this initiative?

The collaboration between NDP Nieuwsmedia and TNO supports the shared goal of GPT-NL and the Dutch government: to create a language model that respects copyright and establishes a benchmark for handling copyrighted content in AI systems.

Strict agreements have been put in place to prevent the technical extraction of articles from the AI model. When GPT-NL is released, publishers will receive appropriate compensation.

GPT-NL is a non-profit initiative by TNO, NFI, and SURF, providing a responsible alternative to existing language models.

It is developed for the Netherlands using high-quality Dutch data. Unlike some international models that use significant portions of content scraped from the internet without permission, GPT-NL carefully and ethically collects copyrighted data, ensuring that contributors are compensated for their work.

The model also complies with European regulations, including the AI Act. It is being developed for specific tasks such as summarising, simplifying, and extracting information from text.

Access to over 20 billion tokens

Through the partnership with NDP Nieuwsmedia, the trade association representing private news publishers such as DPG Media, Mediahuis, Erdee Mediagroep, and De Groene Amsterdammer, along with contributions from ANP, the model has access to over 20 billion tokens.

These articles encompass a wide variety of topics, including politics, economics, healthcare, and science, offering extensive material for training GPT-NL.

The datasets comprise billions of tokens, which are small units of text that help AI understand and process language. A token can represent a word, a part of a word, or even punctuation.
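
To make the idea of a token concrete, here is a minimal sketch in Python. It is a toy illustration only, not the tokenizer used by GPT-NL: the function name `toy_tokenize` and the fixed piece length are assumptions made for this example, and real language models learn their subword vocabulary from data rather than cutting words at a fixed length.

```python
import re


def toy_tokenize(text, max_piece_len=6):
    """Split text into toy tokens: words, word parts, and punctuation.

    Hypothetical helper for illustration; real models use learned
    subword vocabularies instead of fixed-length cuts.
    """
    # Find runs of word characters, or single punctuation marks.
    pieces = re.findall(r"\w+|[^\w\s]", text)
    tokens = []
    for piece in pieces:
        # Long words are cut into smaller parts, so one word can
        # become several tokens. Short pieces pass through unchanged.
        for start in range(0, len(piece), max_piece_len):
            tokens.append(piece[start:start + max_piece_len])
    return tokens


tokens = toy_tokenize("Nieuwsmedia helpt GPT-NL.")
print(tokens)
print(len(tokens), "tokens")
```

In this sketch the long word "Nieuwsmedia" becomes two tokens, while the hyphen and the full stop each count as a token of their own, which is why token counts are usually higher than word counts.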

‘Big Tech companies have trained their models on news articles without permission or payment. This partnership between NDP Nieuwsmedia members and TNO shows there is an alternative route. We’re setting a precedent that helps the advancement of AI and strengthens journalism in the Netherlands. AI innovation can be ethical and responsible, without using the work of our journalists without permission. This step gives that movement a boost,’ says Rien van Beemen, Chair of NDP Nieuwsmedia.

Training started in June 2025. In the fourth quarter of 2025, the model will be improved and prepared for first use. Earlier Dutch contributors of data to GPT-NL include DNB (De Nederlandsche Bank), ICTRecht, and Het Utrechts Archief.

Vigneshwar Ravichandran