Benjamin Paaßen, 2025-07-06
Large language models (LLMs) are a foundational technology, unlocking novel research methods, teaching practices, and business models – even when looking beyond the hype[1]. Given the increasing importance of LLMs, it is deeply concerning that the supply chain for LLMs is controlled by a handful of AI corporations located in the US and China. The current practices of these corporations stand in stark contrast to the vision of trustworthy AI, as well as human autonomy[2]: their LLM-based bots spread misinformation and propaganda and are used to replace human labor; the AI platforms form an oligopoly that can dictate prices and conditions; and the data used for training has been gathered without consent. The alignment of current big AI players with autocratic regimes in China and the US only heightens the concern that AI tools will increasingly undermine, rather than strengthen, digital autonomy (consider the case of Microsoft cutting off services for ICC members). To maintain autonomy – as well as competitiveness for all companies that wish to remain independent of a tech oligopoly – alternatives must be established along all steps of the LLM supply chain. In this paper, we focus on the software side of this supply chain, from the end users interacting with AI tools, through the deployment of the underlying LLMs and the training of these LLMs, to the collection of training data. Starting from the most urgent recommendations on the end-user side, we provide recommendations to promote human autonomy at each step of this supply chain.
LLM-based Tools
End users most immediately engage with LLMs via tools, most notably chat interfaces such as ChatGPT. To support the digital autonomy of end users, we therefore need to make sure that they do not become dependent on particular tools but have alternatives. This is particularly urgent since any delay means that end users become locked into platforms and products which use the usage fees and the accumulated personal data to strengthen their market position even further.
Hence, we need to offer alternatives for the most crucial tools: chat interfaces, research tools for scientific literature, and core educational tools such as AI plugins for digital learning platforms. Crucially, such tools should be hosted at universities themselves to avoid flows of personal data to third parties and to enable universities to design and adjust such tools to their research and teaching needs. Fortunately, this is achievable: the compute needs of the tools themselves are modest, as demonstrated by many success stories of universities hosting their own chat interface alternatives, e.g. via KI:connect.nrw[3] and HAWKI[4]. For literature search and AI plugins, developments are still in progress and urgently needed.
We recommend to:
- Provide project-based funding opportunities to develop new tools, both inside universities (e.g. via the Stiftung Innovation Hochschullehre) and beyond (e.g. via open-source development grants or ministry funding).
- Set up permanent development teams at the state or federal level which can maintain tools that have proven crucial (e.g. as open-source output of project-based funding) and develop them further. These could be embedded at AI competency hubs, as suggested by the “KI-Zukunftsfonds Hochschule”[5].
- Equip universities with sufficient funding for permanent staff who can introduce tools at the university level (e.g. for retrieval-augmented generation, RAG), and who can provide support and guidance to researchers, teachers, students, and administrators on how to use these tools responsibly (i.e., enhance AI literacy).
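The RAG use case mentioned above can be illustrated with a toy sketch of its retrieval step: rank a university's documents by similarity to a user query and hand the best matches to the LLM as context. Bag-of-words vectors and cosine similarity stand in for the learned embeddings a real deployment would use; the documents and queries are made up for illustration.

```python
# Toy sketch of the retrieval step in RAG. A real deployment would use
# learned embeddings and a vector database, but the principle is the same:
# represent query and documents as vectors, rank documents by similarity.
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector: token -> count (stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

# Illustrative university documents; the retrieved text would be prepended
# to the user's question before querying the (university-hosted) LLM.
docs = [
    "Exam regulations for the computer science bachelor",
    "Cafeteria menu for the current week",
]
print(retrieve("bachelor exam regulations", docs)[0])
```

Because retrieval runs against locally held documents, this pattern lets universities ground LLM answers in their own material without sending that material to a third party.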
LLM Deployment
To enable LLM-based tools, LLMs must be available in the first place. In particular, this means deploying copies of trained LLMs on powerful GPU servers that can respond to queries with low latency (a few seconds). Such deployment services can be bought from commercial providers – but this would, again, make all tools (and hence their users) dependent on the AI oligopoly. Therefore, we urgently need alternative LLM deployment options. However, to make LLM deployment efficient, we need some level of centralization to profit from scaling effects and pooled expertise. High performance computing centers are, hence, the prime actors to provide this service. We also know that such deployments are achievable, as GWDG in Göttingen[6] and OpenSourceKI.nrw[7] already provide success stories of effective and efficient deployment.
In line with the notion of a “KI-Zukunftsfonds Hochschule”[5], we recommend to:
- Provide substantial funds to equip Tier 2 High Performance Computing Centers with GPU server infrastructure to deploy multiple parallel copies of state-of-the-art open weight LLMs (with ca. 100 billion parameters).
- Provide Tier 2 High Performance Computing Centers with permanent staff to operate this infrastructure, update the models as needed, and develop new APIs for tool development. For research and teaching, this will have to be funded at the state and federal level (e.g. via ministry funds). For private companies, parallel infrastructure may be set up as part of AI (Giga-)factories, such as HammerHAI[8], and refinance itself via contracts.
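To sketch what such an API could look like: common open-source serving stacks expose an OpenAI-compatible chat endpoint, so a tool only needs to send a JSON request to the university-hosted server and no data leaves the institution. The endpoint URL and model name below are illustrative assumptions, not references to an existing deployment.

```python
# Minimal sketch of how an LLM-based tool could query a university-hosted
# deployment via an OpenAI-compatible chat API (as exposed by common
# open-source serving stacks). ENDPOINT and MODEL are made-up examples.
import json
import urllib.request

ENDPOINT = "https://llm.example-university.de/v1/chat/completions"  # hypothetical
MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # example open weight model

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Assemble a chat-completion request for the local deployment."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires a running deployment:
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request format is a de-facto standard, tools written against it can switch between deployments (or providers) by changing the endpoint URL, which is exactly the kind of substitutability that avoids lock-in.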
LLM Training
Before LLMs can be deployed, they need to be trained. Fortunately, several open weight LLMs are provided by private actors with substantial investment (e.g. Llama models out of the US, DeepSeek out of China, or Mistral models out of France). When deploying such pre-trained models, no data or control flows to the model creators and, since alternatives are available, we avoid dependencies on single creators. Hence, there is no urgent need to train alternative models. However, there is no guarantee that open weight models will continue to be provided by private actors, and the training practices themselves do not consistently respect principles of openness and autonomy[9]. Hence, we need to build the capability to train LLMs ourselves and to establish better training practices, without engaging in an “AI race”. Since building such capabilities is challenging and costly, we suggest centralizing this effort at the EU level. In more detail, we recommend to:
- Provide substantial funds to equip at least one Tier 1/Tier 0 High Performance Computing Center with sufficient GPU infrastructure to train state-of-the-art LLMs on the order of 100 billion parameters. The JUPITER system[10] at FZ Jülich provides a good practice example in this regard.
- Set up at least one large-scale training project with ca. 200 million EUR of funding for ca. 200 researchers and developers over ca. 3 years to demonstrate that open models can be trained. Such large-scale projects should pool expertise and staff across university research teams as well as research institutes and companies that have experience in training LLMs at the 8 billion parameter scale (e.g. in Darmstadt[11]). The OpenEuroLLM[12] and Open GPT-X[13] initiatives may be starting points.
LLM Training Data Collection
Current LLM training operates on data that has been collected without consent, is strongly biased towards the US-based, male, white, internet-savvy population, and is badly curated, containing vast amounts of toxic or at least questionable content[14]. It also becomes increasingly clear that LLM development is limited by the fact that no further reservoirs of publicly accessible, high-quality text will become available – everything that is available has already been used[15]. Hence, to provide a basis for autonomy-respecting LLM training in the future, we recommend taking first steps toward a long-term, global collection project for training data. More specifically, we recommend to:
- Set up a ten-year, long-term, global data collection project to gather high-quality, curated text data from sources that are currently under-represented. This data should be gathered with explicit, informed consent for LLM training, guaranteeing that the resulting LLMs will be available as a commons. The data collection should consider both direct data donations by individual authors and negotiations with publishers and other text-owning institutions. The Common Pile project[16] may be a starting point.
- Set up a network of data stewards and curators who implement this project and are funded under it, involving public libraries and NGOs (e.g. Wikimedia) with experience in licensing and maintaining open data. These data stewards should ensure long-term data maintenance under the FAIR principles and should make the data available for LLM training only under a public commons license, preventing privatization without consent.
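To make the consent and provenance requirements above concrete, the following sketch shows what a minimal donation record in such a collection could track. Every field name here is an illustrative assumption, not an existing standard or schema of the Common Pile project.

```python
# Illustrative sketch of a consent-aware donation record for a commons
# training corpus. All field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class DonationRecord:
    text: str                        # the donated document
    author: str                      # attribution, supporting FAIR provenance
    source: str                      # institution or platform of origin
    language: str                    # helps counter the current English bias
    consent_for_llm_training: bool   # explicit, informed consent
    license: str = "public-commons"  # commons license to prevent privatization

def accept(record: DonationRecord) -> bool:
    """Only records with explicit consent may enter the corpus."""
    return record.consent_for_llm_training

donation = DonationRecord(
    text="...", author="Jane Doe", source="public library",
    language="sw", consent_for_llm_training=True,
)
print(accept(donation))
```

Encoding consent and license as mandatory fields, rather than metadata added after the fact, is what allows data stewards to enforce the commons guarantee mechanically at ingestion time.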
Conclusion
We emphasize that all these recommendations can be implemented in parallel to gain sufficient speed. First success stories and examples driven forward by competent actors are already available at every step. The only thing needed is political action to make a public AI infrastructure happen and, thus, significantly strengthen human digital autonomy in the AI age.
[1] https://doi.org/10.1007/s10648-025-10020-8
[2] https://doi.org/10.1007/978-981-97-8638-1_7
[3] https://kiconnect.pages.rwth-aachen.de/pages/
[5] https://www.stifterverband.org/sites/default/files/2025-02/ki-zukunftsfords_hochschulen_2026-2030.pdf
[6] https://gwdg.de/en/services/application-services/ai-services/
[8] https://www.hlrs.de/press/detail/hammerhai-to-create-an-ai-factory-for-science-and-industry
[9] https://doi.org/10.1145/3630106.365900
[10] https://www.fz-juelich.de/de/aktuelles/news/pressemitteilungen/2025/europas-ki-turbo-jupiter-ai-factory
[11] https://hessian.ai/supercomputer-for-cutting-edge-ai-research-in-hesse/
[14] https://knowingmachines.org/models-all-the-way
[15] https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data