Tokenisation, Vectorisation and Copyright: Legal Implications of AI Training

Jun. 26, 2025 • Suraj

Student's Pen

Artificial intelligence (AI) is the ability of computers to complete tasks that would otherwise require human intelligence. AI in the form we see today is a novel technological development that has existed for barely a decade. Today’s AI is largely generative AI, which has developed rapidly since late 2022, after the public rollout of OpenAI’s ChatGPT. This development can be attributed to Large Language Model (LLM) technology.

LLMs, however, have triggered a complex legal debate surrounding copyright infringement. At its heart lies the question of whether taking someone else’s creative expression and converting it into machine-readable form, that is, into tokens and vectors, amounts to copyright infringement. Before answering this legal question, let us understand how an LLM works.

An LLM is a type of AI that is trained to recognise patterns in human text using millions or even billions of data samples. It then generates human-like text in response to a user’s prompt. During training, the model first breaks the text down into smaller units called tokens (tokenisation) and then assigns these tokens a numerical form (vectorisation) so that they can be processed mathematically.

AI learns by predicting the next word in a sentence, checking whether it guessed correctly, and adjusting itself each time the prediction is wrong. This process is repeated billions of times, and the model gradually improves. By way of analogy, an AI learns through trial and error, much like a student who practises and refines their answers.
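The predict, check and adjust cycle described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the two-sentence corpus, the `scores` table and the `learning_rate` value are hypothetical, and a real LLM adjusts billions of parameters inside a neural network rather than a small table of word pairs.

```python
import math

# Hypothetical toy corpus; real training data is vastly larger.
corpus = "the cat sat on the wall . the dog sat on the wall .".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# scores[prev][nxt]: adjustable numbers the toy model learns (all start at zero).
scores = [[0.0] * len(vocab) for _ in vocab]

def softmax(row):
    """Turn a row of raw scores into probabilities that sum to 1."""
    highest = max(row)
    exps = [math.exp(s - highest) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_one_pass(learning_rate=0.5):
    """Predict each next word, measure the error, and nudge the scores."""
    total_loss = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        p, n = index[prev], index[nxt]
        probs = softmax(scores[p])           # predict
        total_loss += -math.log(probs[n])    # check: low probability for the true word means high loss
        for j in range(len(vocab)):          # adjust: raise the true word's score, lower the others
            target = 1.0 if j == n else 0.0
            scores[p][j] -= learning_rate * (probs[j] - target)
    return total_loss

# Repeat the cycle; the total error shrinks as the scores improve.
losses = [train_one_pass() for _ in range(50)]

def predict_next(word):
    """Return the most probable next word under the trained scores."""
    probs = softmax(scores[index[word]])
    return vocab[probs.index(max(probs))]

print(predict_next("sat"))
```

After the repeated passes, the toy model's error is lower than at the start and it predicts “on” after “sat”, because that is the only continuation it has seen.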

While AI developers claim that these operations are non-expressive and purely mathematical, questions have emerged about whether they constitute unauthorised reproduction of copyrighted material. This article aims to break down the complex legal and technological tussle that is playing out not only around the world but also in India, in the form of Asian News International versus OpenAI (ANI vs OpenAI).

Understanding Tokenisation and Vectorisation

Tokenisation is the process by which ChatGPT breaks down collected text into smaller units called tokens. Vectorisation then converts these tokens into numerical representations that capture the relationships and patterns between them. The model learns from these representations and generates text from them, because ChatGPT ultimately processes language in mathematical form and produces its output on the basis of probabilities.

The following illustration shows what tokenisation and vectorisation mean:

For instance, consider the data: “The cat and dog sat on the wall.”

Tokenisation would break this data into smaller units: [“The”, “cat”, “and”, “dog”, “sat”, “on”, “the”, “wall”, “.”]
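As a rough sketch, the split shown above can be reproduced with a short Python function. The function name `simple_tokenise` and the word-and-punctuation rule are assumptions made for illustration; production LLM tokenisers typically split text into sub-word pieces rather than whole words.

```python
import re

def simple_tokenise(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenise("The cat and dog sat on the wall.")
print(tokens)
# ['The', 'cat', 'and', 'dog', 'sat', 'on', 'the', 'wall', '.']
```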

Vectorisation would assign each of these tokens a numerical value (*Note: a vector is a quantity described by both magnitude and direction; here it is written as a list of numbers):

“cat” → [0.12, -0.45, 0.88, ...]

“dog” → [0.13, -0.50, 0.91, ...]

“wall” → [0.40, 0.22, -0.77, ...]

These vectors are based on contextual similarity. For example, “cat” and “dog” will have similar vectors because they often appear in similar contexts, such as pets or animals.
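One common way to measure how close two vectors are is cosine similarity, which compares their directions. The sketch below applies it to the sample values above. It uses only the three components displayed for each word (the remaining components are elided in the illustration), and the helper name `cosine_similarity` is chosen here for illustration.

```python
import math

# Only the first three components shown in the illustration are used;
# the elided components are not guessed at.
vectors = {
    "cat": [0.12, -0.45, 0.88],
    "dog": [0.13, -0.50, 0.91],
    "wall": [0.40, 0.22, -0.77],
}

def cosine_similarity(a, b):
    """Return a value near 1 for vectors pointing the same way, near -1 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
cat_wall = cosine_similarity(vectors["cat"], vectors["wall"])
print(f"cat vs dog:  {cat_dog:.2f}")   # very close to 1
print(f"cat vs wall: {cat_wall:.2f}")  # negative
```

On these sample numbers, “cat” and “dog” point in almost the same direction, while “cat” and “wall” point in nearly opposite directions.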

[Figure: sample graph showing that the “cat” and “dog” vectors are close together while the “wall” vector is different.]

Once the sentence is tokenised and vectorised, the model uses these vectors to learn the patterns and relationships between words, understand the structure of the sentence, and predict the next word. That prediction is a probability-based outcome.
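To see what a probability-based outcome looks like, the sketch below counts which token follows which in the sample sentence and turns the counts into probabilities. This counting approach is a deliberate simplification and an assumption of this example: an LLM derives its probabilities from learned vectors, not from a lookup of word pairs. The tokens are lower-cased so that “The” and “the” are treated as the same word.

```python
from collections import Counter, defaultdict

# Tokens of the sample sentence, lower-cased for this illustration.
tokens = ["the", "cat", "and", "dog", "sat", "on", "the", "wall", "."]

# For every token, count the tokens that immediately follow it.
follower_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follower_counts[prev][nxt] += 1

def next_word_probabilities(word):
    """Return each observed follower of `word` with its share of the total."""
    counts = follower_counts[word]
    total = sum(counts.values())
    return {nxt: count / total for nxt, count in counts.items()}

print(next_word_probabilities("the"))  # {'cat': 0.5, 'wall': 0.5}
print(next_word_probabilities("sat"))  # {'on': 1.0}
```

After “the”, the sketch is evenly split between “cat” and “wall”; after “sat”, it is certain of “on”. The output is a set of likelihoods for the next word, not a stored copy of the sentence.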

LLMs and Conflict with Law

Under the Copyright Act, 1957 (the Act), content that demonstrates originality and expressiveness is protected under Section 13(1)(a)[1], which covers original literary works. This protection extends to any content that involves human skill, judgment, and effort. Where such content is used without permission, even if it is publicly accessible, the use may amount to copyright infringement. Under the Act, storing or reproducing a protected work without authorisation is considered a violation of copyright.

Expressive work involves human creativity and original expression, such as how an idea is written, structured, or presented. Non-expressive work is purely technical or functional, such as copying data in a way that doesn't reproduce the original wording or style.

AI engineers argue that the responses generated are factual and are mathematical predictions of words, both of which are non-expressive and hence not protected by copyright. Moreover, many people are of the opinion that minor infringement, if any, is a cost of technological advancement that the world should bear for the greater good.

The Road Ahead

The Delhi High Court has admitted a first-of-its-kind copyright violation case in India[2]. ANI has alleged that OpenAI used its publicly available copyrighted content to train its LLMs. Because AI has intermingled with every sphere of life, this case may pave the way for future litigation on AI and music, art, literature, and much more.

What makes this case important is the huge responsibility it carries to set things right. As of now, ChatGPT provides the majority of its services at minimal to no cost. This has been made possible by the billions of dollars the company has raised to support the costs and operations of the business. If reports are to be believed, the startup, founded in 2015, plans to move to a for-profit model in the future. The result of this shift would be increased costs and burdens on the entire AI industry. To avoid such a detrimental impact, the author is of the view that a harmonious approach, where both law and technology may converge, should be explored.

REFERENCES

[1] Section 13: Works in which copyright subsists.

(1) Subject to the provisions of this section and the other provisions of this Act, copyright shall subsist throughout India in the following classes of works, that is to say, --

(a) original literary, dramatic, musical and artistic works;

[2] ANI Media Pvt Ltd v. Open AI Inc & Anr., CS(COMM) 1028/2024.
