Whether for internal operational efficiency (knowledge management, intelligent search, and smart assistants) or customer-facing services (self-serve customer service, sales and marketing campaigns), enterprises are increasingly building generative artificial intelligence (GenAI) into their strategies.
If your organization is preparing to build GenAI applications, you'll need a strong understanding of the fundamental concepts behind large language models (LLMs). By grasping how GenAI operates and how different LLMs compare, businesses can make informed decisions about getting the most out of the technology.
Model size refers to the number of parameters in an LLM. Just as people who know cars pay attention to horsepower and the number of cylinders in an engine, people who evaluate LLMs pay attention to the parameter count.
Parameters are the elements within the model that are learned from the training data. They represent the knowledge the model has acquired. They’re what the model uses to make predictions or generate text. Each parameter can be thought of as a small piece of information that the model uses to understand language.
Naturally, a model with more parameters can store more information and capture more nuance in a piece of input text. LLMs with more parameters typically deliver better performance and accuracy.
So, what kinds of numbers are we talking about? To give you a clearer picture, consider a few well-known examples: GPT-3 has 175 billion parameters, Meta's Llama 2 ships in 7, 13, and 70 billion parameter versions, and Mistral 7B has roughly 7 billion, while the sizes of the largest proprietary models, such as GPT-4, are not publicly disclosed.
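If you want to see these numbers for yourself, a quick way is to load an open model and count its parameters. Here's a minimal sketch, assuming the Hugging Face transformers library is installed and using GPT-2 purely as a small, freely downloadable example:

```python
# A minimal sketch of checking model size, assuming the Hugging Face
# "transformers" library is installed; GPT-2 is used only because it is
# small and freely downloadable.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Every weight tensor holds a chunk of learned parameters; summing their
# element counts gives the model's total parameter count.
total_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 has roughly {total_params / 1e6:.0f} million parameters")
```

The same few lines work for any open model on the Hugging Face Hub, so you can compare candidate models before committing to one.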
When choosing an LLM for your application, why not simply pick the largest model, since it seemingly performs best? Because the benefits of a larger model come with increased computational demands and costs.
Larger models require more powerful hardware, such as high-end GPUs, TPUs, or AI accelerators (also known as NPUs), to train and run efficiently. So, to start, you'll be paying higher costs for the hardware. On top of that, you'll pay for the energy consumption and the cooling systems needed to keep such equipment running.
In addition to increased resource usage and cost, larger models often take longer to train and may require more sophisticated infrastructure to handle the data processing load.
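To get a feel for why hardware costs climb so quickly, a rough rule of thumb is that each parameter stored in 16-bit precision takes two bytes of memory. The sketch below is a back-of-the-envelope estimate of the weight memory alone; real deployments also need room for activations, the KV cache, and (during training) optimizer state:

```python
# A back-of-the-envelope estimate of weight memory, assuming 16-bit
# (2-byte) parameters. Activations, the KV cache, and optimizer state
# during training all add to this, so treat the result as a lower bound.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1e9

for label, params in [("7B model", 7e9), ("70B model", 70e9), ("175B model", 175e9)]:
    print(f"{label}: ~{weight_memory_gb(params):.0f} GB of weights at 16-bit precision")
```

By this estimate, a 7-billion-parameter model needs about 14 GB just for its weights, while a 175-billion-parameter model needs roughly 350 GB, which is why the largest models demand multiple high-end accelerators.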
Therefore, while it might be tempting to simply choose the largest model available, you must consider whether the performance gains justify the additional costs and resource requirements. For many applications you build, a slightly smaller model may offer sufficient performance—without the high resource demands.
Evaluating your specific needs and constraints will help you make a more informed decision about the optimal model size for your enterprise. Finding the right balance between model size, performance, and cost is key to successfully implementing LLM-based applications.
Training data is the dataset used to teach an LLM how to understand and generate text. The quality and quantity of this data directly influence how well the model performs. When preparing to develop and deploy GenAI, you will need to familiarize yourself with the training data and methodologies behind your chosen LLM.
High-quality training data is diverse and extensive. It should cover a wide range of topics and language patterns. This diversity helps a model generalize better, meaning it can effectively apply what it has learned to new, unseen data. For example, if the training data includes a variety of texts from different domains—such as literature, science, and casual conversation—then the model will be better equipped to handle diverse queries in real-world applications.
Gathering and curating high-quality training data can be a challenging task. LLM builders work hard to ensure that the data is free from biases and represents different perspectives fairly. If the training data contains biases, the model's outputs may reflect them. This can be problematic in applications like customer service or content moderation.
In addition, you will need to ask questions about how an LLM’s training data was sourced. Were data privacy and copyright laws respected? All of these factors will play into your evaluation of ethical considerations.
While training data quality is important, quantity matters just as much. Larger datasets give a model more examples to learn from, which generally leads to better performance. However, there is a point of diminishing returns, beyond which adding more data only increases computational costs without significantly improving performance.
Tokenization is the process of converting text into tokens that a model can understand. This is an essential preprocessing step, and it impacts how a model interprets and generates text.
Different tokenization methods can affect the model's performance, accuracy, and efficiency. Tokenization breaks text down into tokens, but what constitutes a token depends on the method used. The main types of tokenization include:
- Word-level tokenization, which treats each word as a token. It's simple, but it handles rare or unseen words poorly and requires a very large vocabulary.
- Character-level tokenization, which treats each character as a token. It can represent any input, but it produces long token sequences that are slower and costlier to process.
- Subword tokenization (such as byte-pair encoding, WordPiece, or SentencePiece), which keeps common words intact while splitting rare words into smaller, reusable pieces.
Subword tokenization is the approach most LLMs take because it strikes a balance between handling rare words and maintaining efficiency. Efficient tokenization reduces the number of tokens a model needs to process, improving processing speed and lowering computational costs.
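To make this concrete, here's a small sketch of subword tokenization in action, assuming the Hugging Face transformers library and using the GPT-2 tokenizer as an example (other models use their own vocabularies, so the exact splits will differ):

```python
# A small sketch of subword tokenization, assuming the Hugging Face
# "transformers" library; the GPT-2 tokenizer is used as the example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization breaks an unfamiliar word like anthropomorphization into pieces."
tokens = tokenizer.tokenize(text)

# Common words usually map to a single token; rare words are split into
# several subword units.
print(tokens)
print(f"{len(tokens)} tokens for {len(text.split())} words")
```

Notice that common words stay whole while the rare word gets split into several subword pieces, which is how the model copes with vocabulary it has rarely seen.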
If you use a proprietary LLM, you will likely be charged based on the number of tokens processed and returned. This means that the efficiency of the underlying model’s tokenization method will directly impact your costs. Fewer tokens lead to lower costs. So, it’s important to choose a model with a tokenization method that balances performance with cost efficiency.
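As a simple illustration, you can estimate per-request cost directly from token counts. The prices in the sketch below are hypothetical placeholders rather than any provider's actual rates, so substitute the numbers from your provider's pricing page:

```python
# A sketch of estimating per-request cost from token counts. These prices
# are hypothetical placeholders, not any provider's actual rates.
PRICE_PER_1K_INPUT_TOKENS = 0.0005   # hypothetical USD per 1,000 prompt tokens
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # hypothetical USD per 1,000 generated tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# Example: a 1,200-token prompt that produces a 400-token answer.
print(f"Estimated cost per request: ${estimate_cost(1200, 400):.4f}")
```

Multiplied across millions of requests, even small differences in tokens per request add up, which is why tokenization efficiency belongs in your cost model.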
As you consider how different LLMs process text, pay attention to tokenization. With a more efficient tokenization method, you will see improved model performance and reduced computational demands. These are important factors when it comes to choosing and optimizing an LLM for specific applications.
The architecture of an LLM impacts its capabilities, efficiency, and performance. All mainstream LLMs, both open and proprietary, are built on the transformer architecture, which lets them handle long-range dependencies in text more effectively than earlier approaches. The architecture also processes input tokens in parallel, allowing for fast training and inference.
Understanding the architecture is also essential for practical implementation, since it informs decisions on hardware requirements, training time, and deployment strategies. Within the transformer family, the main configurations include:
- Encoder-only models (such as BERT), which build representations of input text and are well suited to understanding tasks like classification and semantic search.
- Decoder-only models (such as the GPT family and Llama), which generate text one token at a time and power most of today's chat and text-generation applications.
- Encoder-decoder models (such as T5), which map an input sequence to an output sequence and are a natural fit for translation and summarization.
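As a quick sketch of how these configurations show up in practice, the snippet below loads one small, publicly available example of each, assuming the Hugging Face transformers library; the specific model names are just illustrative choices:

```python
# A minimal sketch of the three transformer configurations, assuming the
# Hugging Face "transformers" library; the model names are just small,
# publicly available examples of each configuration.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")        # encoder-only: text understanding
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")          # decoder-only: text generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # encoder-decoder: sequence-to-sequence

for label, m in [("encoder-only", encoder_only),
                 ("decoder-only", decoder_only),
                 ("encoder-decoder", encoder_decoder)]:
    print(label, "->", type(m).__name__)
```

In practice, the configuration you choose follows from the task: representation and retrieval lean toward encoder-only models, open-ended generation toward decoder-only models, and sequence-to-sequence tasks toward encoder-decoder models.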
Knowing the fundamental concepts behind LLMs is not just a technical necessity but a strategic one for enterprises looking to make informed decisions. This knowledge allows them to optimize their GenAI usage while balancing their organization's specific needs and constraints.
Model size, training data, tokenization, and architecture all play a role in how an LLM performs, but they're not the only factors to consider. In part 2, we'll cover the mechanisms behind LLMs, the intricacies of fine-tuning, and the inference process that drives an LLM's capabilities.