
In the context of Artificial Intelligence (AI), Machine Learning (ML), and particularly Natural Language Processing (NLP), a token is a unit of text produced during tokenization, the process of splitting raw text into the pieces a model actually operates on.

Types of Tokens:

  • Word Tokens: These are individual words separated by spaces. For example, the sentence “AI is transforming industries” would be tokenized into the words “AI,” “is,” “transforming,” and “industries.”
  • Subword Tokens: Words are broken down into smaller units called subwords, as in schemes such as Byte Pair Encoding (BPE) or WordPiece. This approach helps handle rare or out-of-vocabulary words by representing them as combinations of more common subword units.
  • Character Tokens: This method treats each character as a separate token. While this approach can handle any text, it may lose semantic meaning present in word-level tokens.
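The three levels above can be sketched in a few lines of Python. The word and character splits are straightforward; the subword example uses a toy greedy longest-match against a tiny hand-picked vocabulary, standing in for a learned scheme like BPE or WordPiece (the vocabulary and `subword_tokenize` helper here are illustrative assumptions, not a real trained tokenizer):

```python
sentence = "AI is transforming industries"

# Word tokens: split on whitespace.
word_tokens = sentence.split()
# → ['AI', 'is', 'transforming', 'industries']

# Character tokens: every character (including spaces) becomes a token.
char_tokens = list(sentence)

# Subword tokens: toy greedy longest-prefix match against a tiny
# illustrative vocabulary (a real BPE/WordPiece vocabulary is learned
# from a corpus, not written by hand).
vocab = {"AI", "is", "trans", "form", "ing", "industri", "es"}

def subword_tokenize(word, vocab):
    """Greedily take the longest known prefix; fall back to single chars."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character as its own token
            i += 1
    return tokens

subword_tokens = [t for w in sentence.split()
                  for t in subword_tokenize(w, vocab)]
# → ['AI', 'is', 'trans', 'form', 'ing', 'industri', 'es']
```

Note how the subword split lets the rare word "transforming" be represented from common pieces ("trans", "form", "ing"), while the character split guarantees coverage of any input at the cost of much longer sequences.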