The Yarowsky algorithm, introduced by David Yarowsky in 1995, is a semi-supervised bootstrapping method for word sense disambiguation (WSD) in computational linguistics. WSD involves determining the correct meaning of a word based on its context, which is crucial for tasks like machine translation and information retrieval. The algorithm needs only a few seed examples per sense plus a large untagged corpus, rather than a fully sense-tagged training set.
Key Concepts:
- One Sense Per Collocation: This principle posits that a word typically exhibits only one sense within a specific collocation. Here a collocation is a word that occurs near the target word, either adjacent to it or within a window of surrounding context. For example, the word “bank” in the phrase “river bank” refers to the side of a river, while in “savings bank,” it refers to a financial institution.
- One Sense Per Discourse: This principle suggests that a word tends to maintain a consistent sense throughout a discourse, such as a single document. For instance, if “interest” is used in its financial sense once in an article, its other occurrences in the same article very likely carry the financial sense too, rather than the sense of personal curiosity.
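The one-sense-per-discourse principle can be illustrated as a majority vote within each document. The following is a minimal sketch, not the formulation from the original paper: the function name `apply_one_sense_per_discourse`, the `min_majority` threshold, and the input layout are all assumptions made for illustration.

```python
from collections import Counter


def apply_one_sense_per_discourse(labels_by_doc, min_majority=0.6):
    """Relabel every occurrence in a document with its majority sense.

    labels_by_doc maps a document id to a list of sense labels, one per
    occurrence of the target word; None marks an untagged occurrence.
    min_majority is a hypothetical cutoff: the majority sense is only
    propagated if it covers at least this share of the tagged occurrences.
    """
    result = {}
    for doc_id, labels in labels_by_doc.items():
        tagged = [sense for sense in labels if sense is not None]
        if not tagged:
            # Nothing to vote on; leave the document untouched.
            result[doc_id] = list(labels)
            continue
        sense, count = Counter(tagged).most_common(1)[0]
        if count / len(tagged) >= min_majority:
            # Extend the dominant sense to all occurrences, tagged or not.
            result[doc_id] = [sense] * len(labels)
        else:
            result[doc_id] = list(labels)
    return result
```

With this sketch, a document whose tagged occurrences mostly agree has the dominant sense extended to its untagged and minority occurrences, while a document with no clear majority is left unchanged.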
Algorithm Overview:
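- Seed Collocations Identification: The algorithm begins with a large, untagged corpus and identifies examples of the polysemous word in context. It then selects a small number of seed collocations that are representative of each sense. For example, for the word “plant,” seed collocations like “life” and “manufacturing” might be chosen to represent different senses. Occurrences that contain a seed collocation are tagged with the corresponding sense; all other occurrences remain untagged.
- Decision List Learning: A decision list algorithm is trained on the tagged examples to identify other reliable collocations. For each collocation, it estimates the probability of each sense given that collocation and computes the log-likelihood ratio between the two senses. The decision list is ranked by the magnitude of this ratio, so the most reliable evidence is consulted first, and smoothing is applied so that collocations with a zero count for one sense do not yield undefined or infinite ratios.
- Iterative Refinement: The classifier is applied iteratively to the untagged corpus. Examples classified with high confidence are added to the tagged sets, and the decision list is retrained on the enlarged sets. The one-sense-per-discourse principle can optionally be used at this stage to extend or correct labels within a document. This process continues until the labeling converges, that is, until no more reliable collocations are found.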
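The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Yarowsky's exact procedure: it handles only two senses, treats every word in a context as a candidate collocation, uses additive smoothing with a hypothetical constant `alpha`, and accepts a rule only if its score reaches a hypothetical `threshold`. The function names and the toy corpus are invented for the example.

```python
import math
from collections import defaultdict


def learn_decision_list(contexts, labels, senses, alpha=0.1):
    """Rank (collocation, sense) rules by smoothed log-likelihood ratio."""
    counts = defaultdict(lambda: defaultdict(float))
    for ctx, sense in zip(contexts, labels):
        if sense is None:
            continue  # untagged examples contribute no evidence
        for word in set(ctx):
            counts[word][sense] += 1
    a, b = senses
    rules = []
    for word, c in counts.items():
        # alpha keeps the ratio finite when one sense has a zero count.
        llr = math.log((c[a] + alpha) / (c[b] + alpha))
        rules.append((abs(llr), word, a if llr > 0 else b))
    rules.sort(reverse=True)  # strongest evidence first
    return rules


def classify(ctx, rules, threshold):
    """Return the sense of the first matching rule above the threshold."""
    words = set(ctx)
    for score, word, sense in rules:
        if score < threshold:
            break  # remaining rules are too weak to trust
        if word in words:
            return sense
    return None


def yarowsky(contexts, seeds, senses, threshold=1.5, max_iter=10):
    """Bootstrap sense labels from seed collocations."""
    # Step 1: tag only the contexts that contain a seed collocation.
    labels = [next((seeds[w] for w in ctx if w in seeds), None)
              for ctx in contexts]
    rules = []
    for _ in range(max_iter):
        # Step 2: learn a decision list from the current labels.
        rules = learn_decision_list(contexts, labels, senses)
        # Step 3: relabel the corpus, keeping only confident decisions.
        new_labels = [classify(ctx, rules, threshold) for ctx in contexts]
        if new_labels == labels:
            break  # converged: no label changed
        labels = new_labels
    return labels, rules


# Toy corpus for the word "plant" (invented for illustration).
contexts = [
    ["plant", "life", "species"],
    ["manufacturing", "plant", "equipment"],
    ["species", "plant", "animal"],
    ["equipment", "plant", "workers"],
    ["animal", "plant", "garden"],
]
seeds = {"life": "living", "manufacturing": "factory"}
labels, rules = yarowsky(contexts, seeds, ("living", "factory"))
print(labels)
```

In the toy corpus, the last context contains neither a seed nor a word that co-occurs with a seed. It is labeled only in the second iteration, after “animal” has become a reliable collocation through the third context, which shows how labels spread beyond the seeds.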
Applications:
The Yarowsky algorithm is relevant to tasks that depend on accurate WSD, including:
- Machine Translation: Improving the accuracy of translating words with multiple meanings by selecting the appropriate sense based on context.
- Information Retrieval: Enhancing search engine results by disambiguating query terms to retrieve more relevant documents.
- Text Mining: Facilitating the extraction of meaningful information from large text corpora by accurately interpreting polysemous words.