
Semantic Embedding for Mapping Clinical Metrics

Large Language Models (LLMs) use semantic embeddings to arrange and group topics based on their similarity by translating text into numerical vectors and measuring the proximity of these vectors in a high-dimensional space.

How does it work?

The Role of Embeddings and Vectors

  1. Encoding Semantic Meaning: The fundamental mechanism is the embedding, a high-dimensional encoding of tokens (words or pieces of words) that represents their semantic meaning. A transformer model, the architecture underlying most current LLMs, embeds (or encodes) data into this high-dimensional space.

  2. Representing Similarity: In this high-dimensional space, the semantic meaning is encoded such that words or phrases with similar meanings tend to land on vectors that are close to each other.

  3. Measuring Distance: To determine how related topics are, the distance or alignment between their corresponding vectors is measured. The principle is that the further apart phrases or topics are in the trained semantic space, the less related they are.
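The three steps above can be sketched with a toy example. The vectors below are hand-made, three-dimensional stand-ins for real embeddings (which typically have hundreds of dimensions), chosen only to illustrate how cosine similarity captures "close in meaning":

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: ~1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models use hundreds of dimensions).
fever = [0.9, 0.1, 0.0]
pyrexia = [0.85, 0.15, 0.05]   # near-synonym: its vector points the same way
fracture = [0.0, 0.2, 0.95]    # unrelated concept: nearly orthogonal vector

print(cosine_similarity(fever, pyrexia))   # close to 1.0
print(cosine_similarity(fever, fracture))  # close to 0.0
```

In a real pipeline the vectors would come from an embedding model rather than being written by hand; the distance computation is the same.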

Measuring Similarity and Grouping Topics

The process of grouping topics by similarity relies on calculating the distance or alignment between their semantic vectors.

Application in Retrieval-Augmented Generation (RAG)

In practical LLM applications, especially those requiring access to domain-specific or updated external knowledge, semantic embeddings are central to Retrieval-Augmented Generation (RAG).

By leveraging semantic embeddings and calculating vector proximity, LLMs and associated systems can effectively cluster, retrieve, and synthesize information based on conceptual similarity, rather than just keyword matching.
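The retrieval half of RAG can be sketched as ranking a document store by vector proximity to the query. The document names and vectors below are hypothetical; a real system would obtain both query and document embeddings from an embedding model and typically store them in a vector database:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-computed document embeddings.
doc_store = {
    "MMSE scoring guide":        [0.8, 0.3, 0.1],
    "Parkinson's motor scale":   [0.2, 0.9, 0.3],
    "Alzheimer's staging notes": [0.85, 0.25, 0.15],
}

def retrieve(query_vec, k=2):
    """Return the k documents whose embeddings best align with the query vector."""
    ranked = sorted(doc_store.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Embedding of a query about cognitive assessment scoring (hand-made here):
query = [0.82, 0.28, 0.12]
print(retrieve(query))  # the two cognition-related documents rank above the motor scale
```

The retrieved documents are then placed into the LLM's context so the generated answer is grounded in them; this is how conceptual similarity, rather than keyword overlap, drives retrieval.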

Clinical Applications

The utility of LLM embeddings in this domain stems from their ability to quantify the semantic relationships between textual data, which is crucial for identifying patterns that may correlate with disease progression or characteristics.

Direct Applications in Disease Context

A referenced paper highlights the explicit potential of this technology in neurological disorders: the concept of leveraging LLM embeddings is being explored for “Revolutionizing Semantic Data Harmonization in Alzheimer’s and Parkinson’s Disease”. Semantic data harmonization is a necessary step for comparing and analyzing complex textual data (like clinical assessment responses) across different studies or time points, which is fundamental to modeling disease progression.

Mechanism: Semantic Similarity and Vector Methods

The core function relies on translating the complex natural language responses from cognitive or clinical assessments into high-dimensional numerical representations:

  1. Embedding Meaning: An LLM produces an embedding, a high-dimensional encoding of tokens (words or pieces of words) that represents their semantic meaning.

  2. Quantifying Similarity: In this embedding space, words or phrases with similar meanings are represented by vectors that are located close to each other. The similarity between two textual responses (or assessment sections) is quantified using vector similarity methods.

  3. Metrics for Comparison: A common metric for evaluating closeness of meaning (semantic similarity) between texts is the cosine score, the cosine of the angle between the two vectors, which ranges from −1 to 1 and approaches 1 as the vectors align.

  4. Harmonizing Assessments: This ability to measure semantic similarity has been demonstrated in facilitating the harmonization of mental health questionnaires through Natural Language Processing (NLP). Applying this technique to cognitive/clinical assessments means that even if responses are phrased differently, they can be objectively grouped if they share the same underlying meaning (semantic similarity).
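The harmonization step described above can be sketched as matching each item from one assessment to its closest-meaning item in another. The questionnaire items and vectors below are illustrative assumptions standing in for real embedded assessment responses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings for items from two differently worded assessments.
study_a = {
    "Do you often forget recent events?": [0.9, 0.1, 0.2],
    "Do you feel down or hopeless?":      [0.1, 0.9, 0.1],
}
study_b = {
    "Trouble remembering things that happened recently": [0.88, 0.15, 0.18],
    "Frequent feelings of sadness":                      [0.12, 0.85, 0.15],
}

def harmonize(a_items, b_items):
    """Map each item in study A to its closest-meaning item in study B."""
    return {
        qa: max(b_items, key=lambda qb: cosine(va, b_items[qb]))
        for qa, va in a_items.items()
    }

for qa, qb in harmonize(study_a, study_b).items():
    print(f"{qa!r}  ->  {qb!r}")
```

Even though no words are shared between the matched items, the vectors capture the shared meaning, which is exactly what allows differently phrased responses to be grouped objectively.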

Utility for Modeling Progression

By leveraging semantic embeddings, researchers can potentially model disease progression, for example by comparing how a patient's assessment responses move through the semantic space across visits.

Using Compute Canada / DRAC to Embed data

Running a Large Language Model (LLM) to embed a collection of text for later retrieval involves significant machine learning computation and data handling, and it raises several challenges around data management, resource usage, and job scheduling within a High-Performance Computing (HPC) cluster environment.

Here are the key challenges associated with performing large-scale LLM embedding tasks:

1. Data Management and I/O Performance

LLM embedding typically requires reading a large text collection, which creates substantial I/O demands. The distributed filesystem architecture on HPC clusters introduces specific performance bottlenecks, particularly when a corpus consists of many small files.
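A common mitigation is to pack the corpus into a single archive so the parallel filesystem serves one large sequential read instead of thousands of small ones, then extract to node-local storage inside the job. The paths and the `embed.py` script below are hypothetical placeholders:

```shell
# Done once, before submitting: pack many small text files into one archive.
tar -cf ~/project/corpus.tar -C ~/project corpus/

# Inside the job script: extract to fast node-local storage, work there,
# and copy only the final results back to the shared filesystem.
tar -xf ~/project/corpus.tar -C "$SLURM_TMPDIR"
python embed.py --input "$SLURM_TMPDIR/corpus" --output "$SLURM_TMPDIR/embeddings"
cp -r "$SLURM_TMPDIR/embeddings" ~/project/
```

`$SLURM_TMPDIR` points at node-local scratch space that exists only for the job's lifetime, which is why results must be copied out before the job ends.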

2. Computational Scale and Job Scheduling

LLMs require substantial resources, often including GPUs, and the embedding process may be lengthy or involve many iterations, which adds scheduling complexity.
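A GPU embedding job on a DRAC cluster is submitted through SLURM. The sketch below uses the placeholder account `def-someuser` and a hypothetical `embed.py`; the resource values are illustrative and should be sized to the actual workload:

```shell
#!/bin/bash
#SBATCH --account=def-someuser     # placeholder allocation account
#SBATCH --gpus-per-node=1          # one GPU for the embedding model
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=03:00:00            # request only what the run realistically needs

module load python/3.11
source ~/envs/embed/bin/activate   # assumes an environment built in advance
python embed.py --input "$SLURM_TMPDIR/corpus" --output "$SLURM_TMPDIR/embeddings"
```

Shorter, accurately sized time requests generally schedule sooner; very long collections are often split into chunks embedded by an array of smaller jobs rather than one monolithic run.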

3. Software Environment Challenges

Setting up the necessary environment for LLMs (which rely heavily on Python and specialized libraries) requires following the cluster's specific software-environment directives.
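On DRAC systems the documented pattern is to load a Python module, build a virtual environment with `--no-download`, and install packages with `--no-index` so pip uses the cluster's pre-built wheels. The package list below is an assumption; whether a given library (e.g. sentence-transformers) has a cluster wheel must be checked with `avail_wheels`:

```shell
module load python/3.11
virtualenv --no-download "$SLURM_TMPDIR/env"   # build the env on node-local storage
source "$SLURM_TMPDIR/env/bin/activate"
pip install --no-index --upgrade pip
pip install --no-index torch                    # assumes a cluster wheel exists
```

Building the environment inside `$SLURM_TMPDIR` at job start avoids the many-small-files overhead that a shared-filesystem environment would incur on every import.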