Development · January 15, 2025 · 8 min read

Introducing CorpusSense 2.0: semantic search across 23 languages

Our biggest update yet brings multilingual semantic search powered by sentence embeddings, a redesigned analysis dashboard, and expanded LLM insight capabilities.

Antonio J. Moreno Ortiz, Principal Investigator, Tecnolengua


When we first started building CorpusSense, we knew that search would be at the heart of the experience. Researchers need to find exactly what they're looking for — whether that's a specific word, a grammatical pattern, or a broader concept spread across thousands of documents.

With version 2.0, we're introducing semantic search: the ability to find text by meaning, not just by exact terms. For researchers working with large multilingual corpora, this means a query can surface relevant passages even when they share no vocabulary with it.

How semantic search works

Traditional corpus search relies on exact string matching or regular expressions. These are powerful tools, but they miss conceptual connections. If you search for "climate anxiety", you won't find passages about "eco-grief" or "environmental dread", even though they express closely related ideas.

Semantic search uses sentence embeddings — dense vector representations of text generated by transformer models — to find passages that are conceptually similar to your query, regardless of the specific words used.
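To make "conceptually similar" concrete, here is a minimal sketch of how two embeddings are usually compared: by cosine similarity, the cosine of the angle between the vectors. The three-dimensional vectors below are hand-picked, hypothetical values for illustration only. Real sentence embeddings come from a transformer model and have hundreds of dimensions, and this is not CorpusSense's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; a real model would produce these vectors.
query = [0.9, 0.1, 0.2]  # stands in for "climate anxiety"
passages = {
    "eco-grief": [0.8, 0.2, 0.3],
    "environmental dread": [0.85, 0.05, 0.25],
    "tax policy": [0.1, 0.9, 0.1],
}

# Rank passages from most to least similar to the query.
ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query, passages[name]),
    reverse=True,
)
print(ranked)
```

With these toy values, the two environment-related passages rank well above the unrelated one, even though none of them shares a word with the query.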

Semantic search doesn't replace lexical search — it complements it. The real power comes from combining both modes in a single query workflow.

— Dr. Moreno Ortiz

The embedding pipeline

Each text in your corpus is processed through our embedding pipeline at upload time. We use multilingual sentence transformers to generate 768-dimensional vectors for every segment, which are then indexed for fast approximate nearest-neighbor search.


Fig. 1 — The semantic embedding pipeline processes texts at upload time, enabling sub-second search across millions of segments.
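For readers curious about what "approximate nearest-neighbor search" involves, the sketch below shows one classic technique, random-hyperplane hashing. It is an assumption chosen for illustration, since this post does not say which index CorpusSense uses. The class name `HyperplaneIndex`, the segment ids, and the random vectors standing in for real embeddings are all hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneIndex:
    """Toy approximate nearest-neighbor index (random-hyperplane hashing)."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit to a vector's bucket key.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # Record which side of each hyperplane the vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes
        )

    def add(self, segment_id, vec):
        self.buckets[self._key(vec)].append((segment_id, vec))

    def search(self, query, top_k=5):
        # Only score vectors in the query's bucket. This is what makes the
        # search fast, and also approximate: true neighbors that landed in
        # another bucket are missed.
        candidates = self.buckets.get(self._key(query), [])
        scored = sorted(
            ((cosine(query, vec), sid) for sid, vec in candidates), reverse=True
        )
        return [(sid, score) for score, sid in scored[:top_k]]


# Random 768-dimensional vectors stand in for real segment embeddings.
rng = random.Random(42)
vectors = {f"seg-{i}": [rng.gauss(0, 1) for _ in range(768)] for i in range(200)}

index = HyperplaneIndex(dim=768)
for sid, vec in vectors.items():
    index.add(sid, vec)

hits = index.search(vectors["seg-7"], top_k=3)
print(hits[0][0])
```

Production indexes typically reduce the miss rate with several hash tables or with graph-based structures. The trade-off stays the same: a small loss in recall in exchange for not comparing the query against every segment.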

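# Example: semantic search API call
# (assumes `corpus` is a corpus object that has already been loaded)
results = corpus.search(
    query="climate anxiety",       # free-text query, matched by meaning
    mode="semantic",               # semantic rather than lexical matching
    languages=["en", "es", "fr"],  # languages to search
    top_k=50                       # maximum number of results to return
)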

What's new in 2.0

— Semantic search across all 23 supported languages

— Cross-lingual retrieval (search in English, find results in Spanish)

— Combined mode: semantic + lexical + pattern in one query

— Redesigned results interface with similarity scores

— Expanded LLM insights: 15 analytical dimensions (up from 10)
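The combined mode listed above can be pictured as a pattern filter followed by a weighted blend of scores. The sketch below is one plausible way to do that, not a description of how CorpusSense ranks results. The function names, the `alpha` weight, and the precomputed semantic scores are all hypothetical.

```python
import re


def lexical_score(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    words = set(re.findall(r"\w+", text.lower()))
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def combined_search(query, segments, semantic_scores, pattern=None, alpha=0.5):
    """Rank segments by alpha * semantic + (1 - alpha) * lexical.

    segments:        dict of segment id -> text
    semantic_scores: dict of segment id -> similarity to the query, assumed
                     to come from an embedding model (made up here)
    pattern:         optional regex; segments that do not match are dropped
    """
    regex = re.compile(pattern) if pattern else None
    results = []
    for sid, text in segments.items():
        if regex and not regex.search(text):
            continue
        score = alpha * semantic_scores[sid] + (1 - alpha) * lexical_score(query, text)
        results.append((sid, round(score, 3)))
    return sorted(results, key=lambda r: r[1], reverse=True)


segments = {
    "a": "Young people report growing climate anxiety.",
    "b": "Eco-grief is common among coastal communities.",
    "c": "The climate of the region is mild.",
}
semantic_scores = {"a": 0.95, "b": 0.90, "c": 0.30}  # hypothetical values

ranked = combined_search("climate anxiety", segments, semantic_scores)
filtered = combined_search(
    "climate anxiety", segments, semantic_scores, pattern=r"\b[Cc]limate\b"
)
print(ranked)
print(filtered)
```

In this toy run, segment "b" still ranks above "c" without sharing a single query term, because its semantic score carries it. Adding the pattern filter removes it, since the regex requires the literal word "climate".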

We're excited to see what researchers discover with these new capabilities. Semantic search opens up entirely new workflows — from cross-lingual comparative studies to exploratory analysis of themes you didn't know to look for. Try it today at corpus-sense.app.

Semantic Search · Multilingual · Embeddings · Release

