API Reference

Word Embeddings

Alignment

Procrustes alignment for comparing word embeddings from different time periods.

class chronowords.alignment.procrustes.AlignmentMetrics(average_cosine_similarity: float, num_aligned_words: int, alignment_error: float)[source]

Bases: object

Container for alignment quality metrics.

Variables:

average_cosine_similarity (float) – Mean cosine similarity between aligned word pairs, in the range [-1, 1] (near 1.0 for a good alignment).
num_aligned_words (int) – Number of anchor words successfully aligned.
alignment_error (float) – Frobenius norm of the residual between the rotated source and target anchor matrices (>= 0).

Examples

>>> metrics = AlignmentMetrics(0.85, 1000, 0.15)
>>> metrics.average_cosine_similarity
0.85
>>> metrics.num_aligned_words
1000
>>> metrics.alignment_error
0.15
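
As an illustration of what the three fields measure, the following is a minimal sketch that computes them for two already aligned anchor matrices. The helper name alignment_metrics and the sample arrays are hypothetical and are not part of chronowords; the sketch also assumes no row is a zero vector.

```python
import numpy as np


def alignment_metrics(rotated_source, target):
    """Return (mean cosine similarity, number of rows, Frobenius residual)."""
    rotated_source = np.asarray(rotated_source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Row-wise cosine similarity between each aligned source/target pair.
    dots = np.sum(rotated_source * target, axis=1)
    norms = np.linalg.norm(rotated_source, axis=1) * np.linalg.norm(target, axis=1)
    cosines = dots / norms  # assumption: no zero-norm rows
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    residual = np.linalg.norm(rotated_source - target)
    return float(cosines.mean()), rotated_source.shape[0], float(residual)


rotated = np.array([[1.0, 0.0], [0.0, 1.0]])
target = np.array([[1.0, 0.0], [0.6, 0.8]])
avg_cos, n_words, error = alignment_metrics(rotated, target)
```

Here the first pair matches exactly and the second has cosine 0.8, so the mean is about 0.9 while the residual norm is greater than zero.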

__init__(average_cosine_similarity: float, num_aligned_words: int, alignment_error: float) → None

alignment_error: float

average_cosine_similarity: float

num_aligned_words: int

class chronowords.alignment.procrustes.ProcrustesAligner(min_freq_rank: int | None = None, max_freq_rank: int | None = 1000)[source]

Bases: object

Aligns word embeddings from different time periods using Procrustes analysis.

Finds the optimal orthogonal transformation that maps a source embedding space onto a target space while preserving distances, using shared vocabulary words as anchors. Call fit() before using transform() or get_word_similarity().

Examples

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> vocab = ["word1", "word2"]
>>> emb_1800 = np.array([[1.0, 0.0], [0.0, 1.0]])
>>> emb_1850 = np.array([[0.0, 1.0], [-1.0, 0.0]])
>>> _ = aligner.fit(emb_1800, emb_1850, vocab, vocab)
>>> aligned = aligner.transform(emb_1800)

__init__(min_freq_rank: int | None = None, max_freq_rank: int | None = 1000) → None[source]

Initialize the aligner.

Parameters:

min_freq_rank – Lower bound (inclusive) of the frequency-rank slice used to select anchor words. None means “from the start”.
max_freq_rank – Upper bound (exclusive) of the frequency-rank slice. None means “to the end”.

Note

Both arguments are used directly as list-slice bounds on the vocabularies in find_common_words(), which are assumed to be ordered by descending frequency. They are not validated; a min_freq_rank greater than max_freq_rank yields an empty anchor set and causes fit() to raise ValueError.

Examples

>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=10)
>>> aligner.min_freq_rank
0
>>> aligner.max_freq_rank
10

anchors: dict[str, tuple[int, int]]

find_common_words(source_vocab: list[str], target_vocab: list[str]) → list[str][source]

Find common words between source and target vocabularies.

Slices each vocabulary to the [min_freq_rank:max_freq_rank] rank window and returns the intersection, providing stable anchor words for alignment.

Parameters:

source_vocab – Source vocabulary, assumed ordered by descending frequency.
target_vocab – Target vocabulary, assumed ordered by descending frequency.

Returns:

The common words within the rank window, sorted alphabetically. Empty if the two windowed vocabularies share no words.

Examples

>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=2)
>>> source = ['the', 'in', 'a', 'rare']
>>> target = ['in', 'the', 'new', 'a']
>>> aligner.find_common_words(source, target)
['in', 'the']
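
The rank-window behaviour described above can be sketched with plain list slicing. This is an illustrative stand-in under the stated slicing semantics, not the library's code; the function name common_words_in_window is hypothetical.

```python
def common_words_in_window(source_vocab, target_vocab, min_rank=None, max_rank=1000):
    # Plain list slicing: a None bound means "from the start" / "to the end".
    source_window = set(source_vocab[min_rank:max_rank])
    target_window = set(target_vocab[min_rank:max_rank])
    # Intersection of the two windows, returned in alphabetical order.
    return sorted(source_window & target_window)


source = ["the", "in", "a", "rare"]
target = ["in", "the", "new", "a"]

top2 = common_words_in_window(source, target, 0, 2)
top3 = common_words_in_window(source, target, 0, 3)
inverted = common_words_in_window(source, target, 3, 1)  # min > max: empty slices
```

Widening the window to three ranks does not add 'a', because it sits at rank 2 in the source but rank 3 in the target. An inverted window produces empty slices and therefore an empty result.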

fit(source_embeddings: ndarray, target_embeddings: ndarray, source_vocab: list[str], target_vocab: list[str], anchor_words: list[str] | None = None) → AlignmentMetrics[source]

Learn the orthogonal transformation matrix using Procrustes analysis.

Selects anchor words, L2-normalises their source and target vectors, and solves for the orthogonal matrix that best maps source anchors onto target anchors. Sets orthogonal_matrix, source_words, target_words and anchors.

Parameters:

source_embeddings – Source-space embeddings, row-indexed by source_vocab.
target_embeddings – Target-space embeddings, row-indexed by target_vocab.
source_vocab – Vocabulary list for source_embeddings.
target_vocab – Vocabulary list for target_embeddings.
anchor_words – Specific words to align on. If None, common words filtered by frequency rank (find_common_words()) are used.

Returns:

AlignmentMetrics describing alignment quality.

Raises:

ValueError – If no anchor words are available (no common words, or an empty anchor_words), or if every candidate anchor is dropped because it is missing from one vocabulary or has a near-zero vector in either space.

Note

Preconditions:

Each embedding matrix must have a row for every entry in its vocabulary; a vocab/embedding length mismatch surfaces as an IndexError while gathering anchor vectors (not checked).
The two embedding spaces must share dimensionality, otherwise scipy.linalg.orthogonal_procrustes() raises (not caught).
Anchor words whose source or target vector is effectively zero are silently skipped.
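
The steps described for fit() (drop near-zero anchors, L2-normalise, solve for the orthogonal map) can be sketched with NumPy alone. This is a minimal sketch, not the library's implementation: the name fit_rotation and the eps threshold are assumptions, and the SVD-based solution shown here is the standard closed form for the orthogonal Procrustes problem.

```python
import numpy as np


def fit_rotation(source_anchors, target_anchors, eps=1e-12):
    source = np.asarray(source_anchors, dtype=float)
    target = np.asarray(target_anchors, dtype=float)
    # Skip anchor rows whose vector is effectively zero in either space.
    src_norm = np.linalg.norm(source, axis=1)
    tgt_norm = np.linalg.norm(target, axis=1)
    keep = (src_norm > eps) & (tgt_norm > eps)  # eps is an assumed threshold
    if not keep.any():
        raise ValueError("no usable anchor vectors")
    # L2-normalise the remaining rows.
    source = source[keep] / src_norm[keep, None]
    target = target[keep] / tgt_norm[keep, None]
    # Orthogonal Procrustes: the R minimising ||source @ R - target||_F
    # is U @ Vt, where U, S, Vt is the SVD of source.T @ target.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


source = np.array([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]])   # last row is a zero vector
target = np.array([[0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]])
rotation = fit_rotation(source, target)
```

The zero row is dropped before solving, and the recovered matrix is the 90 degree rotation that maps the two remaining source anchors onto their targets.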

Examples

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[0., 1.], [-1., 0.]])  # 90 degree rotation
>>> vocab = ['word1', 'word2']
>>> metrics = aligner.fit(source_emb, target_emb, vocab, vocab, ['word1', 'word2'])
>>> metrics.num_aligned_words
2
>>> round(metrics.average_cosine_similarity, 2)
1.0

get_word_similarity(word: str, source_emb: ndarray, target_emb: ndarray) → float | None[source]

Get similarity between word representations in source and target spaces.

Looks up word in both vocabularies, normalises its source and target vectors, rotates the source vector into the target space, and returns the cosine similarity.

Parameters:

word – Word to compare. Must be present in both source_words and target_words (populated by fit()).
source_emb – Source embeddings, row-indexed by source_words.
target_emb – Target embeddings, row-indexed by target_words.

Returns:

Cosine similarity in [-1, 1] between the aligned source vector and the target vector; higher means more similar usage across periods. None if word is absent from either vocabulary.

Raises:

AttributeError – If the aligner has not been fit (orthogonal_matrix is None). The error surfaces from the matrix multiply, which, unlike in transform(), is not guarded here.

Note

The source/target vectors are divided by their L2 norm with no zero-norm guard. A zero vector therefore produces nan/inf entries and only a RuntimeWarning, rather than returning None or raising an exception. See the project pre-mortem.

Examples

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> aligner.source_words = ['cat', 'dog']  # Set after initialization
>>> aligner.target_words = ['cat', 'dog']  # Set after initialization
>>> aligner.orthogonal_matrix = np.eye(2)
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[1., 0.], [0., 1.]])
>>> round(aligner.get_word_similarity('cat', source_emb, target_emb), 2)
1.0
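
Callers who want to avoid the unguarded zero-norm division can check norms themselves before comparing. The following is a hypothetical wrapper, not part of chronowords; safe_similarity and its eps parameter are assumptions made for illustration.

```python
import numpy as np


def safe_similarity(source_vec, target_vec, rotation, eps=1e-12):
    source_vec = np.asarray(source_vec, dtype=float)
    target_vec = np.asarray(target_vec, dtype=float)
    src_norm = np.linalg.norm(source_vec)
    tgt_norm = np.linalg.norm(target_vec)
    if src_norm < eps or tgt_norm < eps:
        return None  # a zero vector has no direction to compare
    # Normalise, rotate the source vector into the target space, take the dot product.
    aligned = (source_vec / src_norm) @ rotation
    return float(aligned @ (target_vec / tgt_norm))


rotation = np.eye(2)
same = safe_similarity([1.0, 0.0], [2.0, 0.0], rotation)
orthogonal = safe_similarity([1.0, 0.0], [0.0, 5.0], rotation)
degenerate = safe_similarity([0.0, 0.0], [1.0, 0.0], rotation)
```

Returning None for a degenerate vector mirrors how a missing word is reported, which keeps a single "no answer" signal for callers. That is a design choice of this sketch, not documented library behaviour.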

load(path: Path) → None[source]

Load aligner state from a pickle written by save().

Overwrites orthogonal_matrix, source_words, target_words, anchors and the frequency-rank bounds with the saved values.

Parameters:

path – File path to read the pickled state from.

Raises:

FileNotFoundError – If path does not exist.
KeyError – If the pickle is missing an expected key (e.g. it was not written by save()).

Warning

This method unpickles path. Unpickling executes arbitrary code embedded in the file, so only load aligner files you trust.

orthogonal_matrix: ndarray | None

save(path: Path) → None[source]

Save the aligner state to disk via pickle.

Persists orthogonal_matrix, source_words, target_words, anchors and the frequency-rank bounds so a later load() can restore a fitted aligner.

Parameters:

path – File path to write the pickled state to.

Raises:

OSError – If path cannot be opened for writing (propagated from open).

Note

Saving an unfitted aligner is allowed and writes orthogonal_matrix=None; reloading it yields an aligner that still needs fit().
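
The save/load round trip, including the unfitted case and the missing-key failure, can be sketched with the standard library. The dictionary layout and the key names in REQUIRED_KEYS are assumptions for illustration; the actual pickle layout used by chronowords is not specified here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical key names; the real saved state may differ.
REQUIRED_KEYS = ("orthogonal_matrix", "source_words", "target_words")


def save_state(state, path):
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_state(path):
    # Only call this on trusted files: unpickling can execute arbitrary code.
    with open(path, "rb") as handle:
        state = pickle.load(handle)
    for key in REQUIRED_KEYS:
        if key not in state:
            raise KeyError(key)  # the file was not written by save_state
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "aligner.pkl"
    # An unfitted state: the matrix is None and survives the round trip as None.
    save_state(
        {"orthogonal_matrix": None, "source_words": ["cat"], "target_words": ["cat"]},
        path,
    )
    restored = load_state(path)
```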

source_words: list[str]

target_words: list[str]

transform(embeddings: ndarray) → ndarray[source]

Apply the learned transformation to align embeddings.

Parameters:

embeddings – Embeddings to transform, with the same dimensionality as the space the aligner was fit on.

Returns:

The embeddings rotated into the target space (embeddings @ orthogonal_matrix).

Raises:

ValueError – If the aligner has not been fit yet (orthogonal_matrix is None).

Note

The column count of embeddings must match orthogonal_matrix; a mismatch raises ValueError from the matrix multiply (not checked explicitly).

Examples

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> # No need to set source_words/target_words since we're just testing transform
>>> aligner.orthogonal_matrix = np.array([[0, 1], [-1, 0]])  # 90 degree rotation
>>> embeddings = np.array([[1, 0], [0, 1]])
>>> aligned = aligner.transform(embeddings)
>>> np.allclose(aligned, np.array([[0, 1], [-1, 0]]))
True
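
Because the learned matrix is orthogonal, the transformation changes orientation but not geometry. The short NumPy check below illustrates this with a hand-written rotation and made-up embeddings; it does not call the library.

```python
import numpy as np

rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])  # 90 degree rotation
embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
aligned = embeddings @ rotation

# Vector lengths are unchanged by an orthogonal map.
norms_before = np.linalg.norm(embeddings, axis=1)
norms_after = np.linalg.norm(aligned, axis=1)

# Pairwise dot products (and hence cosine similarities) are unchanged too.
gram_before = embeddings @ embeddings.T
gram_after = aligned @ aligned.T
```

Multiplying an array with the wrong number of columns, for example np.ones((2, 3)) @ rotation, raises ValueError from NumPy's matrix multiply.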

Topic Modeling

Topic modeling using NMF on PPMI matrices with support for temporal alignment.

class chronowords.topics.nmf.AlignedTopic(source_topic: Topic, target_topic: Topic, similarity: float)[source]

Bases: object

Container for aligned topic pairs.

Variables:

source_topic (chronowords.topics.nmf.Topic) – Topic from the source time period.
target_topic (chronowords.topics.nmf.Topic) – Topic from the target time period.
similarity (float) – Cosine similarity between the two topic distributions, in the range [-1, 1] (typically [0, 1] for non-negative distributions).

Examples

>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic1 = Topic(1, [('cat', 0.5)], dist)
>>> topic2 = Topic(2, [('dog', 0.4)], dist)
>>> aligned = AlignedTopic(topic1, topic2, 0.8)
>>> aligned.source_topic.id
1
>>> aligned.target_topic.id
2
>>> aligned.similarity
0.8

__init__(source_topic: Topic, target_topic: Topic, similarity: float) → None

similarity: float

source_topic: Topic

target_topic: Topic

class chronowords.topics.nmf.Topic(id: int, words: list[tuple[str, float]], distribution: ndarray)[source]

Bases: object

Container for topic information.

Variables:

id (int) – Unique topic identifier.
words (list[tuple[str, float]]) – List of (word, weight) pairs for the top words, ordered by descending weight.
distribution (numpy.ndarray) – Full weight distribution over the vocabulary. Produced by TopicModel.fit() as a non-negative vector that sums to 1 (unless the raw NMF weights summed to 0, in which case it is left unnormalised). The dataclass does not enforce this.

Examples

>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic = Topic(1, [('cat', 0.5), ('dog', 0.3)], dist)
>>> topic.id
1
>>> topic.words
[('cat', 0.5), ('dog', 0.3)]
>>> np.allclose(topic.distribution, [0.5, 0.3, 0.2])
True

__init__(id: int, words: list[tuple[str, float]], distribution: ndarray) → None

distribution: ndarray

id: int

words: list[tuple[str, float]]

class chronowords.topics.nmf.TopicModel(n_topics: int = 10, max_iter: int = 500, min_similarity: float = 0.1)[source]

Bases: object

Topic model using NMF on PPMI matrices.

Supports temporal alignment of topics between different time periods.

__init__(n_topics: int = 10, max_iter: int = 500, min_similarity: float = 0.1) → None[source]

Initialize topic model.

Parameters:

n_topics – Number of topics (NMF components) to extract. Must not exceed the smaller dimension of the matrix passed to fit(), or sklearn’s NMF raises.
max_iter – Maximum number of NMF iterations.
min_similarity – Minimum cosine similarity for a pair to be kept by align_with().

Note

Arguments are passed to sklearn.decomposition.NMF unvalidated; invalid values (e.g. n_topics <= 0) surface as errors from sklearn during fit(), not here.

Examples

>>> model = TopicModel(n_topics=5, max_iter=100)
>>> model.n_topics
5
>>> model.max_iter
100

_align_distributions(topic1: Topic, topic2: Topic, vocab1: list[str], vocab2: list[str]) → tuple[ndarray, ndarray][source]

Align two topic distributions to use the same vocabulary space.

Projects both topics onto the sorted union of vocab1 and vocab2 (missing words get weight 0), then renormalises each to sum to 1.

Parameters:

topic1 – First topic. topic1.distribution must be indexable by vocab1 positions.
topic2 – Second topic. topic2.distribution must be indexable by vocab2 positions.
vocab1 – Vocabulary for topic1.
vocab2 – Vocabulary for topic2.

Returns:

Two distributions of equal length (the size of the unified vocabulary), each renormalised to sum to 1 unless it was all-zero.

Note

A distribution that is shorter than its vocabulary raises IndexError while gathering values (not checked).

Examples

>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([0.6, 0.4])
>>> dist2 = np.array([0.3, 0.7])
>>> t1 = Topic(1, [('cat', 0.6), ('dog', 0.4)], dist1)
>>> t2 = Topic(2, [('dog', 0.3), ('bird', 0.7)], dist2)
>>> aligned1, aligned2 = model._align_distributions(
...     t1, t2, ['cat', 'dog'], ['dog', 'bird']
... )
>>> len(aligned1) == len(aligned2)  # Same length after alignment
True
>>> np.allclose(aligned1.sum(), 1.0)  # Still normalized
True
>>> np.allclose(aligned2.sum(), 1.0)
True
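
The projection onto the sorted union vocabulary can be sketched in plain Python. This is an illustrative stand-in, not the method's actual code; align_to_union is a hypothetical name, and unlike the documented method it silently truncates a short distribution instead of raising IndexError.

```python
def align_to_union(dist1, vocab1, dist2, vocab2):
    # Unified vocabulary: sorted union of both word lists.
    unified = sorted(set(vocab1) | set(vocab2))

    def project(dist, vocab):
        # zip pairs each word with its weight (truncating if lengths differ).
        weights = dict(zip(vocab, dist))
        # Words missing from this vocabulary get weight 0.
        values = [float(weights.get(word, 0.0)) for word in unified]
        total = sum(values)
        # Renormalise to sum to 1 unless the projection is all-zero.
        return [v / total for v in values] if total > 0 else values

    return unified, project(dist1, vocab1), project(dist2, vocab2)


unified, aligned1, aligned2 = align_to_union(
    [0.6, 0.4], ["cat", "dog"], [0.3, 0.7], ["dog", "bird"]
)
```

With these inputs the unified vocabulary is bird, cat, dog, so the first topic has zero weight on bird and the second has zero weight on cat.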

_compute_topic_similarity(topic1: Topic, topic2: Topic) → float[source]

Compute cosine similarity between topic distributions.

Both topics are aligned against self.vocabulary (so this assumes both come from this model’s vocabulary), then compared.

Parameters:

topic1 – First topic.
topic2 – Second topic.

Returns:

Cosine similarity in [-1, 1]. Returns 0.0 if either aligned distribution is all-zero, if the result is NaN, or if any exception is raised during the computation.

Note

The computation is wrapped in a broad except Exception that maps any failure to 0.0, so a genuine error is indistinguishable from a true zero similarity. See the project pre-mortem.

Examples

>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([1, 0])
>>> dist2 = np.array([0, 1])
>>> t1 = Topic(1, [('cat', 1.0)], dist1)
>>> t2 = Topic(2, [('dog', 1.0)], dist2)
>>> sim = model._compute_topic_similarity(t1, t2)
>>> round(sim, 1)
0.0

align_with(other: TopicModel) → list[AlignedTopic][source]

Align topics with another model using the Hungarian algorithm.

Builds a topic-by-topic cosine-distance cost matrix over the unified vocabulary, finds the optimal one-to-one matching with scipy.optimize.linear_sum_assignment(), and keeps pairs whose similarity is at least min_similarity.

Parameters:

other – Another fitted TopicModel.

Returns:

Matched AlignedTopic pairs with similarity >= min_similarity, sorted by descending similarity. May be empty if no pair clears the threshold.

Raises:

ValueError – If either model has not been fit (self.topics or other.topics is empty).

Note

Each topic’s distribution is assumed indexable by its model’s vocabulary. Unlike _compute_topic_similarity(), the cosine call here is not guarded, so an all-zero distribution can yield a NaN cost entry.

Examples

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model1 = TopicModel(n_topics=2)
>>> model2 = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model1.fit(ppmi, ['word1', 'word2'])
>>> model2.fit(ppmi, ['word1', 'word2'])
>>> aligned = model1.align_with(model2)
>>> len(aligned) > 0
True
>>> isinstance(aligned[0], AlignedTopic)
True
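
The matching step can be sketched directly on distribution arrays that already share a vocabulary. This is a minimal sketch rather than the library's code: match_topics and the sample distributions are hypothetical, and the sketch assumes no distribution is all-zero (which would produce NaN costs, as noted above).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_topics(source_dists, target_dists, min_similarity=0.1):
    source = np.asarray(source_dists, dtype=float)
    target = np.asarray(target_dists, dtype=float)
    # Cosine similarity between every source/target pair (assumes no all-zero rows).
    source_unit = source / np.linalg.norm(source, axis=1, keepdims=True)
    target_unit = target / np.linalg.norm(target, axis=1, keepdims=True)
    similarity = source_unit @ target_unit.T
    # Hungarian algorithm on cosine distance gives the optimal one-to-one matching.
    rows, cols = linear_sum_assignment(1.0 - similarity)
    pairs = [
        (int(r), int(c), float(similarity[r, c]))
        for r, c in zip(rows, cols)
        if similarity[r, c] >= min_similarity
    ]
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


source = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
target = [[0.0, 0.1, 0.9], [0.8, 0.2, 0.0]]
pairs = match_topics(source, target)
```

The two source topics are matched crosswise here, since each is far closer to the target topic at the other index.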

fit(ppmi_matrix: csr_matrix, vocabulary: list[str], top_n_words: int = 10) → None[source]

Fit topic model to PPMI matrix.

Runs NMF on ppmi_matrix, then builds one Topic per component with a normalised weight distribution and its top words. Populates vocabulary, topic_word_matrix and topics.

Parameters:

ppmi_matrix – Non-negative (sparse) PPMI matrix. Its number of columns must equal len(vocabulary).
vocabulary – Words corresponding to the matrix columns.
top_n_words – Number of top words to store per topic.

Raises:

ValueError – From sklearn.decomposition.NMF if n_topics exceeds the matrix dimensions or the matrix contains negative entries (PPMI is non-negative, so the latter normally cannot happen).
IndexError – If len(vocabulary) is smaller than the number of matrix columns (implicit, when indexing vocabulary[idx] for top words). Not checked explicitly.

Note

For any topic whose raw NMF weights sum to 0, the distribution is left unnormalised (it stays all-zero) rather than raising, so that topic’s distribution will not sum to 1.

Examples

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> len(model.topics)
2
>>> isinstance(model.topics[0], Topic)
True
>>> len(model.vocabulary)
2
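
The post-processing that turns NMF components into topics (normalise each row, then pick the top words) can be sketched without running NMF. The function build_topics, the tuple layout standing in for Topic, and the sample matrix are all hypothetical; this is not the library's code.

```python
import numpy as np


def build_topics(components, vocabulary, top_n_words=10):
    topics = []
    for topic_id, weights in enumerate(np.asarray(components, dtype=float)):
        total = weights.sum()
        # Normalise to sum to 1; an all-zero row is left as it is.
        distribution = weights / total if total > 0 else weights
        # Indices of the largest weights, in descending order.
        top_idx = np.argsort(distribution)[::-1][:top_n_words]
        words = [(vocabulary[i], float(distribution[i])) for i in top_idx]
        topics.append((topic_id, words, distribution))
    return topics


components = np.array([[3.0, 2.0, 1.0], [0.0, 0.0, 0.0]])
topics = build_topics(components, ["cat", "dog", "bird"], top_n_words=2)
```

The second row shows the edge case from the note above: its distribution stays all-zero and does not sum to 1.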

get_document_topics(doc_vector: ndarray, threshold: float = 0.1) → list[tuple[int, float]][source]

Get topic distribution for a document vector.

Parameters:

doc_vector – Document vector in vocabulary space. Its length must match the feature dimension the model was fit on.
threshold – Minimum topic proportion to include (strict >).

Returns:

(topic_id, weight) pairs whose weight strictly exceeds threshold, sorted by descending weight. May be empty.

Raises:

ValueError – If the model has not been fit (topic_word_matrix is None); this is checked explicitly.

Note

If the projected topic weights sum to 0, they are returned unnormalised rather than raising. doc_vector of the wrong length raises from sklearn.decomposition.NMF.transform() (not checked here).

Examples

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> doc = np.array([0.8, 0.2])
>>> topics = model.get_document_topics(doc, threshold=0.1)
>>> len(topics) > 0
True
>>> all(w >= 0.1 for _, w in topics)
True
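
The normalise-then-filter step, including the strict comparison against threshold, can be sketched in plain Python. The function topics_above_threshold and the sample weights are hypothetical and stand in for the projected topic weights.

```python
def topics_above_threshold(weights, threshold=0.1):
    total = sum(weights)
    # Normalise to proportions unless the weights sum to 0.
    proportions = [w / total for w in weights] if total > 0 else list(weights)
    # Strict comparison: a proportion exactly equal to the threshold is dropped.
    kept = [(topic_id, p) for topic_id, p in enumerate(proportions) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


result = topics_above_threshold([1.0, 7.0, 2.0], threshold=0.1)
```

Topic 0 has a proportion of exactly 0.1 and is therefore excluded, leaving topics 1 and 2 in descending order of weight.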

print_topics(top_n: int = 10) → None[source]

Print top words for each topic.

Parameters:

top_n – Maximum number of top words to print per topic.

Note

Prints to stdout and returns None. If the model has not been fit, prints an advisory message instead of raising.

Examples

>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=1)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> model.print_topics(top_n=2)

Topic 0:
  word...: 1.0000
  word...: 0.0000

topic_word_matrix: ndarray | None

topics: list[Topic]

vocabulary: list[str]

Utilities

Count-Min Sketch

class chronowords.utils.probabilistic_counter.CountMinSketch(width: int = 1000000, depth: int = 5, seed: int = 42, track_keys: bool = True)[source]

Bases: object

Count-Min Sketch implementation for memory-efficient counting.

Uses depth hash functions over width counters each to approximate item frequencies in fixed memory. Queries never underestimate the true count; they may overestimate it due to hash collisions.

Memory usage: width * depth * 4 bytes (int32 counters).
Error bound: a query overestimates the true count by at most (2 / width) * total, with probability at least 1 - 1 / 2**depth.
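
As a worked example of the two formulas, the snippet below evaluates them for the default width and depth. The stream size of 50 million items is a made-up figure used only for the arithmetic, and sketch_budget is a hypothetical helper.

```python
def sketch_budget(width, depth, total):
    memory_bytes = width * depth * 4        # int32 counters
    max_overestimate = (2 / width) * total  # error bound on a single query
    probability = 1 - 1 / 2 ** depth        # chance the bound holds
    return memory_bytes, max_overestimate, probability


memory_bytes, max_overestimate, probability = sketch_budget(
    width=1_000_000, depth=5, total=50_000_000
)
```

With the defaults the table takes 20,000,000 bytes, and for this hypothetical stream a query is off by at most about 100 counts with probability at least 0.96875.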

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.width
1000
>>> cms.depth
5

__init__(width: int = 1000000, depth: int = 5, seed: int = 42, track_keys: bool = True)[source]

Initialize Count-Min Sketch.

Parameters:

width – Number of counters per hash function (controls accuracy). Must be a positive integer.
depth – Number of hash functions / rows (controls the probability bound). Must be a positive integer.
seed – Seed for deriving the per-row hash seeds; fixes the sketch’s hashing so that two sketches with the same seed (and width/depth) are merge-compatible.
track_keys – Whether to record observed keys so get_heavy_hitters() can enumerate them. Disable to save memory; get_heavy_hitters() then raises.

Note

Arguments are not validated. width/depth must be positive or the underlying numpy.zeros((depth, width)) allocation fails.

_hash_indices(key: bytes) → ndarray[source]

Compute hash indices for all rows at once.

property arrays: tuple[ndarray, list[int], int]

Get raw arrays and parameters for the Cython PPMI kernel.

Returns:

A tuple (counts, seeds, width) exposing the internal count table (shape (depth, width)), the per-row hash seeds, and the table width. These are the inputs PPMIComputer needs to re-query the sketch.

Examples

>>> cms = CountMinSketch(width=3, depth=2, seed=42)
>>> counts, seeds, width = cms.arrays
>>> counts.shape
(2, 3)
>>> isinstance(seeds, list)
True
>>> width
3

estimate_error(confidence: float = 0.95) → float[source]

Estimate the maximum counting error.

Parameters:

confidence – Intended confidence level for the bound.

Returns:

The expected maximum overestimate, (2 / width) * total.

Note

The confidence argument currently has no effect on the returned value: an internal delta term is computed from confidence but discarded before the return. The result depends only on width and total. Flagged in the project pre-mortem; kept as-is to preserve behaviour.

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> for _ in range(1000):
...     cms.update("word")
>>> error = cms.estimate_error(confidence=0.95)
>>> error > 0  # Should have some error estimation
True
>>> error < cms.total  # Error should be less than total counts
True

get_heavy_hitters(threshold: float) → list[tuple[str, int]][source]

Get items that appear more than threshold * total times.

Parameters:

threshold – Minimum frequency as a fraction of the total count, normally in (0, 1). Not validated; the comparison threshold is int(total * threshold) (truncated toward zero).

Returns:

(item, count) pairs whose estimated count is strictly greater than int(total * threshold), sorted by descending count. Counts are CMS estimates, so a returned count may overestimate the true value (and a borderline item may be a false positive), but no genuine heavy hitter is missed.

Raises:

RuntimeError – If the sketch was created with track_keys=False, since observed keys are then not retained.

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> # Add a frequent word
>>> for _ in range(100):
...     cms.update("frequent")
>>> # Add some less frequent words
>>> for _ in range(10):
...     cms.update("rare")
>>> heavy = cms.get_heavy_hitters(threshold=0.05)  # 5% threshold
>>> len(heavy) > 0
True
>>> "frequent" == heavy[0][0]  # Most frequent word
True
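
The effect of the truncated cutoff can be seen with a plain dictionary standing in for the sketch's estimates. The function heavy_hitters below is a hypothetical illustration of the comparison rule only; it does not use CountMinSketch.

```python
def heavy_hitters(estimates, total, threshold):
    # The cutoff is truncated toward zero before the strict comparison.
    cutoff = int(total * threshold)
    hits = [(item, count) for item, count in estimates.items() if count > cutoff]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


estimates = {"frequent": 100, "rare": 10}
at_5_percent = heavy_hitters(estimates, total=110, threshold=0.05)
at_10_percent = heavy_hitters(estimates, total=110, threshold=0.10)
```

At 5% the cutoff is int(5.5) = 5, so "rare" with 10 occurrences also qualifies. At 10% the cutoff is 11 and only "frequent" remains.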

merge(other: CountMinSketch) → None[source]

Merge another sketch into this one, in place.

Adds other’s counters and total into self and unions the tracked keys. Because both sketches share hashing parameters, the result is identical to a single sketch built from the concatenation of the two input streams.

Parameters:

other – Another sketch with the same width, depth and derived seeds as self.

Raises:

ValueError – If other is not merge-compatible (differing width, depth or seeds).

Examples

>>> cms1 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms2 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms1.update("word", count=3)
>>> cms2.update("word", count=2)
>>> cms1.merge(cms2)
>>> cms1.query("word")
5
>>> cms1.total
5

>>> # Error case - incompatible sketches
>>> cms3 = CountMinSketch(width=500, depth=5, seed=42)
>>> cms1.merge(cms3)
Traceback (most recent call last):
ValueError: Can only merge compatible sketches

query(key: str | bytes) → int[source]

Query the estimated count for a key.

Parameters:

key – Item to look up (str is UTF-8 encoded; bytes used as-is).

Returns:

The minimum counter across rows, which is the Count-Min Sketch estimate. This never underestimates the true count and returns 0 for an unseen key (barring collisions).

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("rare_word")
>>> cms.query("rare_word")
1
>>> cms.query("unseen_word")
0

update(key: str | bytes, count: int = 1) → None[source]

Update count for a key.

Parameters:

key – Item to count. str keys are UTF-8 encoded; bytes keys are used as-is (and decoded for key tracking).
count – Amount to increment by (default 1). Added to total and to each row counter as-is; no positivity check is performed.

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("apple")
>>> cms.update("apple")
>>> cms.query("apple")
2
>>> cms.update("banana", count=5)
>>> cms.query("banana")
5
>>> cms.total
7
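
To make the update, query and merge semantics concrete, here is a minimal pure-Python sketch of a Count-Min Sketch. It is not the chronowords implementation: the class TinySketch is hypothetical, and the use of salted BLAKE2b as the per-row hash is an assumption chosen for simplicity.

```python
import hashlib


class TinySketch:
    def __init__(self, width=1000, depth=5, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0

    def _indices(self, key):
        data = key.encode("utf-8") if isinstance(key, str) else key
        for row in range(self.depth):
            # One salted hash per row (assumed scheme, derived from the seed).
            salt = (self.seed + row).to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Increment one counter in every row.
        for row, col in self._indices(key):
            self.table[row][col] += count
        self.total += count

    def query(self, key):
        # The minimum across rows is the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._indices(key))

    def merge(self, other):
        if (self.width, self.depth, self.seed) != (other.width, other.depth, other.seed):
            raise ValueError("Can only merge compatible sketches")
        # Counter-wise addition equals sketching the concatenated streams.
        for mine, theirs in zip(self.table, other.table):
            for col, value in enumerate(theirs):
                mine[col] += value
        self.total += other.total


first = TinySketch()
second = TinySketch()
first.update("word", count=3)
second.update("word", count=2)
first.merge(second)
merged_count = first.query("word")
```

Merging requires identical hashing parameters because counters are added position by position; with different seeds the same key would land in different columns and the sums would be meaningless.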
