API Reference

Word Embeddings

Alignment

Procrustes alignment for comparing word embeddings from different time periods.

class chronowords.alignment.procrustes.AlignmentMetrics(average_cosine_similarity, num_aligned_words, alignment_error)[source]

Bases: object

Container for alignment quality metrics.

Fields:

average_cosine_similarity: Mean cosine similarity between aligned word pairs
num_aligned_words: Number of words successfully aligned
alignment_error: Frobenius norm of the difference between aligned matrices

Examples

>>> metrics = AlignmentMetrics(0.85, 1000, 0.15)
>>> metrics.average_cosine_similarity
0.85
>>> metrics.num_aligned_words
1000
>>> metrics.alignment_error
0.15
__init__(average_cosine_similarity, num_aligned_words, alignment_error)
alignment_error: float
average_cosine_similarity: float
num_aligned_words: int
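
As a rough illustration of what these three fields measure, the sketch below computes them from two already-aligned matrices given as plain lists of rows. The helper name `alignment_metrics` is hypothetical and not part of chronowords, and the treatment of zero vectors is an assumption; the library's own computation may differ in detail.

```python
import math


def alignment_metrics(aligned_source, target):
    """Hypothetical helper: returns (average cosine, row count, Frobenius error)."""
    cosines = []
    squared_error = 0.0
    for src_row, tgt_row in zip(aligned_source, target):
        dot = sum(s * t for s, t in zip(src_row, tgt_row))
        src_norm = math.sqrt(sum(s * s for s in src_row))
        tgt_norm = math.sqrt(sum(t * t for t in tgt_row))
        # Assumption: a zero vector contributes a similarity of 0.0.
        cosines.append(dot / (src_norm * tgt_norm) if src_norm and tgt_norm else 0.0)
        # Accumulate squared element-wise differences for the Frobenius norm.
        squared_error += sum((s - t) ** 2 for s, t in zip(src_row, tgt_row))
    average = sum(cosines) / len(cosines) if cosines else 0.0
    return average, len(cosines), math.sqrt(squared_error)


avg_cos, n_words, error = alignment_metrics(
    [[1.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [2.0, 0.0]],
)
print(avg_cos, n_words, round(error, 3))  # 0.5 2 2.828
```

The first row pair is identical (cosine 1.0) and the second is orthogonal (cosine 0.0), which gives the average of 0.5.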
class chronowords.alignment.procrustes.ProcrustesAligner(min_freq_rank=None, max_freq_rank=1000)[source]

Bases: object

Aligns word embeddings from different time periods using Procrustes analysis.

Finds the optimal orthogonal transformation to align the embeddings while preserving distances.

Example:

aligner = ProcrustesAligner()
metrics = aligner.fit(
    embeddings_1800, embeddings_1850, vocab_1800, vocab_1850
)
aligned_embeddings = aligner.transform(embeddings_1800)

__init__(min_freq_rank=None, max_freq_rank=1000)[source]

Initialize the aligner.

Args:

min_freq_rank: Minimum frequency rank for anchor words
max_freq_rank: Maximum frequency rank for anchor words

Examples:

>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=10)
>>> aligner.min_freq_rank
0
>>> aligner.max_freq_rank
10
anchors: dict[str, tuple[int, int]]
find_common_words(source_vocab, target_vocab)[source]

Find common words between source and target vocabularies.

Uses frequency rank filtering (min_freq_rank to max_freq_rank) to select stable anchor words for alignment.

Return type:

list[str]

Returns

List of common words sorted alphabetically

Examples

>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=2)
>>> source = ['the', 'in', 'a', 'rare']
>>> target = ['in', 'the', 'new', 'a']
>>> aligner.find_common_words(source, target)
['in', 'the']
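
One way to read the frequency-rank filter is sketched below. It assumes that each vocabulary list is ordered from most to least frequent, that the rank window is half-open, and that `min_freq_rank=None` means "start at rank 0". These are assumptions chosen to match the example above, and `common_anchor_words` is a hypothetical name, not the library's implementation.

```python
def common_anchor_words(source_vocab, target_vocab, min_freq_rank=None, max_freq_rank=1000):
    """Hypothetical helper: words inside the rank window of both vocabularies."""
    start = min_freq_rank or 0
    # Assumption: list position equals frequency rank (0 = most frequent).
    source_window = set(source_vocab[start:max_freq_rank])
    target_window = set(target_vocab[start:max_freq_rank])
    # Keep only words present in both windows, sorted alphabetically.
    return sorted(source_window & target_window)


source = ['the', 'in', 'a', 'rare']
target = ['in', 'the', 'new', 'a']
print(common_anchor_words(source, target, min_freq_rank=0, max_freq_rank=2))  # ['in', 'the']
```

Widening the window to `max_freq_rank=4` would also admit 'a', which both lists contain at a lower rank.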
fit(source_embeddings, target_embeddings, source_vocab, target_vocab, anchor_words=None)[source]

Learn the orthogonal transformation matrix using Procrustes analysis.

Return type:

AlignmentMetrics

Args:

source_embeddings: Source space word embeddings
target_embeddings: Target space word embeddings
source_vocab: Vocabulary list for source embeddings
target_vocab: Vocabulary list for target embeddings
anchor_words: Optional list of specific words to use for alignment. If None, uses common words filtered by frequency rank.

Returns:

AlignmentMetrics containing quality measures of the alignment

Examples:

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[0., 1.], [-1., 0.]])  # 90 degree rotation
>>> vocab = ['word1', 'word2']
>>> metrics = aligner.fit(source_emb, target_emb, vocab, vocab, ['word1', 'word2'])
>>> metrics.num_aligned_words
2
>>> round(metrics.average_cosine_similarity, 2)
1.0
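
For readers unfamiliar with the technique, the textbook solution to the orthogonal Procrustes problem can be sketched with NumPy as follows. The function name `orthogonal_procrustes_fit` is hypothetical, and whether `fit` normalizes or centers the anchor vectors first is not stated here, so this is a sketch of the underlying math rather than of the library's code.

```python
import numpy as np


def orthogonal_procrustes_fit(source_anchors, target_anchors):
    """Return the orthogonal W minimizing ||source_anchors @ W - target_anchors||_F."""
    # SVD of the cross-covariance matrix; the minimizer is the product U @ Vt.
    u, _, vt = np.linalg.svd(source_anchors.T @ target_anchors)
    return u @ vt


# Three anchor vectors, and the same vectors rotated by 30 degrees.
theta = np.pi / 6
rotation = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
source = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = source @ rotation

w = orthogonal_procrustes_fit(source, target)
print(np.allclose(w, rotation))         # True
print(np.allclose(w.T @ w, np.eye(2)))  # True
```

Because the recovered matrix is orthogonal, applying it changes neither vector lengths nor the angles between vectors in the source space.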
get_word_similarity(word, source_emb, target_emb)[source]

Get similarity between word representations in source and target spaces.

Args:

word: Word to compare
source_emb: Source embeddings
target_emb: Target embeddings

Returns:

Cosine similarity in [-1, 1] between the aligned vectors; higher values indicate more similar usage between periods. Returns None if the word is missing from either vocabulary.

Examples:

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> aligner.source_words = ['cat', 'dog']  # Set after initialization
>>> aligner.target_words = ['cat', 'dog']  # Set after initialization
>>> aligner.orthogonal_matrix = np.eye(2)
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[1., 0.], [0., 1.]])
>>> round(aligner.get_word_similarity('cat', source_emb, target_emb), 2)
1.0
load(path)[source]

Load the aligner state.

Return type:

None

orthogonal_matrix: ndarray | None
save(path)[source]

Save the aligner state.

Return type:

None

source_words: list[str]
target_words: list[str]
transform(embeddings)[source]

Apply the learned transformation to align embeddings.

Return type:

ndarray

Args:

embeddings: Embeddings to transform

Returns:

Transformed embeddings in the target space

Examples:

>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> # No need to set source_words/target_words since we're just testing transform
>>> aligner.orthogonal_matrix = np.array([[0, 1], [-1, 0]])  # 90 degree rotation
>>> embeddings = np.array([[1, 0], [0, 1]])
>>> aligned = aligner.transform(embeddings)
>>> np.allclose(aligned, np.array([[0, 1], [-1, 0]]))
True

Topic Modeling

Topic modeling using NMF on PPMI matrices with support for temporal alignment.

class chronowords.topics.nmf.AlignedTopic(source_topic, target_topic, similarity)[source]

Bases: object

Container for aligned topic pairs.

Fields:

source_topic: Topic from source time period
target_topic: Topic from target time period
similarity: Cosine similarity between topics

Examples

>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic1 = Topic(1, [('cat', 0.5)], dist)
>>> topic2 = Topic(2, [('dog', 0.4)], dist)
>>> aligned = AlignedTopic(topic1, topic2, 0.8)
>>> aligned.source_topic.id
1
>>> aligned.target_topic.id
2
>>> aligned.similarity
0.8
__init__(source_topic, target_topic, similarity)
similarity: float
source_topic: Topic
target_topic: Topic
class chronowords.topics.nmf.Topic(id, words, distribution)[source]

Bases: object

Container for topic information.

Fields:

id: Unique topic identifier
words: List of (word, weight) pairs for top words
distribution: Full probability distribution over vocabulary

Examples

>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic = Topic(1, [('cat', 0.5), ('dog', 0.3)], dist)
>>> topic.id
1
>>> topic.words
[('cat', 0.5), ('dog', 0.3)]
>>> np.allclose(topic.distribution, [0.5, 0.3, 0.2])
True
__init__(id, words, distribution)
distribution: ndarray
id: int
words: list[tuple[str, float]]
class chronowords.topics.nmf.TopicModel(n_topics=10, max_iter=500, min_similarity=0.1)[source]

Bases: object

Topic model using NMF on PPMI matrices.

Supports temporal alignment of topics between different time periods.

__init__(n_topics=10, max_iter=500, min_similarity=0.1)[source]

Initialize topic model.

Args:

n_topics: Number of topics to extract
max_iter: Maximum number of iterations for NMF
min_similarity: Minimum similarity for topic alignment

Examples:

>>> model = TopicModel(n_topics=5, max_iter=100)
>>> model.n_topics
5
>>> model.max_iter
100
_align_distributions(topic1, topic2, vocab1, vocab2)[source]

Align two topic distributions to use the same vocabulary space.

Return type:

tuple[ndarray, ndarray]

Args:

topic1: First topic
topic2: Second topic
vocab1: Vocabulary for first topic
vocab2: Vocabulary for second topic

Returns:

Tuple of aligned distributions

Examples:

>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([0.6, 0.4])
>>> dist2 = np.array([0.3, 0.7])
>>> t1 = Topic(1, [('cat', 0.6), ('dog', 0.4)], dist1)
>>> t2 = Topic(2, [('dog', 0.3), ('bird', 0.7)], dist2)
>>> aligned1, aligned2 = model._align_distributions(
...     t1, t2, ['cat', 'dog'], ['dog', 'bird']
... )
>>> len(aligned1) == len(aligned2)  # Same length after alignment
True
>>> np.allclose(aligned1.sum(), 1.0)  # Still normalized
True
>>> np.allclose(aligned2.sum(), 1.0)
True
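
A plausible way to bring two distributions into one vocabulary space is sketched below: project both onto the union of the two vocabularies, fill missing words with zero, and renormalize. The use of the union (rather than the intersection) and the name `align_distributions` are assumptions made for illustration.

```python
def align_distributions(dist1, vocab1, dist2, vocab2):
    """Hypothetical helper: re-index both distributions over a shared vocabulary."""
    shared_vocab = sorted(set(vocab1) | set(vocab2))
    index = {word: i for i, word in enumerate(shared_vocab)}

    def project(dist, vocab):
        out = [0.0] * len(shared_vocab)
        for word, weight in zip(vocab, dist):
            out[index[word]] += weight
        total = sum(out)
        # Renormalize so the result is still a probability distribution.
        return [x / total for x in out] if total > 0 else out

    return shared_vocab, project(dist1, vocab1), project(dist2, vocab2)


shared, aligned1, aligned2 = align_distributions(
    [0.6, 0.4], ['cat', 'dog'], [0.3, 0.7], ['dog', 'bird']
)
print(shared)                            # ['bird', 'cat', 'dog']
print([round(x, 2) for x in aligned1])   # [0.0, 0.6, 0.4]
print([round(x, 2) for x in aligned2])   # [0.7, 0.0, 0.3]
```

Once both vectors share the same index, a similarity measure such as cosine similarity can be applied position by position.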
_compute_topic_similarity(topic1, topic2)[source]

Compute cosine similarity between topic distributions.

Return type:

float

Args:

topic1: First topic
topic2: Second topic

Returns:

Cosine similarity between the topics

Examples:

>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([1, 0])
>>> dist2 = np.array([0, 1])
>>> t1 = Topic(1, [('cat', 1.0)], dist1)
>>> t2 = Topic(2, [('dog', 1.0)], dist2)
>>> sim = model._compute_topic_similarity(t1, t2)
>>> round(sim, 1)
0.0
align_with(other)[source]

Align topics with another model using the Hungarian algorithm.

Return type:

list[AlignedTopic]

Args:

other: Another fitted TopicModel

Returns:

List of aligned topic pairs sorted by similarity

Raises:

ValueError: If either model is not fitted

Examples:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model1 = TopicModel(n_topics=2)
>>> model2 = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model1.fit(ppmi, ['word1', 'word2'])
>>> model2.fit(ppmi, ['word1', 'word2'])
>>> aligned = model1.align_with(model2)
>>> len(aligned) > 0
True
>>> isinstance(aligned[0], AlignedTopic)
True
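
The matching step can be pictured as an assignment problem over a topic-by-topic similarity matrix. The sketch below is not the Hungarian algorithm itself: it swaps in a brute-force search over permutations, which finds the same optimal one-to-one matching but is only practical for a handful of topics. The helper name `align_topics`, and the assumption that pairs below `min_similarity` are dropped after matching, are illustrative.

```python
from itertools import permutations


def align_topics(similarity, min_similarity=0.1):
    """Hypothetical helper: one-to-one matching that maximizes total similarity.

    Requires at least as many target topics (columns) as source topics (rows).
    """
    n_source, n_target = len(similarity), len(similarity[0])
    # Brute-force stand-in for the Hungarian algorithm: try every assignment.
    best = max(
        permutations(range(n_target), n_source),
        key=lambda cols: sum(similarity[r][c] for r, c in enumerate(cols)),
    )
    pairs = [(r, c, similarity[r][c]) for r, c in enumerate(best)
             if similarity[r][c] >= min_similarity]
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.05, 0.8],
    [0.3, 0.6, 0.1],
]
print(align_topics(similarity))
# [(0, 0, 0.9), (1, 2, 0.8), (2, 1, 0.6)]
```

For realistic topic counts, an implementation would typically rely on a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.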
fit(ppmi_matrix, vocabulary, top_n_words=10)[source]

Fit topic model to PPMI matrix.

Return type:

None

Args:

ppmi_matrix: Sparse PPMI matrix from word embeddings
vocabulary: List of words corresponding to matrix columns
top_n_words: Number of top words to store per topic

Examples:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> len(model.topics)
2
>>> isinstance(model.topics[0], Topic)
True
>>> len(model.vocabulary)
2
get_document_topics(doc_vector, threshold=0.1)[source]

Get topic distribution for a document vector.

Return type:

list[tuple[int, float]]

Args:

doc_vector: Document vector in vocabulary space
threshold: Minimum topic proportion to include

Returns:

List of (topic_id, weight) pairs above threshold, sorted by weight

Examples:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> doc = np.array([0.8, 0.2])
>>> topics = model.get_document_topics(doc, threshold=0.1)
>>> len(topics) > 0
True
>>> all(w >= 0.1 for _, w in topics)
True
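
How the document vector is mapped to topic weights is not spelled out above. The sketch below shows one simple possibility, offered purely as an assumption: score each topic by the dot product of its word weights with the document vector, normalize the scores to proportions, then apply the threshold and sort. The library may well use a different projection (for example an NMF transform), and `document_topics` is a hypothetical name.

```python
def document_topics(doc_vector, topic_word_matrix, threshold=0.1):
    """Hypothetical helper: (topic_id, proportion) pairs at or above the threshold."""
    # Assumption: a topic's raw score is the dot product with the document vector.
    scores = [sum(w * x for w, x in zip(row, doc_vector)) for row in topic_word_matrix]
    total = sum(scores)
    if total == 0:
        return []
    proportions = [(topic_id, score / total) for topic_id, score in enumerate(scores)]
    kept = [pair for pair in proportions if pair[1] >= threshold]
    # Heaviest topics first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


topic_word_matrix = [[1.0, 0.0], [0.0, 1.0]]
topics = document_topics([0.8, 0.2], topic_word_matrix, threshold=0.1)
print([(topic_id, round(weight, 2)) for topic_id, weight in topics])  # [(0, 0.8), (1, 0.2)]
```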
print_topics(top_n=10)[source]

Print top words for each topic.

Return type:

None

Args:

top_n: Number of top words to print per topic

Examples:

>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=1)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> model.print_topics(top_n=2)

Topic 0:
  word...: 1.0000
  word...: 0.0000
topic_word_matrix: ndarray | None
topics: list[Topic]
vocabulary: list[str]

Utilities

class chronowords.utils.probabilistic_counter.CountMinSketch(width=1000000, depth=5, seed=42, track_keys=True)[source]

Bases: object

Count-Min Sketch implementation for memory-efficient counting.

Uses multiple hash functions to approximate frequencies with bounded error.

Memory usage: width * depth * 4 bytes
Error bound: ≈ 2/width of the total count, with probability 1 - 1/2^depth

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.width
1000
>>> cms.depth
5
__init__(width=1000000, depth=5, seed=42, track_keys=True)[source]

Initialize Count-Min Sketch.

Args:

width: Number of counters per hash function (controls accuracy)
depth: Number of hash functions (controls probability bound)
seed: Random seed for hash function initialization
track_keys: Whether to track observed keys (disable for memory savings)

_hash_indices(key)[source]

Compute hash indices for all rows at once.

Return type:

ndarray

property arrays: tuple[ndarray, list[int], int]

Get raw arrays and parameters for Cython code.

Examples

>>> cms = CountMinSketch(width=3, depth=2, seed=42)
>>> counts, seeds, width = cms.arrays
>>> counts.shape
(2, 3)
>>> isinstance(seeds, list)
True
>>> width
3
estimate_error(confidence=0.95)[source]

Estimate maximum counting error.

Return type:

float

Args:

confidence: Confidence level for the error bound

Returns:

Maximum expected counting error at given confidence level

Examples:

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> for _ in range(1000):
...     cms.update("word")
>>> error = cms.estimate_error(confidence=0.95)
>>> error > 0  # Should have some error estimation
True
>>> error < cms.total  # Error should be less than total counts
True
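
The memory and error figures quoted in the class description can be turned into concrete numbers with a little arithmetic. The sketch below reads the error bound as an additive overcount of about 2 * total / width; whether `estimate_error` uses this exact formula, or how it maps `confidence` onto it, is not confirmed here. The function name `sketch_budget` is hypothetical.

```python
def sketch_budget(width, depth, total):
    """Hypothetical helper: (memory in bytes, max overcount, probability the bound holds)."""
    memory_bytes = width * depth * 4     # 4-byte counters, per the class description
    max_overcount = 2 * total / width    # additive error, measured in counts
    confidence = 1 - 1 / 2 ** depth      # probability that the bound holds
    return memory_bytes, max_overcount, confidence


print(sketch_budget(width=1000, depth=5, total=1000))  # (20000, 2.0, 0.96875)
```

With the default `width=1000000` and `depth=5`, the same arithmetic gives 20,000,000 bytes of counters.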
get_heavy_hitters(threshold)[source]

Get items that appear more than threshold * total times.

Return type:

list[tuple[str, int]]

Args:

threshold: Minimum frequency as fraction of total counts

Returns:

List of (item, count) pairs sorted by count descending

Raises:

RuntimeError: If track_keys was disabled

Examples:

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> # Add a frequent word
>>> for _ in range(100):
...     cms.update("frequent")
>>> # Add some less frequent words
>>> for _ in range(10):
...     cms.update("rare")
>>> heavy = cms.get_heavy_hitters(threshold=0.05)  # 5% threshold
>>> len(heavy) > 0
True
>>> "frequent" == heavy[0][0]  # Most frequent word
True
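
The selection rule can be sketched independently of the sketch data structure. In the example below a plain dictionary stands in for "estimated count of every tracked key", which is an assumption made to keep the example self-contained; `heavy_hitters` is a hypothetical free function, not the method above.

```python
def heavy_hitters(counts, threshold):
    """Hypothetical helper: `counts` maps each tracked key to its estimated count."""
    total = sum(counts.values())
    cutoff = threshold * total
    # Keep items that appear more than threshold * total times.
    hits = [(key, count) for key, count in counts.items() if count > cutoff]
    return sorted(hits, key=lambda item: item[1], reverse=True)


counts = {"rare": 10, "frequent": 100, "once": 1}
print(heavy_hitters(counts, threshold=0.05))  # [('frequent', 100), ('rare', 10)]
```

Here the total is 111, so the 5% cutoff is 5.55 and "once" is excluded. The rule needs the set of observed keys, which is why the method raises RuntimeError when `track_keys` is disabled.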
merge(other)[source]

Merge another sketch into this one.

Return type:

None

Examples

>>> cms1 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms2 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms1.update("word", count=3)
>>> cms2.update("word", count=2)
>>> cms1.merge(cms2)
>>> cms1.query("word")
5
>>> cms1.total
5
>>> # Error case - incompatible sketches
>>> cms3 = CountMinSketch(width=500, depth=5, seed=42)
>>> cms1.merge(cms3)
Traceback (most recent call last):
ValueError: Can only merge compatible sketches
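
Merging works because two sketches built with the same hash functions map every key to the same counters, so their tables can be added cell by cell. The sketch below assumes that "compatible" means equal width, depth, and seed; the class `MiniSketch` is a hypothetical stand-in that holds only what a merge needs.

```python
class MiniSketch:
    """Hypothetical stand-in holding only the state a merge touches."""

    def __init__(self, width, depth, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.counts = [[0] * width for _ in range(depth)]
        self.total = 0

    def merge(self, other):
        # Assumption: compatibility means identical width, depth, and seed.
        mine = (self.width, self.depth, self.seed)
        theirs = (other.width, other.depth, other.seed)
        if mine != theirs:
            raise ValueError("Can only merge compatible sketches")
        # Add the other table into this one, cell by cell.
        for row, other_row in zip(self.counts, other.counts):
            for i, value in enumerate(other_row):
                row[i] += value
        self.total += other.total


a = MiniSketch(width=8, depth=2)
b = MiniSketch(width=8, depth=2)
a.counts[0][1], a.counts[1][5], a.total = 3, 3, 3
b.counts[0][1], b.counts[1][5], b.total = 2, 2, 2
a.merge(b)
print(a.counts[0][1], a.counts[1][5], a.total)  # 5 5 5
```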
query(key)[source]

Query count for a key.

Return type:

int

Examples

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("rare_word")
>>> cms.query("rare_word")
1
>>> cms.query("unseen_word")
0
update(key, count=1)[source]

Update count for a key.

Return type:

None

Args:

key: Item to count (string or bytes)
count: Amount to increment (default: 1)

Examples:

>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("apple")
>>> cms.update("apple")
>>> cms.query("apple")
2
>>> cms.update("banana", count=5)
>>> cms.query("banana")
5
>>> cms.total
7
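
To make the update and query mechanics concrete, here is a minimal pure-Python Count-Min Sketch. It is an illustrative sketch only: the class name `TinyCountMinSketch` and the choice of seeded BLAKE2b hashes are assumptions, and the chronowords implementation (which exposes arrays for Cython code) is likely organized differently.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative sketch only; not the chronowords implementation."""

    def __init__(self, width=1000, depth=5, seed=42):
        self.width, self.depth, self.seed = width, depth, seed
        self.counts = [[0] * width for _ in range(depth)]
        self.total = 0

    def _indices(self, key):
        data = key.encode("utf-8") if isinstance(key, str) else key
        for row in range(self.depth):
            # Assumption: one hash per row, derived from the seed and the row number.
            prefix = f"{self.seed}:{row}:".encode("utf-8")
            digest = hashlib.blake2b(prefix + data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        # Increment one counter in every row.
        for row, index in enumerate(self._indices(key)):
            self.counts[row][index] += count
        self.total += count

    def query(self, key):
        # Collisions can only inflate a counter, so the minimum is the tightest estimate.
        return min(self.counts[row][index] for row, index in enumerate(self._indices(key)))


cms = TinyCountMinSketch(width=50, depth=3)
cms.update("apple")
cms.update("apple", count=4)
print(cms.query("apple"))  # 5 (exact, because only one key has been inserted)
```

Once several keys share the table, `query` can overestimate a count but never underestimate it, which is the property the error bound in the class description quantifies.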