API Reference
Word Embeddings
Alignment
Procrustes alignment for comparing word embeddings from different time periods.
- class chronowords.alignment.procrustes.AlignmentMetrics(average_cosine_similarity, num_aligned_words, alignment_error)[source]
Bases:
object
Container for alignment quality metrics.
- Fields:
average_cosine_similarity: Mean cosine similarity between aligned word pairs
num_aligned_words: Number of words successfully aligned
alignment_error: Frobenius norm of the difference between aligned matrices
Examples
>>> metrics = AlignmentMetrics(0.85, 1000, 0.15)
>>> metrics.average_cosine_similarity
0.85
>>> metrics.num_aligned_words
1000
>>> metrics.alignment_error
0.15
- __init__(average_cosine_similarity, num_aligned_words, alignment_error)
- alignment_error: float
- average_cosine_similarity: float
- num_aligned_words: int
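As an illustration of what the three fields measure, here is a minimal sketch that computes them from two already-aligned matrices. The helper name alignment_metrics is hypothetical and not part of chronowords; the library's own computation may differ in detail.

```python
import numpy as np

def alignment_metrics(aligned_source, target):
    """Return (mean cosine similarity, number of rows, Frobenius error)."""
    aligned_source = np.asarray(aligned_source, dtype=float)
    target = np.asarray(target, dtype=float)
    # Row-wise cosine similarity between corresponding word vectors.
    dots = np.sum(aligned_source * target, axis=1)
    norms = np.linalg.norm(aligned_source, axis=1) * np.linalg.norm(target, axis=1)
    cosines = dots / norms
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    error = np.linalg.norm(aligned_source - target)
    return float(cosines.mean()), aligned_source.shape[0], float(error)

# Same directions, but the second source vector is twice as long.
avg_cos, n_words, err = alignment_metrics(
    [[1.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Cosine similarity ignores vector length while the Frobenius error does not, so the two metrics can disagree, as they do in this toy input.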
- class chronowords.alignment.procrustes.ProcrustesAligner(min_freq_rank=None, max_freq_rank=1000)[source]
Bases:
object
Aligns word embeddings from different time periods using Procrustes analysis.
Finds the optimal orthogonal transformation to align embeddings while preserving distances.
Example:
aligner = ProcrustesAligner()
metrics = aligner.fit(
    embeddings_1800, embeddings_1850, vocab_1800, vocab_1850
)
aligned_embeddings = aligner.transform(embeddings_1800)
- __init__(min_freq_rank=None, max_freq_rank=1000)[source]
Initialize the aligner.
Args:
min_freq_rank: Minimum frequency rank for anchor words
max_freq_rank: Maximum frequency rank for anchor words
Examples:
>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=10)
>>> aligner.min_freq_rank
0
>>> aligner.max_freq_rank
10
- anchors: dict[str, tuple[int, int]]
- find_common_words(source_vocab, target_vocab)[source]
Find common words between source and target vocabularies.
Uses frequency rank filtering (min_freq_rank to max_freq_rank) to select stable anchor words for alignment.
- Return type:
list[str]
Returns
List of common words sorted alphabetically
Examples
>>> aligner = ProcrustesAligner(min_freq_rank=0, max_freq_rank=2)
>>> source = ['the', 'in', 'a', 'rare']
>>> target = ['in', 'the', 'new', 'a']
>>> aligner.find_common_words(source, target)
['in', 'the']
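The frequency-rank filter can be sketched as follows. This assumes each vocabulary list is ordered from most to least frequent, so that a word's list index is its rank, and it treats max_rank as an exclusive bound. Both are assumptions of this sketch; common_anchor_words is a hypothetical helper, not the library's implementation.

```python
def common_anchor_words(source_vocab, target_vocab, min_rank=None, max_rank=1000):
    # Assumption: list position equals frequency rank (0 = most frequent).
    start = min_rank or 0
    source_slice = set(source_vocab[start:max_rank])
    target_slice = set(target_vocab[start:max_rank])
    # Keep only words inside the rank window in BOTH vocabularies.
    return sorted(source_slice & target_slice)

source = ["the", "in", "a", "rare"]
target = ["in", "the", "new", "a"]
anchors = common_anchor_words(source, target, min_rank=0, max_rank=2)
# anchors == ["in", "the"]; "a" is shared but falls outside the rank window
```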
- fit(source_embeddings, target_embeddings, source_vocab, target_vocab, anchor_words=None)[source]
Learn the orthogonal transformation matrix using Procrustes analysis.
- Return type:
AlignmentMetrics
Args:
source_embeddings: Source space word embeddings
target_embeddings: Target space word embeddings
source_vocab: Vocabulary list for source embeddings
target_vocab: Vocabulary list for target embeddings
anchor_words: Optional list of specific words to use for alignment. If None, uses common words filtered by frequency rank.
Returns:
AlignmentMetrics containing quality measures of the alignment
Examples:
>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[0., 1.], [-1., 0.]])  # 90 degree rotation
>>> vocab = ['word1', 'word2']
>>> metrics = aligner.fit(source_emb, target_emb, vocab, vocab, ['word1', 'word2'])
>>> metrics.num_aligned_words
2
>>> round(metrics.average_cosine_similarity, 2)
1.0
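For reference, the textbook solution to the orthogonal Procrustes problem takes the SVD of the cross-covariance matrix of the anchor rows. The sketch below shows that standard construction on the rotation example. The function name orthogonal_procrustes is local to this sketch, and the library may add preprocessing (such as centering or normalization) that is not shown.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Orthogonal W minimizing ||source @ W - target|| in the Frobenius norm."""
    # SVD of the cross-covariance matrix between the two anchor sets.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

source = np.array([[1.0, 0.0], [0.0, 1.0]])
target = np.array([[0.0, 1.0], [-1.0, 0.0]])  # source rotated by 90 degrees

w = orthogonal_procrustes(source, target)
aligned = source @ w  # rows of `source` mapped into the target space
```

Because w is orthogonal (w.T @ w is the identity), the mapping rotates or reflects the source space without stretching it.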
- get_word_similarity(word, source_emb, target_emb)[source]
Get similarity between word representations in source and target spaces.
Args:
word: Word to compare
source_emb: Source embeddings
target_emb: Target embeddings
Returns:
Cosine similarity in [-1, 1] between the aligned vectors; higher values indicate more similar usage between periods. Returns None if the word is missing from either vocabulary.
Examples:
>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> aligner.source_words = ['cat', 'dog']  # Set after initialization
>>> aligner.target_words = ['cat', 'dog']  # Set after initialization
>>> aligner.orthogonal_matrix = np.eye(2)
>>> source_emb = np.array([[1., 0.], [0., 1.]])
>>> target_emb = np.array([[1., 0.], [0., 1.]])
>>> round(aligner.get_word_similarity('cat', source_emb, target_emb), 2)
1.0
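A minimal sketch of the comparison described above: look the word up in both vocabularies, map the source vector through the learned matrix, and take the cosine with the target vector. The standalone function word_similarity and its argument order are hypothetical, and returning 0.0 for a zero-length vector is a choice made for this sketch only.

```python
import numpy as np

def word_similarity(word, source_emb, target_emb, source_words, target_words, w):
    # A word missing from either vocabulary cannot be compared.
    if word not in source_words or word not in target_words:
        return None
    src = source_emb[source_words.index(word)] @ w  # map into the target space
    tgt = target_emb[target_words.index(word)]
    denom = np.linalg.norm(src) * np.linalg.norm(tgt)
    if denom == 0:
        return 0.0  # sketch-only convention for zero vectors
    return float(src @ tgt / denom)

words = ["cat", "dog"]
source_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
target_emb = np.array([[1.0, 0.0], [1.0, 1.0]])  # "dog" has drifted
w = np.eye(2)

cat_sim = word_similarity("cat", source_emb, target_emb, words, words, w)
dog_sim = word_similarity("dog", source_emb, target_emb, words, words, w)
missing = word_similarity("bird", source_emb, target_emb, words, words, w)
```

Here "cat" keeps its direction, "dog" moves by 45 degrees and scores lower, and "bird" yields None.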
- orthogonal_matrix: ndarray | None
- source_words: list[str]
- target_words: list[str]
- transform(embeddings)[source]
Apply the learned transformation to align embeddings.
- Return type:
ndarray
Args:
embeddings: Embeddings to transform
Returns:
Transformed embeddings in the target space
Examples:
>>> import numpy as np
>>> aligner = ProcrustesAligner()
>>> # No need to set source_words/target_words since we're just testing transform
>>> aligner.orthogonal_matrix = np.array([[0, 1], [-1, 0]])  # 90 degree rotation
>>> embeddings = np.array([[1, 0], [0, 1]])
>>> aligned = aligner.transform(embeddings)
>>> np.allclose(aligned, np.array([[0, 1], [-1, 0]]))
True
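The claim that an orthogonal transformation preserves distances can be checked directly. The sketch below assumes the transformation is applied as a right multiplication of row vectors (embeddings @ matrix), which is consistent with the example above but not confirmed by it. pairwise_distances is a helper defined only for this illustration.

```python
import numpy as np

rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])  # 90 degree rotation
embeddings = np.array([[3.0, 0.0], [0.0, 4.0], [1.0, 1.0]])
aligned = embeddings @ rotation  # assumed form of the transform

def pairwise_distances(x):
    # Euclidean distance between every pair of rows.
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

before = pairwise_distances(embeddings)
after = pairwise_distances(aligned)
# The coordinates change, but every pairwise distance stays the same.
```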
Topic Modeling
Topic modeling using NMF on PPMI matrices with support for temporal alignment.
- class chronowords.topics.nmf.AlignedTopic(source_topic, target_topic, similarity)[source]
Bases:
object
Container for aligned topic pairs.
- Fields:
source_topic: Topic from source time period
target_topic: Topic from target time period
similarity: Cosine similarity between topics
Examples
>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic1 = Topic(1, [('cat', 0.5)], dist)
>>> topic2 = Topic(2, [('dog', 0.4)], dist)
>>> aligned = AlignedTopic(topic1, topic2, 0.8)
>>> aligned.source_topic.id
1
>>> aligned.target_topic.id
2
>>> aligned.similarity
0.8
- __init__(source_topic, target_topic, similarity)
- similarity: float
- class chronowords.topics.nmf.Topic(id, words, distribution)[source]
Bases:
object
Container for topic information.
- Fields:
id: Unique topic identifier
words: List of (word, weight) pairs for top words
distribution: Full probability distribution over vocabulary
Examples
>>> import numpy as np
>>> dist = np.array([0.5, 0.3, 0.2])
>>> topic = Topic(1, [('cat', 0.5), ('dog', 0.3)], dist)
>>> topic.id
1
>>> topic.words
[('cat', 0.5), ('dog', 0.3)]
>>> np.allclose(topic.distribution, [0.5, 0.3, 0.2])
True
- __init__(id, words, distribution)
- distribution: ndarray
- id: int
- words: list[tuple[str, float]]
- class chronowords.topics.nmf.TopicModel(n_topics=10, max_iter=500, min_similarity=0.1)[source]
Bases:
object
Topic model using NMF on PPMI matrices.
Supports temporal alignment of topics between different time periods.
- __init__(n_topics=10, max_iter=500, min_similarity=0.1)[source]
Initialize topic model.
Args:
n_topics: Number of topics to extract
max_iter: Maximum number of iterations for NMF
min_similarity: Minimum similarity for topic alignment
Examples:
>>> model = TopicModel(n_topics=5, max_iter=100)
>>> model.n_topics
5
>>> model.max_iter
100
- _align_distributions(topic1, topic2, vocab1, vocab2)[source]
Align two topic distributions to use the same vocabulary space.
- Return type:
tuple[ndarray,ndarray]
Args:
topic1: First topic
topic2: Second topic
vocab1: Vocabulary for first topic
vocab2: Vocabulary for second topic
Returns:
Tuple of aligned distributions
Examples:
>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([0.6, 0.4])
>>> dist2 = np.array([0.3, 0.7])
>>> t1 = Topic(1, [('cat', 0.6), ('dog', 0.4)], dist1)
>>> t2 = Topic(2, [('dog', 0.3), ('bird', 0.7)], dist2)
>>> aligned1, aligned2 = model._align_distributions(
...     t1, t2, ['cat', 'dog'], ['dog', 'bird']
... )
>>> len(aligned1) == len(aligned2)  # Same length after alignment
True
>>> np.allclose(aligned1.sum(), 1.0)  # Still normalized
True
>>> np.allclose(aligned2.sum(), 1.0)
True
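One way to place two distributions in the same vocabulary space is to re-index both over the union of the two vocabularies and renormalize. The sketch below does that. Whether the library uses the union or only the shared words is not stated here, so treat the union as an assumption, and align_distributions as a hypothetical standalone helper.

```python
def align_distributions(dist1, vocab1, dist2, vocab2):
    # Assumption: the shared space is the sorted union of both vocabularies.
    shared = sorted(set(vocab1) | set(vocab2))
    index = {word: i for i, word in enumerate(shared)}

    def project(dist, vocab):
        out = [0.0] * len(shared)
        for word, weight in zip(vocab, dist):
            out[index[word]] += weight
        total = sum(out)
        # Renormalize so the result is still a probability distribution.
        return [v / total for v in out] if total > 0 else out

    return shared, project(dist1, vocab1), project(dist2, vocab2)

shared, aligned1, aligned2 = align_distributions(
    [0.6, 0.4], ["cat", "dog"],
    [0.3, 0.7], ["dog", "bird"],
)
# shared == ["bird", "cat", "dog"]; words absent from a vocabulary get weight 0
```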
- _compute_topic_similarity(topic1, topic2)[source]
Compute cosine similarity between topic distributions.
- Return type:
float
Args:
topic1: First topic
topic2: Second topic
Returns:
Cosine similarity between the topics
Examples:
>>> import numpy as np
>>> model = TopicModel()
>>> dist1 = np.array([1, 0])
>>> dist2 = np.array([0, 1])
>>> t1 = Topic(1, [('cat', 1.0)], dist1)
>>> t2 = Topic(2, [('dog', 1.0)], dist2)
>>> sim = model._compute_topic_similarity(t1, t2)
>>> round(sim, 1)
0.0
- align_with(other)[source]
Align topics with another model using Hungarian algorithm.
- Return type:
list[AlignedTopic]
Args:
other: Another fitted TopicModel
Returns:
List of aligned topic pairs sorted by similarity
Raises:
ValueError: If either model is not fitted
Examples:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model1 = TopicModel(n_topics=2)
>>> model2 = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model1.fit(ppmi, ['word1', 'word2'])
>>> model2.fit(ppmi, ['word1', 'word2'])
>>> aligned = model1.align_with(model2)
>>> len(aligned) > 0
True
>>> isinstance(aligned[0], AlignedTopic)
True
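The matching step is a one-to-one assignment that maximizes total similarity. To keep the sketch dependency-free, the code below uses an exhaustive search over permutations in place of the Hungarian algorithm. It finds the same optimal assignment but scales factorially, so it is for illustration only. The function names, the equal topic counts, and the use of min_similarity as a post-filter are assumptions of this sketch.

```python
import math
from itertools import permutations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_topics(source_topics, target_topics, min_similarity=0.1):
    # Assumes both models have the same number of topics.
    n = len(source_topics)
    sims = [[cosine(s, t) for t in target_topics] for s in source_topics]
    # Brute-force stand-in for the Hungarian algorithm: try every assignment.
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(sims[i][perm[i]] for i in range(n)),
    )
    pairs = [(i, best[i], sims[i][best[i]]) for i in range(n)]
    pairs = [p for p in pairs if p[2] >= min_similarity]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

source_topics = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
target_topics = [[0.1, 0.1, 0.8], [0.8, 0.2, 0.0]]
matches = match_topics(source_topics, target_topics)
# Source topic 0 pairs with target topic 1, and source 1 with target 0.
```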
- fit(ppmi_matrix, vocabulary, top_n_words=10)[source]
Fit topic model to PPMI matrix.
- Return type:
None
Args:
ppmi_matrix: Sparse PPMI matrix from word embeddings
vocabulary: List of words corresponding to matrix columns
top_n_words: Number of top words to store per topic
Examples:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> len(model.topics)
2
>>> isinstance(model.topics[0], Topic)
True
>>> len(model.vocabulary)
2
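To show what an NMF factorization does, here is a small sketch using the classic Lee-Seung multiplicative updates on a dense matrix. This is one well-known NMF solver and not necessarily the one the library uses. The function nmf, the random initialization, and the eps guard are assumptions of this sketch.

```python
import numpy as np

def nmf(v, n_topics, max_iter=500, seed=0, eps=1e-10):
    """Factor v (rows x vocabulary) into non-negative w @ h."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = v.shape
    w = rng.random((n_rows, n_topics)) + eps
    h = rng.random((n_topics, n_cols)) + eps  # rows of h play the role of topics
    errors = [np.linalg.norm(v - w @ h)]
    for _ in range(max_iter):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
        errors.append(np.linalg.norm(v - w @ h))
    return w, h, errors

v = np.array([[1.0, 0.0], [0.0, 1.0]])
w, h, errors = nmf(v, n_topics=2)
# errors[0] is the reconstruction error at initialization, errors[-1] after fitting.
```

In this formulation each row of h is an unnormalized weighting over the vocabulary. Dividing a row by its sum would turn it into a distribution like the one stored on a Topic.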
- get_document_topics(doc_vector, threshold=0.1)[source]
Get topic distribution for a document vector.
- Return type:
list[tuple[int,float]]
Args:
doc_vector: Document vector in vocabulary space
threshold: Minimum topic proportion to include
Returns:
List of (topic_id, weight) pairs above threshold, sorted by weight
Examples:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=2)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> doc = np.array([0.8, 0.2])
>>> topics = model.get_document_topics(doc, threshold=0.1)
>>> len(topics) > 0
True
>>> all(w >= 0.1 for _, w in topics)
True
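The thresholding and sorting can be illustrated with a simple projection. This sketch scores each topic by the dot product of the document vector with that topic's word weights and normalizes the scores to proportions. How the library actually infers the proportions is not documented here, so the scoring rule and the name document_topics are assumptions.

```python
def document_topics(doc_vector, topic_word_matrix, threshold=0.1):
    # Assumed scoring rule: dot product of the document with each topic row.
    scores = [
        sum(d * t for d, t in zip(doc_vector, row))
        for row in topic_word_matrix
    ]
    total = sum(scores)
    if total == 0:
        return []
    weights = [(topic_id, score / total) for topic_id, score in enumerate(scores)]
    kept = [(topic_id, w) for topic_id, w in weights if w >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

topic_word_matrix = [[1.0, 0.0], [0.0, 1.0]]
all_topics = document_topics([0.8, 0.2], topic_word_matrix, threshold=0.1)
main_topics = document_topics([0.8, 0.2], topic_word_matrix, threshold=0.5)
# all_topics keeps both topics; main_topics keeps only topic 0
```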
- print_topics(top_n=10)[source]
Print top words for each topic.
- Return type:
None
Args:
top_n: Number of top words to print per topic
Examples:
>>> from scipy.sparse import csr_matrix
>>> model = TopicModel(n_topics=1)
>>> ppmi = csr_matrix([[1, 0], [0, 1]])
>>> model.fit(ppmi, ['word1', 'word2'])
>>> model.print_topics(top_n=2)
Topic 0:
word...: 1.0000
word...: 0.0000
- topic_word_matrix: ndarray | None
- vocabulary: list[str]
Utilities
- class chronowords.utils.probabilistic_counter.CountMinSketch(width=1000000, depth=5, seed=42, track_keys=True)[source]
Bases:
object
Count-Min Sketch implementation for memory-efficient counting.
Uses multiple hash functions to approximate frequencies with bounded error.
Memory usage: width * depth * 4 bytes
Error bound: ≈ 2/width (as a fraction of the total count) with probability 1 - 1/2^depth
Examples
>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.width
1000
>>> cms.depth
5
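The memory and error formulas above can be worked through for the default parameters. The sketch reads the error bound as a fraction of the total count, which is an interpretation rather than something the class documents explicitly. Both helper functions are hypothetical.

```python
def sketch_memory_bytes(width, depth, bytes_per_counter=4):
    # One counter table of shape (depth, width).
    return width * depth * bytes_per_counter

def sketch_error_bound(width, depth, total):
    # Assumed reading: overcount <= 2 * total / width,
    # holding with probability 1 - 1/2^depth.
    max_overcount = 2 * total / width
    confidence = 1 - 1 / 2 ** depth
    return max_overcount, confidence

memory = sketch_memory_bytes(width=1_000_000, depth=5)
overcount, confidence = sketch_error_bound(
    width=1_000_000, depth=5, total=10_000_000
)
# memory == 20_000_000 bytes (about 20 MB)
# overcount == 20.0 and confidence == 0.96875
```

Under this reading, widening the table tightens the error and adding rows raises the confidence, which matches the parameter descriptions for width and depth.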
- __init__(width=1000000, depth=5, seed=42, track_keys=True)[source]
Initialize Count-Min Sketch.
Args:
width: Number of counters per hash function (controls accuracy)
depth: Number of hash functions (controls probability bound)
seed: Random seed for hash function initialization
track_keys: Whether to track observed keys (disable for memory savings)
- property arrays: tuple[ndarray, list[int], int]
Get raw arrays and parameters for Cython code.
Examples
>>> cms = CountMinSketch(width=3, depth=2, seed=42)
>>> counts, seeds, width = cms.arrays
>>> counts.shape
(2, 3)
>>> isinstance(seeds, list)
True
>>> width
3
- estimate_error(confidence=0.95)[source]
Estimate maximum counting error.
- Return type:
float
Args:
confidence: Confidence level for the error bound
Returns:
Maximum expected counting error at given confidence level
Examples:
>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> for _ in range(1000):
...     cms.update("word")
>>> error = cms.estimate_error(confidence=0.95)
>>> error > 0  # Should have some error estimation
True
>>> error < cms.total  # Error should be less than total counts
True
- get_heavy_hitters(threshold)[source]
Get items that appear more than threshold * total times.
- Return type:
list[tuple[str,int]]
Args:
threshold: Minimum frequency as fraction of total counts
Returns:
List of (item, count) pairs sorted by count descending
Raises:
RuntimeError: If track_keys was disabled
Examples:
>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> # Add a frequent word
>>> for _ in range(100):
...     cms.update("frequent")
>>> # Add some less frequent words
>>> for _ in range(10):
...     cms.update("rare")
>>> heavy = cms.get_heavy_hitters(threshold=0.05)  # 5% threshold
>>> len(heavy) > 0
True
>>> "frequent" == heavy[0][0]  # Most frequent word
True
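The selection rule itself is simple: keep every tracked key whose estimated count exceeds threshold * total, then sort by count. The sketch below applies it to a plain dictionary of counts that stands in for the sketch's tracked keys. heavy_hitters is a hypothetical standalone function.

```python
def heavy_hitters(estimated_counts, total, threshold):
    cutoff = threshold * total
    # Strictly "more than" threshold * total, as described above.
    hits = [
        (item, count)
        for item, count in estimated_counts.items()
        if count > cutoff
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

counts = {"frequent": 100, "rare": 10, "once": 1}
heavy = heavy_hitters(counts, total=111, threshold=0.05)
# cutoff is 5.55, so "frequent" and "rare" qualify and "once" does not
```

This also shows why key tracking is required: the counter table alone cannot enumerate which keys were inserted.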
- merge(other)[source]
Merge another sketch into this one.
- Return type:
None
Examples
>>> cms1 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms2 = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms1.update("word", count=3)
>>> cms2.update("word", count=2)
>>> cms1.merge(cms2)
>>> cms1.query("word")
5
>>> cms1.total
5
>>> # Error case - incompatible sketches
>>> cms3 = CountMinSketch(width=500, depth=5, seed=42)
>>> cms1.merge(cms3)
Traceback (most recent call last):
ValueError: Can only merge compatible sketches
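Merging works because two sketches with the same shape and the same hash seeds map every key to the same cells, so their counter tables can be added cell by cell. The sketch below uses plain nested lists. The function merge_tables and the exact compatibility check are assumptions, though the error message mirrors the one shown above.

```python
def merge_tables(counts_a, seeds_a, counts_b, seeds_b):
    same_shape = len(counts_a) == len(counts_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(counts_a, counts_b)
    )
    # Different seeds mean different hash functions, so cells would not line up.
    if not same_shape or seeds_a != seeds_b:
        raise ValueError("Can only merge compatible sketches")
    return [
        [x + y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(counts_a, counts_b)
    ]

table_a = [[3, 0], [0, 3]]  # depth 2, width 2
table_b = [[2, 0], [0, 2]]
merged = merge_tables(table_a, [42, 43], table_b, [42, 43])
# merged == [[5, 0], [0, 5]]
```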
- query(key)[source]
Query count for a key.
- Return type:
int
Examples
>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("rare_word")
>>> cms.query("rare_word")
1
>>> cms.query("unseen_word")
0
- update(key, count=1)[source]
Update count for a key.
- Return type:
None
Args:
key: Item to count (string or bytes)
count: Amount to increment (default: 1)
Examples:
>>> cms = CountMinSketch(width=1000, depth=5, seed=42)
>>> cms.update("apple")
>>> cms.update("apple")
>>> cms.query("apple")
2
>>> cms.update("banana", count=5)
>>> cms.query("banana")
5
>>> cms.total
7
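For readers unfamiliar with the data structure, the update and query logic can be sketched in pure Python. TinyCountMinSketch is a hypothetical teaching class, not the library's implementation. It uses salted BLAKE2b hashes from the standard library, whereas the real class hands its arrays to Cython code and may hash differently.

```python
import hashlib

class TinyCountMinSketch:
    def __init__(self, width=1000, depth=5, seed=42):
        self.width = width
        self.depth = depth
        # One seed per row gives each row its own hash function.
        self.seeds = [seed + row for row in range(depth)]
        self.counts = [[0] * width for _ in range(depth)]
        self.total = 0

    def _column(self, key, row_seed):
        data = key if isinstance(key, bytes) else key.encode("utf-8")
        salt = row_seed.to_bytes(8, "little")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key, count=1):
        # Increment one cell in every row.
        for row, row_seed in enumerate(self.seeds):
            self.counts[row][self._column(key, row_seed)] += count
        self.total += count

    def query(self, key):
        # Collisions only ever inflate a cell, so the minimum is the best estimate.
        return min(
            self.counts[row][self._column(key, row_seed)]
            for row, row_seed in enumerate(self.seeds)
        )

cms = TinyCountMinSketch(width=1000, depth=5, seed=42)
cms.update("apple")
cms.update("apple")
cms.update("banana", count=5)
apple_estimate = cms.query("apple")  # never below the true count of 2
```

Taking the minimum across rows is what gives the structure its one-sided error: estimates can be too high but never too low.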