This function computes document embeddings with sentence-transformers, configures the UMAP, HDBSCAN, and CountVectorizer components, optionally attaches a representation model, and fits a BERTopic model from R. The returned model can be used with the other bertopicr helpers.

Usage

train_bertopic_model(
  docs,
  embedding_model = "Qwen/Qwen3-Embedding-0.6B",
  embeddings = NULL,
  embedding_batch_size = 32,
  embedding_show_progress = TRUE,
  umap_model = NULL,
  umap_n_neighbors = 15,
  umap_n_components = 5,
  umap_min_dist = 0,
  umap_metric = "cosine",
  umap_random_state = 42,
  hdbscan_model = NULL,
  hdbscan_min_cluster_size = 50,
  hdbscan_min_samples = 20,
  hdbscan_metric = "euclidean",
  hdbscan_cluster_selection_method = "eom",
  hdbscan_gen_min_span_tree = TRUE,
  hdbscan_prediction_data = TRUE,
  hdbscan_core_dist_n_jobs = 1,
  vectorizer_model = NULL,
  stop_words = "all_stopwords",
  ngram_range = c(1, 3),
  min_df = 2L,
  max_df = 50L,
  max_features = 10000,
  strip_accents = NULL,
  decode_error = "strict",
  encoding = "UTF-8",
  representation_model = c("none", "keybert", "mmr", "ollama"),
  representation_params = list(),
  ollama_model = NULL,
  ollama_base_url = "http://localhost:11434/v1",
  ollama_api_key = "ollama",
  ollama_client_params = list(),
  ollama_prompt = NULL,
  top_n_words = 200L,
  calculate_probabilities = TRUE,
  verbose = TRUE,
  seed = NULL,
  timestamps = NULL,
  topics_over_time_nr_bins = 20L,
  topics_over_time_global_tuning = TRUE,
  topics_over_time_evolution_tuning = TRUE,
  classes = NULL,
  compute_reduced_embeddings = TRUE,
  reduced_embedding_n_neighbors = 10L,
  reduced_embedding_min_dist = 0,
  reduced_embedding_metric = "cosine",
  compute_hierarchical_topics = TRUE,
  bertopic_args = list()
)

Arguments

docs

Character vector of documents to model.

embedding_model

Sentence-transformers model name or local path.

embeddings

Optional precomputed embeddings (matrix or array).

embedding_batch_size

Batch size for embedding encoding.

embedding_show_progress

Logical. Show embedding progress bar.

umap_model

Optional pre-built UMAP Python object; if NULL, one is created from the umap_* arguments. A construction sketch follows this argument list.

umap_n_neighbors

Number of neighbors for UMAP.

umap_n_components

Number of UMAP components.

umap_min_dist

UMAP min_dist parameter.

umap_metric

UMAP metric.

umap_random_state

Random state for UMAP.

hdbscan_model

Optional pre-built HDBSCAN Python object. If NULL, one is created.

hdbscan_min_cluster_size

HDBSCAN min_cluster_size.

hdbscan_min_samples

HDBSCAN min_samples.

hdbscan_metric

HDBSCAN metric.

hdbscan_cluster_selection_method

HDBSCAN cluster selection method.

hdbscan_gen_min_span_tree

HDBSCAN gen_min_span_tree.

hdbscan_prediction_data

Logical. Whether to generate prediction data.

hdbscan_core_dist_n_jobs

HDBSCAN core_dist_n_jobs.

vectorizer_model

Optional pre-built CountVectorizer Python object.

stop_words

Stop words for CountVectorizer. Use "all_stopwords" to load the bundled multilingual list, "english" for scikit-learn's built-in English list, or supply a character vector of stop words.

ngram_range

Length-2 integer vector for n-gram range.

min_df

Minimum document frequency for CountVectorizer.

max_df

Maximum document frequency for CountVectorizer.

max_features

Maximum features for CountVectorizer.

strip_accents

Passed to CountVectorizer. Use NULL to preserve umlauts.

decode_error

Passed to CountVectorizer when decoding input bytes.

encoding

Text encoding for CountVectorizer (defaults to "UTF-8").

representation_model

Representation model to use: "none", "keybert", "mmr", or "ollama".

representation_params

Named list of parameters passed to the representation model.

ollama_model

Ollama model name when representation_model = "ollama".

ollama_base_url

Base URL for the Ollama OpenAI-compatible endpoint.

ollama_api_key

API key placeholder for the Ollama OpenAI-compatible endpoint.

ollama_client_params

Named list of extra parameters passed to openai$OpenAI().

ollama_prompt

Optional prompt template for the Ollama OpenAI representation.

top_n_words

Number of top words per topic to keep in the model.

calculate_probabilities

Logical. Whether to calculate topic probabilities.

verbose

Logical. Verbosity for BERTopic.

seed

Optional random seed.

timestamps

Optional vector of timestamps (Date/POSIXt/ISO strings or integer) for topics over time. Defaults to NULL (topics over time disabled).

topics_over_time_nr_bins

Number of bins for topics_over_time.

topics_over_time_global_tuning

Logical. Whether to enable global tuning for topics_over_time.

topics_over_time_evolution_tuning

Logical. Whether to enable evolution tuning for topics_over_time.

classes

Optional vector of class labels (character or factor) for topics per class. Defaults to NULL (topics per class disabled).

compute_reduced_embeddings

Logical. If TRUE, computes 2D and 3D UMAP reductions.

reduced_embedding_n_neighbors

Number of neighbors for reduced embeddings.

reduced_embedding_min_dist

UMAP min_dist for reduced embeddings.

reduced_embedding_metric

UMAP metric for reduced embeddings.

compute_hierarchical_topics

Logical. If TRUE, computes hierarchical topics.

bertopic_args

Named list of extra arguments passed to BERTopic().
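
The umap_model, hdbscan_model, and vectorizer_model arguments accept pre-built Python objects. A minimal sketch of constructing them with reticulate, assuming the active Python environment (e.g. the one prepared by setup_python_environment()) provides the umap-learn, hdbscan, and scikit-learn packages:

umap_py <- reticulate::import("umap")
hdbscan_py <- reticulate::import("hdbscan")
sklearn_text <- reticulate::import("sklearn.feature_extraction.text")

# Mirror the function's defaults explicitly on the Python side
custom_umap <- umap_py$UMAP(
  n_neighbors = 15L, n_components = 5L,
  min_dist = 0, metric = "cosine", random_state = 42L
)
custom_hdbscan <- hdbscan_py$HDBSCAN(
  min_cluster_size = 50L, min_samples = 20L,
  metric = "euclidean", prediction_data = TRUE
)
custom_vectorizer <- sklearn_text$CountVectorizer(
  ngram_range = reticulate::tuple(1L, 3L),
  min_df = 2L, max_features = 10000L
)

fit <- train_bertopic_model(
  docs,
  umap_model = custom_umap,
  hdbscan_model = custom_hdbscan,
  vectorizer_model = custom_vectorizer
)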

Value

A list with elements model, topics, probabilities, embeddings, reduced_embeddings_2d, reduced_embeddings_3d, hierarchical_topics, topics_over_time, and topics_per_class.
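
For illustration, these elements can be accessed directly from the returned list; a short sketch, assuming fit was produced by a prior train_bertopic_model() call:

fit <- train_bertopic_model(docs)
head(fit$topics)        # topic assignment for each document
fit$probabilities       # topic probabilities (when calculate_probabilities = TRUE)
fit$topics_over_time    # only populated when timestamps were supplied
fit$topics_per_class    # only populated when classes were supplied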

Examples

if (FALSE) { # \dontrun{
setup_python_environment()
texts <- c("Cats are great pets", "Dogs are loyal companions", "Markets fluctuate")
fit <- train_bertopic_model(texts, embedding_model = "sentence-transformers/all-MiniLM-L6-v2")
visualize_topics(fit$model, filename = "intertopic_distance_map", auto_open = FALSE)
} # }
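
Two further sketches extending the example above; the Ollama model name, the timestamps, and the class labels below are illustrative placeholders, not package defaults.

if (FALSE) { # \dontrun{
# Custom stop words plus an Ollama-based topic representation
# ("llama3" is a placeholder; use any model served by your local Ollama instance)
fit_ollama <- train_bertopic_model(
  texts,
  stop_words = c("the", "and", "of"),
  ngram_range = c(1, 2),
  representation_model = "ollama",
  ollama_model = "llama3"
)

# Topics over time and topics per class: one timestamp and one class label
# per document (values are purely illustrative)
fit_time <- train_bertopic_model(
  texts,
  timestamps = as.Date(c("2021-06-01", "2022-06-01", "2023-06-01")),
  topics_over_time_nr_bins = 3L,
  classes = c("pets", "pets", "finance")
)
fit_time$topics_over_time
fit_time$topics_per_class
} # }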