
Synthesizer

Quick Summary

deepeval offers a data Synthesizer for anyone to easily generate evaluation datasets from scratch. The Synthesizer class is a synthetic data generator that first uses an LLM to generate a series of inputs, before evolving each input to make them more complex and realistic. These evolved inputs are then used to create a list of synthetic Goldens, which makes up your synthetic EvaluationDataset.

Did You Know?

deepeval's Synthesizer uses the data evolution method to generate large volumes of data across various complexity levels, making synthetic data more realistic. This method was originally introduced by the developers of Evol-Instruct and WizardLM.

For those interested, here is a great article on how deepeval's synthesizer was built.

Creating A Synthesizer

deepeval's Synthesizer can be used as a standalone or within an EvaluationDataset. To begin, create a Synthesizer:

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

There are three optional parameters when creating a Synthesizer:

  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to gpt-4o.
  • [Optional] multithreading: a boolean which when set to True, enables concurrent generation of goldens. Defaulted to True.
  • [Optional] embedder: a string specifying which of OpenAI's embedding models to use, OR any custom embedding model of type DeepEvalBaseEmbeddingModel. Defaulted to 'text-embedding-3-small'.
info

As you'll learn later, an embedding model is only used when using the generate_goldens_from_docs() method, so don't worry about the embedder parameter too much unless you're looking to use your own embedding model.

Using Synthesizer As A Standalone

There are four approaches through which deepeval's Synthesizer can generate synthetic Goldens:

  1. Generating synthetic Goldens using context extracted from documents.
  2. Generating synthetic Goldens from a list of provided contexts.
  3. Generating synthetic Goldens from a list of provided inputs.
  4. Generating synthetic Goldens from scratch.

1. Generating From Documents

To generate synthetic Goldens from documents, simply provide a list of document paths:

info

The generate_goldens_from_docs method employs a token-based text splitter to manage document chunking, meaning the chunk_size and chunk_overlap parameters do not guarantee exact context sizes. This approach is designed to ensure meaningful and coherent context extraction, but might lead to variations in the expected size of each context.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    max_goldens_per_document=2
)

There are one mandatory and seven optional parameters when using the generate_goldens_from_docs method:

  • document_paths: a list of strings, representing the paths to the documents from which contexts will be extracted. Supported document types include: .txt, .docx, and .pdf.
  • [Optional] include_expected_output: a boolean which when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to False.
  • [Optional] max_goldens_per_document: the maximum number of golden data points to be generated for each document. Defaulted to 5.
  • [Optional] chunk_size: an int specifying the target size of text chunks to be considered for context extraction within each document. As noted above, the token-based splitter does not guarantee exact sizes. Defaulted to 1024.
  • [Optional] chunk_overlap: an int that determines the overlap size between consecutive text chunks during context extraction. Defaulted to 0.
  • [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
  • [Optional] enable_breadth_evolve: a boolean which when set to True, introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted to False.
  • [Optional] evolution_types: a list of EvolutionType, specifying methods used during data evolution. Defaulted to all EvolutionTypes.
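To build intuition for why chunk_size and chunk_overlap are approximate rather than exact, here is a minimal, illustrative token-based splitter. This is not deepeval's actual implementation (which uses a real tokenizer); it only demonstrates that a fixed token budget with overlap yields chunks of varying character lengths:

```python
def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy token-based splitter: 'tokens' here are whitespace-separated words.

    Requires chunk_overlap < chunk_size so the window advances each step.
    """
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        # Stop once the window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks

text = "the quick brown fox jumps over the lazy dog near the river bank"
chunks = split_into_chunks(text, chunk_size=5, chunk_overlap=2)
# Consecutive chunks share their last/first 2 tokens, but their character
# lengths differ because word lengths differ.
```

Each chunk holds exactly 5 tokens (except possibly the last), yet the character counts vary, which is the variation the note above refers to.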

2. Generating From Provided Contexts

deepeval also allows you to generate synthetic Goldens from a manually provided list of contexts instead of generating them directly from your documents.

tip

This is especially helpful if you already have an indexed and/or embedded knowledge base. For example, if you already have documents parsed and stored in an existing vector database, you may consider handling the logic to retrieve text chunks yourself.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens(
    # Provide a list of context for synthetic data generation
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)

There are one mandatory and five optional parameters when using the generate_goldens method:

  • contexts: a list of contexts, where each context is itself a list of strings, ideally sharing a common theme or subject area.
  • [Optional] include_expected_output: a boolean which when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to False.
  • [Optional] max_goldens_per_context: the maximum number of golden data points to be generated from each context. Adjusting this parameter can influence the size of the resulting dataset. Defaulted to 2.
  • [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
  • [Optional] enable_breadth_evolve: a boolean indicating whether to enable breadth evolution strategies during data generation. When set to True, it introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted to False.
  • [Optional] evolution_types: a list of EvolutionType, specifying methods used during data evolution. Defaulted to all EvolutionTypes.
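If you already manage text chunks yourself (per the tip above), the only requirement is to hand generate_goldens a list of lists of strings, where each inner list shares a theme. Here is a hypothetical sketch of assembling that shape from a flat chunk store tagged with topic metadata (the chunk_store layout is an assumption for illustration, e.g. pulled from your own vector database):

```python
from collections import defaultdict

# Hypothetical flat chunk store: (topic, text) pairs, e.g. retrieved
# along with metadata from your existing vector database.
chunk_store = [
    ("astronomy", "The Earth revolves around the Sun."),
    ("chemistry", "Water freezes at 0 degrees Celsius."),
    ("astronomy", "Planets are celestial bodies."),
    ("chemistry", "The chemical formula for water is H2O."),
]

# Group chunks by topic so each context shares a common theme,
# which is the nested-list shape generate_goldens expects.
grouped = defaultdict(list)
for topic, text in chunk_store:
    grouped[topic].append(text)

contexts = list(grouped.values())
# → [["The Earth revolves around the Sun.", "Planets are celestial bodies."],
#    ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."]]
```

You would then pass `contexts` to `synthesizer.generate_goldens(contexts=contexts)` as in the example above.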

3. Generating From Provided Inputs

If your LLM application does not rely on a retrieval context, or if you simply wish to generate a synthetic dataset based on information outside your application's information database, deepeval also supports generating synthetic Goldens from an initial list of inputs, which serve as examples from which additional inputs will be generated.

info

While the previous methods first use an LLM to generate a series of inputs based on the provided context before evolving them, generate_goldens_from_inputs simply evolves the provided list of inputs into more complex and diverse Goldens. It's also important to note that this method will only populate the input field of each generated Golden.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_inputs(
    inputs=[
        "What is 2+2",
        "Give me the solution to 12/5",
        "5! = ?"
    ],
    num_evolutions=20
)

There are one mandatory and three optional parameters when using the generate_goldens_from_inputs method:

  • inputs: a list of strings, representing your initial list of example inputs.
  • [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
  • [Optional] enable_breadth_evolve: a boolean which when set to True, introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted to False.
  • [Optional] evolution_types: a list of InputEvolutionType, specifying methods used during data evolution. Defaulted to all InputEvolutionTypes.
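Conceptually, each evolution step asks an LLM to rewrite an input into a more complex or specific variant, and num_evolutions controls how many times this rewriting loop runs. As a rough, non-LLM illustration of the iteration only (the rewrite templates below are made up; deepeval's actual evolution uses Evol-Instruct-style LLM prompts, not string templates):

```python
import random

# Made-up "evolution operators" purely to illustrate the loop; the real
# Synthesizer prompts an LLM to rewrite the input at each step.
OPERATORS = [
    "Add a constraint: {input} Show your working step by step.",
    "Deepen it: {input} Also explain why your answer is correct.",
]

def evolve(input_text: str, num_evolutions: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    for _ in range(num_evolutions):
        template = rng.choice(OPERATORS)
        input_text = template.format(input=input_text)
    return input_text

evolved = evolve("What is 2+2", num_evolutions=2)
# The seed input is still embedded inside a longer, more demanding prompt.
```

With num_evolutions=20, as in the example above, each seed input is rewritten twenty times, which is why the resulting Goldens end up far more complex than the originals.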

4. Generating From Scratch

If you do not have a list of example inputs, or wish to solely rely on an LLM generation for synthesis, you can also generate synthetic Goldens simply by specifying the subject, task, and output format you wish your inputs to follow.

tip

This method is especially helpful when you wish to evaluate your LLM on a specific task, such as red-teaming or text-to-SQL use cases!

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_scratch(
    subject="Harmful and toxic prompts, with emphasis on dark humor",
    task="Red-team LLMs",
    output_format="string",
    num_initial_goldens=25,
    num_evolutions=20
)

There are four mandatory and three optional parameters when using the generate_goldens_from_scratch method:

  • subject: a string, specifying the subject and nature of your generated Goldens.
  • task: a string, representing the purpose of these evaluation Goldens.
  • output_format: a string, representing the expected output format. This is not equivalent to Python types but simply gives you more control over the structure of your synthetic data.
  • num_initial_goldens: the number of Goldens generated before subsequent evolutions.
  • [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
  • [Optional] enable_breadth_evolve: a boolean which when set to True, introduces a wider variety of context modifications, enhancing the dataset's diversity. Defaulted to False.
  • [Optional] evolution_types: a list of InputEvolutionType, specifying methods used during data evolution. Defaulted to all InputEvolutionTypes.

Saving Generated Goldens

To avoid accidentally losing any generated synthetic Goldens, use the save_as() method:

synthesizer.save_as(
    file_type='json',  # or 'csv'
    directory="./synthetic_data"
)

Using Synthesizer Within An Evaluation Dataset

An EvaluationDataset also has the generate_goldens_from_docs and generate_goldens methods, which under the hood are powered by the Synthesizer's implementation.

info

Except for an additional option to accept a custom Synthesizer as an argument, the generate_goldens_from_docs and generate_goldens methods in an EvaluationDataset accept the exact same arguments as those on a Synthesizer.

You can optionally specify a custom Synthesizer when calling generate_goldens_from_docs and generate_goldens through the EvaluationDataset interface if, for example, you wish to use a custom LLM to generate synthetic data. If no Synthesizer is provided, the default Synthesizer configuration is used.

To begin, optionally create a custom Synthesizer:

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-3.5-turbo")

Then, provide it as an argument to generate_goldens_from_docs:

from deepeval.dataset import EvaluationDataset
...

dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(
    synthesizer=synthesizer,
    document_paths=['example.pdf'],
)

Or, pass it to generate_goldens:

...

dataset.generate_goldens(
    synthesizer=synthesizer,
    contexts=[
        ["The Earth revolves around the Sun.", "Planets are celestial bodies."],
        ["Water freezes at 0 degrees Celsius.", "The chemical formula for water is H2O."],
    ]
)

Lastly, don't forget to call save_as() to preserve any generated synthetic Goldens:

saved_path = dataset.save_as(
    file_type='json',  # or 'csv'
    directory="./synthetic_data"
)
tip

The save_as() method returns a string containing the path the dataset was saved to, in case you need to use it in code later on.
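For example, you could reload the saved Goldens from that path later. The sketch below assumes the 'json' file type writes a JSON array of golden objects with at least an "input" field; verify against your actual generated file, since the exact schema is not documented here:

```python
import json

def load_goldens(saved_path: str) -> list[dict]:
    # Assumes save_as(file_type='json') wrote a JSON array of goldens;
    # inspect your generated file if the structure differs.
    with open(saved_path) as f:
        return json.load(f)

# goldens = load_goldens(saved_path)
```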

Using a Custom Embedding Model

info

Under the hood, only the generate_goldens_from_docs() method uses an embedding model. This is because, in order to generate Goldens from documents, the Synthesizer uses cosine similarity to extract the relevant contexts needed for data synthesis.
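Cosine similarity between two embedding vectors is their dot product divided by the product of their magnitudes. A minimal illustration of the measure itself (deepeval's internal context-selection logic is more involved; this only shows the similarity computation):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Chunks whose embeddings score close to 1.0 are semantically similar
# and can be grouped into the same context.
cosine_similarity([1.0, 0.0], [1.0, 0.0])  # → 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # → 0.0
```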

Using Azure OpenAI

You can use Azure's OpenAI embedding models by running the following commands in the CLI:

deepeval set-azure-openai --openai-endpoint=<endpoint> \
    --openai-api-key=<api_key> \
    --deployment-name=<deployment_name> \
    --openai-api-version=<openai_api_version> \
    --model-version=<model_version>

Then, run this to set the Azure OpenAI embedder:

deepeval set-azure-openai-embedding --embedding-deployment-name=<embedding_deployment_name>
Did You Know?

The first command configures deepeval to use Azure OpenAI LLM globally, while the second command configures deepeval to use Azure OpenAI's embedding models globally.

Using Any Custom Model

Alternatively, you can create a custom embedding model in code by inheriting the base DeepEvalBaseEmbeddingModel class. Here is an example of the same custom Azure OpenAI embedding model, but created in code instead using langchain's langchain_openai module:

from typing import List, Optional
from langchain_openai import AzureOpenAIEmbeddings
from deepeval.models import DeepEvalBaseEmbeddingModel

class CustomEmbeddingModel(DeepEvalBaseEmbeddingModel):
    def __init__(self):
        pass

    def load_model(self):
        return AzureOpenAIEmbeddings(
            openai_api_version="...",
            azure_deployment="...",
            azure_endpoint="...",
            openai_api_key="...",
        )

    def embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return embedding_model.embed_query(text)

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return embedding_model.embed_documents(texts)

    async def a_embed_text(self, text: str) -> List[float]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_query(text)

    async def a_embed_texts(self, texts: List[str]) -> List[List[float]]:
        embedding_model = self.load_model()
        return await embedding_model.aembed_documents(texts)

    def get_model_name(self):
        return "Custom Azure Embedding Model"

When creating a custom embedding model, you should ALWAYS:

  • inherit DeepEvalBaseEmbeddingModel.
  • implement the get_model_name() method, which simply returns a string representing your custom model name.
  • implement the load_model() method, which will be responsible for returning the model object instance.
  • implement the embed_text() method with one and only one parameter of type str as the text to be embedded, returning a vector of type List[float]. We called embedding_model.embed_query(text) to embed the text in this particular example, but this could be different depending on the implementation of your custom model object.
  • implement the embed_texts() method with one and only one parameter of type List[str] as the list of texts to be embedded, returning a list of vectors of type List[List[float]].
  • implement the asynchronous a_embed_text() and a_embed_texts() methods, with the same function signatures as their respective synchronous versions. Since these are asynchronous methods, remember to use async/await.
note

If an asynchronous version of your embedding model does not exist, simply reuse the synchronous implementation:

class CustomEmbeddingModel(DeepEvalBaseEmbeddingModel):
    ...
    async def a_embed_text(self, text: str) -> List[float]:
        return self.embed_text(text)

Lastly, provide the custom embedding model through the embedder parameter when creating a Synthesizer:

from deepeval.synthesizer import Synthesizer
...

synthesizer = Synthesizer(embedder=CustomEmbeddingModel())