text-embedding-ada-002 vs OpenAI's older embedding models

Embeddings are numerical representations of concepts, allowing computers to understand their relationships. OpenAI has developed various embedding models, with the latest being text-embedding-ada-002. This Answer compares this new model with OpenAI’s older embedding models, focusing on improvements, features, and use cases.
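To make "understanding relationships" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors below are invented purely for illustration; real text-embedding-ada-002 vectors are much longer and come from the API.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))
```

A score close to 1 means the vectors point in nearly the same direction, which is read as the two concepts being closely related.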

text-embedding-ada-002

| Feature | text-embedding-ada-002 | Older OpenAI embedding models |
|---|---|---|
| Model architecture | Transformer-based | Transformer-based (Ada, Babbage, Curie, and Davinci sizes) |
| Pretraining data | Diverse and large-scale data | Smaller or domain-specific data |
| Embedding dimension | 1536 | Varies by model size (1024 for Ada up to 12288 for Davinci) |
| Supported languages | Multiple languages | Often English-only |
| Fine-tuning capability | Not offered through the OpenAI API | Not offered through the OpenAI API |
| Use cases | General-purpose embeddings | Separate models for specific tasks |
| Performance | Improved accuracy and robustness | Varies by model |
| Availability | OpenAI API | OpenAI API |

The text-embedding-ada-002 model represents a significant leap in OpenAI’s embedding technology. It outperforms all of the older embedding models on text search, code search, and sentence similarity tasks and achieves comparable performance on text classification. Its benchmark score of 53.3 exceeds those of the older models, which range from 49.0 to 52.8.

One of the standout features of text-embedding-ada-002 is the unification of capabilities. It replaces five separate models, which simplifies the interface, and it performs better across diverse benchmarks. The context length has increased by a factor of four, from 2048 to 8192 tokens, and the new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings.
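The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below derives the davinci-001 dimension from the "one-eighth" statement and estimates raw storage for one million vectors. Storing each value as a 32-bit float is an assumption made for the estimate, not something the API dictates.

```python
ADA_002_DIMS = 1536
DAVINCI_001_DIMS = ADA_002_DIMS * 8   # implied by "one-eighth the size"
OLD_CONTEXT, NEW_CONTEXT = 2048, 8192
BYTES_PER_FLOAT32 = 4                 # assumption: vectors stored as float32

def storage_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of num_vectors embeddings, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value

context_factor = NEW_CONTEXT // OLD_CONTEXT
ada_gb = storage_bytes(1_000_000, ADA_002_DIMS) / 1e9
davinci_gb = storage_bytes(1_000_000, DAVINCI_001_DIMS) / 1e9

print(f"context length grew by a factor of {context_factor}")
print(f"davinci-001 dimension: {DAVINCI_001_DIMS}")
print(f"1M vectors: {ada_gb:.1f} GB (ada-002) vs {davinci_gb:.1f} GB (davinci-001)")
```

Smaller vectors reduce both storage and the cost of each similarity comparison, which matters most when searching large collections.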

Moreover, OpenAI reduced the price of the new embedding model by 90% compared with older models of the same size, and the new model achieves better or similar performance than the older Davinci models at a 99.8% lower price. However, it’s worth noting that the new model does not outperform text-similarity-davinci-001 on certain benchmarks, so for specific tasks, a comparison with this older model might be necessary.
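Both percentages are relative reductions, computed as (old price − new price) / old price. The sketch below shows the arithmetic. The per-unit prices are hypothetical placeholders chosen only so that the calculation reproduces 90% and 99.8%; they are not quoted from any price list.

```python
def percent_reduction(old_price, new_price):
    """Relative price reduction, expressed as a percentage of the old price."""
    if old_price <= 0:
        raise ValueError("old_price must be positive")
    return (old_price - new_price) / old_price * 100

# Hypothetical prices per unit of text, for illustration only.
new_price = 0.0004
same_size_old_price = 0.004   # placeholder: 10x the new price
largest_old_price = 0.2       # placeholder: 500x the new price

print(round(percent_reduction(same_size_old_price, new_price), 1))
print(round(percent_reduction(largest_old_price, new_price), 1))
```

A 99.8% reduction means the new price is 1/500 of the old one, so the same budget covers roughly 500 times as much text.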

Older embedding models

OpenAI’s older embedding models were split by use case, with a different model for each task. There were three families of embedding models:

  • Text similarity

  • Text search

  • Code search

Each family was designed to capture specific aspects of semantic relationships, enabling applications such as analyzing astronomical reports, finding textbook content, and tagging customer call transcripts.
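As an illustration of the text search use case, the following sketch ranks documents by how similar their embeddings are to a query embedding. The document names and three-dimensional vectors are hypothetical stand-ins; in practice, each vector would come from an embedding model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    q = normalize(query_vec)
    scores = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scores.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings, invented for this example.
documents = {
    "astronomy_report": [0.9, 0.1, 0.0],
    "textbook_chapter": [0.1, 0.9, 0.1],
    "call_transcript": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranking = rank_documents(query, documents)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The same ranking idea applies to code search, with code snippets embedded instead of prose documents.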

These older models achieved top performance on benchmarks such as SentEval, BEIR, and CodeSearchNet. However, they were more complex to work with, because separate models served different functions, and they were more expensive than the new model.

Querying the new model

Here’s a simple example of how you can query the new text-embedding-ada-002 model using Python:

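import os

import openai  # this example uses the pre-1.0 interface of the openai package

# Read the API key from an environment variable instead of hard-coding it.
openai.api_key = os.environ["SECRET_KEY"]

# Request an embedding for a single input string.
response = openai.Embedding.create(
    input="Educative answers section is helpful",
    model="text-embedding-ada-002",
)

# The response holds the embedding vector along with usage metadata.
print(response)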

Note: This code runs only after you store your own OpenAI API key in the SECRET_KEY environment variable.
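Printing the whole response shows more than the vector itself. The sketch below pulls the embedding vectors out of a response-shaped dictionary. The helper name extract_embeddings and the shortened fake_response are hypothetical and exist only so the example runs without an API key; the layout (a "data" list whose items carry "index" and "embedding") mirrors what the embeddings endpoint returns.

```python
def extract_embeddings(response):
    """Return the embedding vectors from a response, ordered by input index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# Hypothetical, shortened stand-in for an API response (real vectors
# from text-embedding-ada-002 have 1536 values each).
fake_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "text-embedding-ada-002",
    "usage": {"prompt_tokens": 8, "total_tokens": 8},
}

vectors = extract_embeddings(fake_response)
print(len(vectors), len(vectors[0]))
```

With a real response, the same access pattern yields one vector per input string, ready to store or compare.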

Conclusion

The introduction of text-embedding-ada-002 marks a considerable advancement in OpenAI’s embedding models. With its improved performance, unified capabilities, longer context, smaller size, and reduced price, it offers a more powerful and cost-effective solution for various natural language processing and code tasks. The older models, while still valuable, are overshadowed by the new model’s efficiency and versatility.

Copyright ©2024 Educative, Inc. All rights reserved