Embeddings are numerical representations of concepts that let computers measure how those concepts relate to one another. OpenAI has developed various embedding models, with the latest being text-embedding-ada-002. This Answer compares this new model with OpenAI’s older embedding models, focusing on improvements, features, and use cases.
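To make the idea of "relationships between concepts" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors and their labels are made up for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: values chosen by hand).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```

A score near 1 means the vectors point in almost the same direction, which is read as the concepts being closely related.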
Feature / Model | text-embedding-ada-002 | Older OpenAI Embedding Models |
Model Architecture | Transformer-based | Transformer-based (ada, babbage, curie, and davinci sizes) |
Pretraining Data | Diverse and large-scale data | Smaller or domain-specific data |
Embedding Dimension | 1536 | Varies by model size (up to 12288 for davinci-001) |
Supported Languages | Multiple languages | Often English-only |
Fine-Tuning Capability | Not offered through the OpenAI API | Not offered through the OpenAI API |
Use Cases | General-purpose embeddings | Specific tasks or domains |
Performance | Improved accuracy & robustness | Varies based on model |
Availability | OpenAI API | OpenAI API |
The text-embedding-ada-002 model represents a significant leap in OpenAI’s embedding technology. It outperforms all the older embedding models on text search, code search, and sentence similarity tasks and achieves comparable performance on text classification. Its reported benchmark score of 53.3 surpasses those of the older models, which range from 49.0 to 52.8.
One of the standout features of text-embedding-ada-002 is the unification of capabilities. It replaces five separate models, which simplifies the interface, and it performs better across diverse benchmarks. The context length has been increased by a factor of four, from 2048 to 8192 tokens, and the new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings.
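The figures in this paragraph can be checked with a little arithmetic. The sketch below derives the davinci-001 dimension from the "one-eighth" statement and estimates the storage per vector. The 4 bytes per value is an assumption (32-bit floats); the actual storage format depends on your database or library.

```python
OLD_CONTEXT_TOKENS = 2048
NEW_CONTEXT_TOKENS = 8192
ADA_002_DIMS = 1536
# "One-eighth the size of davinci-001" implies 8 * 1536 dimensions.
DAVINCI_001_DIMS = ADA_002_DIMS * 8

BYTES_PER_VALUE = 4  # assumption: each component stored as a 32-bit float


def vector_bytes(dims, bytes_per_value=BYTES_PER_VALUE):
    """Approximate storage needed for one embedding vector."""
    return dims * bytes_per_value


print(NEW_CONTEXT_TOKENS // OLD_CONTEXT_TOKENS)  # 4
print(DAVINCI_001_DIMS)                          # 12288
print(vector_bytes(ADA_002_DIMS))                # 6144
print(vector_bytes(DAVINCI_001_DIMS))            # 49152
```

Smaller vectors matter in practice because vector databases store and compare one vector per document chunk, so an eightfold reduction in dimensions cuts both storage and comparison cost.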
Moreover, the price of the new embedding model is 90% lower than that of older models of the same size, and it matches or beats the performance of the older davinci models at a 99.8% lower price. However, it’s worth noting that the new model does not outperform text-similarity-davinci-001 on the SentEval linear-probing classification benchmark, so for tasks that train a lightweight classifier on top of the embeddings, a comparison with this older model might be necessary.
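To see what the two percentages mean for a bill, here is a small worked calculation. It assumes the 90% figure compares against an older model of the same size and the 99.8% figure against the older davinci models. The $500 starting amount is hypothetical and not an actual OpenAI price.

```python
def price_after_reduction(old_price, reduction_percent):
    """Return the price left after cutting old_price by reduction_percent."""
    return old_price * (1 - reduction_percent / 100)


# Hypothetical monthly spend on an older embedding model (not a real price).
hypothetical_old_bill = 500.00

same_size_bill = price_after_reduction(hypothetical_old_bill, 90)
vs_davinci_bill = price_after_reduction(hypothetical_old_bill, 99.8)

print(round(same_size_bill, 2))   # 50.0
print(round(vs_davinci_bill, 2))  # 1.0
```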
OpenAI’s older embedding models were organized by use case. There were three families of embedding models:
Text similarity
Text search
Code search
Each family was designed to capture specific aspects of semantic relationships, enabling applications such as analyzing astronomical reports, finding textbook content, and tagging customer call transcripts.
These older models achieved top performance in benchmarks like SentEval, BEIR, and CodeSearchNet. However, they were more complex, with various models catering to different functionalities, and they were also more expensive compared to the new model.
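The text search use case boils down to ranking documents by how close their embeddings are to a query embedding. The following is a rough sketch of that ranking step only. The document names and vectors are invented for illustration, and the vectors are assumed to be unit length, in which case the dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical unit-length document embeddings (assumption: made-up values).
docs = {
    "astronomy_report": [1.0, 0.0, 0.0],
    "textbook_chapter": [0.0, 1.0, 0.0],
    "call_transcript": [0.6, 0.0, 0.8],
}
query = [0.8, 0.0, 0.6]

results = rank_documents(query, docs)
print([doc_id for doc_id, _ in results])  # ['call_transcript', 'astronomy_report']
```

In a real pipeline, the document vectors would be computed once and stored, and only the query would be embedded at search time.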
Here’s a simple example of how you can query the new text-embedding-ada-002 model using Python:
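import os

import openai

# Read the API key from the SECRET_KEY environment variable.
openai.api_key = os.environ["SECRET_KEY"]

# This call uses the legacy interface of the openai package (versions before 1.0).
response = openai.Embedding.create(
    input="Educative answers section is helpful",
    model="text-embedding-ada-002",
)
print(response)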
Note: This code will only run once you provide your own OpenAI API key, here through the SECRET_KEY environment variable.
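The printed response is a nested structure, and the vector itself sits under the "data" list. The helper below is a sketch of how you might pull the vectors out. `extract_embeddings` is a hypothetical helper name, and `fake_response` is a hand-written stand-in with made-up, shortened values so the example runs without an API key. It assumes the dictionary-style access of the legacy openai package; in version 1.0 and later the same fields are read as attributes, for example `response.data[0].embedding`.

```python
def extract_embeddings(response):
    """Return the embedding vectors from a response, ordered by input index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


# Hand-written stand-in for an API response (assumption: values are made up
# and cut to 3 numbers; real text-embedding-ada-002 vectors have 1536).
fake_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
    "model": "text-embedding-ada-002",
}

vectors = extract_embeddings(fake_response)
print(vectors[0])    # [0.1, 0.2, 0.3]
print(len(vectors))  # 2
```

Sorting by "index" keeps each vector aligned with its input when several texts are sent in one request.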
The introduction of text-embedding-ada-002 marks a considerable advancement in OpenAI’s embedding models. With its improved performance, unified capabilities, longer context, smaller size, and reduced price, it offers a more powerful and cost-effective solution for various natural language processing and code tasks. The older models, while still valuable, are overshadowed by the new model’s efficiency and versatility.