As part of our ongoing commitment to serve Microsoft Azure developers with the tools of their choice, we are happy to announce that Elasticsearch now integrates the hosted model catalog on Microsoft Azure AI Studio into our open inference API. This complements the existing ability for developers to bring their Elasticsearch vector database to Azure OpenAI.
Developers can use the capabilities of the world's most downloaded vector database to store and use embeddings generated by OpenAI models from Azure AI Studio, or draw on the wide array of chat completion model deployments for quick access to conversational models like mistral-small.
Just recently we've added support for Azure OpenAI text embeddings and completion, and now we've added support for utilizing Azure AI Studio. Microsoft Azure developers have complete access to Azure OpenAI & Microsoft Azure AI Studio service capabilities and can bring their Elasticsearch data to revolutionize conversational search.
Let's walk you through just how easily you can use these capabilities with Elasticsearch.
Deploying a model in Azure AI Studio
To get started, you'll need a Microsoft Azure subscription as well as access to Azure AI Studio. Once you are set up, you'll need to deploy either a text embedding model or a chat completion model from the Azure AI Studio model catalog. Once your model is deployed, on the deployment overview page take note of the target URL and your deployment's API key - you'll need these later to create your inference endpoint in Elasticsearch.
Furthermore, when you deploy your model, Azure offers two deployment options: a “pay as you go” model (where you pay by the token), and a “realtime” deployment, which is a dedicated VM billed by the hour. Not all models have both deployment types available, so be sure to also note which deployment type you used.
Creating an Inference API Endpoint in Elasticsearch
Once your model is deployed, we can now create an endpoint for your inference task in Elasticsearch. For the examples below we are using the Cohere Command R model to perform chat completion.
In Elasticsearch, create your endpoint by providing the service as “azureaistudio”, and the service settings including your API key and target from your deployed model. You'll also need to provide the model provider, as well as the endpoint type from before (either “token” or “realtime”). In our example, we've deployed a Cohere model with a token type endpoint.
PUT _inference/completion/test_cohere_chat_completion
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "<<API_KEY>>",
    "target": "<<TARGET_URL>>",
    "provider": "cohere",
    "endpoint_type": "token"
  }
}
When you send Elasticsearch the command, it should return the created model to confirm that the request succeeded. Note that the API key is never returned; it is stored in Elasticsearch's secure settings.
{
  "model_id": "test_cohere_chat_completion",
  "task_type": "completion",
  "service": "azureaistudio",
  "service_settings": {
    "target": "<<TARGET_URL>>",
    "provider": "cohere",
    "endpoint_type": "token"
  },
  "task_settings": {}
}
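If you would rather assemble this request from a script than type it into the console, the following is a minimal Python sketch. The helper name `build_endpoint_request` and its validation are our own illustration, not part of any Elastic client; it only reproduces the path and body shown above, and it assumes that “token” and “realtime” are the only endpoint types, as described in this post.

```python
import json

# Endpoint types named in this post: "token" (pay as you go)
# and "realtime" (dedicated VM billed by the hour).
ENDPOINT_TYPES = {"token", "realtime"}


def build_endpoint_request(task_type, inference_id, api_key, target,
                           provider, endpoint_type):
    """Return (path, body) for creating an azureaistudio inference endpoint.

    Hypothetical helper: it mirrors the console request above and
    does not contact Elasticsearch.
    """
    if endpoint_type not in ENDPOINT_TYPES:
        raise ValueError(
            f"endpoint_type must be one of {sorted(ENDPOINT_TYPES)}"
        )
    path = f"_inference/{task_type}/{inference_id}"
    body = {
        "service": "azureaistudio",
        "service_settings": {
            "api_key": api_key,
            "target": target,
            "provider": provider,
            "endpoint_type": endpoint_type,
        },
    }
    return path, body


path, body = build_endpoint_request(
    "completion",
    "test_cohere_chat_completion",
    api_key="<<API_KEY>>",
    target="<<TARGET_URL>>",
    provider="cohere",
    endpoint_type="token",
)
print("PUT", path)
print(json.dumps(body, indent=2))
```

Rejecting an unknown endpoint type before the request is sent is a design choice of this sketch; Elasticsearch performs its own validation when the endpoint is created.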
Adding a model for text embeddings is just as easy. For reference, if we had deployed the Cohere-embed-v3-english model, we could create our inference model in Elasticsearch with the “text_embedding” task type by providing the appropriate API key and target URL from that deployment's overview page:
PUT _inference/text_embedding/test_cohere_embeddings
{
  "service": "azureaistudio",
  "service_settings": {
    "api_key": "<<API_KEY>>",
    "target": "<<TARGET_URL>>",
    "provider": "cohere",
    "endpoint_type": "token"
  }
}
Let's perform some inference
That's all there is to setting up your model. Now we can put it to use. First, let's test the model by asking it to complete a simple prompt. To do this, we'll call the _inference API with our input text:
POST _inference/completion/test_cohere_chat_completion
{
  "input": "The answer to the universe is"
}
And we should see Elasticsearch provide a response. Behind the scenes, Elasticsearch calls out to Azure AI Studio with the input text and processes the results of the inference. In this case, we received the response:
{
  "completion": [
    {
      "result": "42. \n\nIn Douglas Adams' *The Hitchhiker's Guide to the Galaxy*, a super-computer named Deep Thought is asked what the answer to the ultimate question of life, the universe, and everything is. After calculating for 7.5-million years, Deep Thought announces that the answer is 42. \n\nThe number 42 has since become a reference to the novel, and many fans of the book series speculate as to what the actual question might be."
    }
  ]
}
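If you consume this response in application code, the generated text sits under the "completion" array in the "result" key of each entry. Here is a small Python sketch of pulling it out; the function name `completion_texts` is hypothetical, and the sample response is shortened for illustration.

```python
def completion_texts(response):
    """Return the generated strings from a completion inference response.

    Assumes the response shape shown above: a "completion" list whose
    items each carry a "result" string.
    """
    return [item["result"] for item in response.get("completion", [])]


# Shortened stand-in for the response above (illustrative only).
sample_response = {"completion": [{"result": "42."}]}

for text in completion_texts(sample_response):
    print(text)
```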
We've tried to spare the end user the technical details behind the scenes, but you can also control inference more closely by passing additional parameters, such as the sampling temperature and the maximum number of tokens to generate:
POST _inference/completion/test_cohere_chat_completion
{
  "input": "The answer to the universe is",
  "task_settings": {
    "temperature": 1.0,
    "do_sample": true,
    "max_new_tokens": 50
  }
}
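The same call can be made from a script. The Python sketch below uses only the standard library; `build_inference_body` and `post_inference` are hypothetical helper names, and the cluster URL and Elasticsearch API key passed to `post_inference` are placeholders you would supply yourself. Only the body builder runs here, so nothing is sent over the network.

```python
import json
import urllib.request


def build_inference_body(prompt, **task_settings):
    """Build the request body; task_settings are included only when given."""
    body = {"input": prompt}
    if task_settings:
        body["task_settings"] = task_settings
    return body


def post_inference(es_url, es_api_key, inference_id, body):
    """POST body to _inference/completion/<inference_id>; return parsed JSON.

    es_url and es_api_key are assumptions about your own deployment.
    """
    request = urllib.request.Request(
        f"{es_url.rstrip('/')}/_inference/completion/{inference_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {es_api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_inference_body(
    "The answer to the universe is",
    temperature=1.0,
    do_sample=True,
    max_new_tokens=50,
)
print(json.dumps(body, indent=2))
```

Calling `post_inference("<<ES_URL>>", "<<ES_API_KEY>>", "test_cohere_chat_completion", body)` with real values would then send the request shown above.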
That was easy. What else can we do?
This becomes even more powerful when we use our new model in other ways, such as adding generated text to a document as it passes through an Elasticsearch ingest pipeline. For example, with the following pipeline definition, whenever a document is ingested through the pipeline, the text in the “question_field” field is sent through the inference API and the response is written to the document's “completed_text_answer” field. This allows large batches of documents to be augmented.
PUT _ingest/pipeline/azure_ai_studio_cohere_completions
{
  "processors": [
    {
      "inference": {
        "model_id": "test_cohere_chat_completion",
        "input_output": {
          "input_field": "question_field",
          "output_field": "completed_text_answer"
        }
      }
    }
  ]
}
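Before pointing real traffic at a pipeline like this, it can help to try it on a sample document. The Python sketch below builds the pipeline definition above and a request body for the ingest simulate API (`POST _ingest/pipeline/azure_ai_studio_cohere_completions/_simulate`). The helper names and the sample question are our own, and the sketch only prints the JSON; it does not send anything to Elasticsearch.

```python
import json


def build_completion_pipeline(model_id, input_field, output_field):
    """Return a pipeline definition with a single inference processor."""
    return {
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    "input_output": {
                        "input_field": input_field,
                        "output_field": output_field,
                    },
                }
            }
        ]
    }


def build_simulate_body(sources):
    """Wrap sample documents in the shape the simulate API expects."""
    return {"docs": [{"_source": source} for source in sources]}


pipeline = build_completion_pipeline(
    "test_cohere_chat_completion",
    "question_field",
    "completed_text_answer",
)
# Hypothetical sample document for a dry run.
simulate_body = build_simulate_body(
    [{"question_field": "What is Elasticsearch?"}]
)

print(json.dumps(pipeline, indent=2))
print(json.dumps(simulate_body, indent=2))
```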
Limitless possibilities
By harnessing the power of Azure AI Studio deployed models in your Elasticsearch inference pipelines, you can enhance your search experience's natural language processing and predictive analytics capabilities.
In upcoming versions of Elasticsearch, users will be able to take advantage of new field mapping types that simplify the process even further, so that designing an ingest pipeline is no longer necessary. Also, as alluded to in our accelerated roadmap for semantic search, the future will bring dramatically simplified support for inference tasks with Elasticsearch retrievers at query time.
These capabilities are available through the open inference API in our stateless offering on Elastic Cloud. They will also soon be available to everyone in an upcoming versioned Elasticsearch release.