Create an Elasticsearch inference endpoint | Elasticsearch API documentation (v9)

Path parameters

task_typestring Required
The type of the inference task that the model will perform.
Values are rerank, sparse_embedding, or text_embedding.
elasticsearch_inference_idstring Required
The unique identifier of the inference endpoint. The must not match the model_id.

application/json

Body

chunking_settingsobject
Hide chunking_settings attributes Show chunking_settings attributes object
- max_chunk_sizenumber
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
- overlapnumber
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
- sentence_overlapnumber
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
- strategystring
  The chunking strategy: sentence or word.
servicestring Required
Value is elasticsearch.
service_settingsobject Required
Hide service_settings attributes Show service_settings attributes object
- adaptive_allocationsobject
  Hide adaptive_allocations attributes Show adaptive_allocations attributes object
- deployment_idstring
  The deployment identifier for a trained model deployment. When deployment_id is used the model_id is optional.
- model_idstring Required
  The name of the model to use for the inference task. It can be the ID of a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model that was uploaded by using the Eland client.
  External documentation
- num_allocationsnumber
  The total number of allocations that are assigned to the model across machine learning nodes. Increasing this value generally increases the throughput. If adaptive allocations are enabled, do not set this value because it's automatically set.
- num_threadsnumber Required
  The number of threads used by each model allocation during inference. This setting generally increases the speed per inference request. The inference process is a compute-bound process; threads_per_allocations must not exceed the number of available allocated processors per node. The value must be a power of 2. The maximum value is 32.
task_settingsobject
Hide task_settings attribute Show task_settings attribute object
- return_documentsboolean
  For a rerank task, return the document instead of only the index.

Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task. The `model_id` must be the ID of one of the built-in ELSER models. The API will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "adaptive_allocations": { 
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
        },
        "num_threads": 1,
        "model_id": ".elser_model_2" 
    }
}

Run `PUT _inference/rerank/my-elastic-rerank` to create an inference endpoint that performs a rerank task using the built-in Elastic Rerank cross-encoder model. The `model_id` must be `.rerank-v1`, which is the ID of the built-in Elastic Rerank model. The API will automatically download the Elastic Rerank model if it isn't already downloaded and then deploy the model. Once deployed, the model can be used for semantic re-ranking with a `text_similarity_reranker` retriever.

{
    "service": "elasticsearch",
    "service_settings": {
        "model_id": ".rerank-v1", 
        "num_threads": 1,
        "adaptive_allocations": { 
        "enabled": true,
        "min_number_of_allocations": 1,
        "max_number_of_allocations": 4
        }
    }
}

Run `PUT _inference/text_embedding/my-e5-model` to create an inference endpoint that performs a `text_embedding` task. The `model_id` must be the ID of one of the built-in E5 models. The API will automatically download the E5 model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": ".multilingual-e5-small" 
    }
}

Run `PUT _inference/text_embedding/my-msmarco-minilm-model` to create an inference endpoint that performs a `text_embedding` task with a model that was uploaded by Eland.

{
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": "msmarco-MiniLM-L12-cos-v5" 
    }
}

Run `PUT _inference/text_embedding/my-e5-model` to create an inference endpoint that performs a `text_embedding` task and to configure adaptive allocations. The API request will automatically download the E5 model if it isn't already downloaded and then deploy the model.

{
    "service": "elasticsearch",
    "service_settings": {
        "adaptive_allocations": {
        "enabled": true,
        "min_number_of_allocations": 3,
        "max_number_of_allocations": 10
        },
        "num_threads": 1,
        "model_id": ".multilingual-e5-small"
    }
}

Run `PUT _inference/sparse_embedding/use_existing_deployment` to use an already existing model deployment when creating an inference endpoint.

{
    "service": "elasticsearch",
    "service_settings": {
        "deployment_id": ".elser_model_2"
    }
}

Response examples (200)

A successful response from `PUT _inference/sparse_embedding/use_existing_deployment`. It contains the model ID and the threads and allocations settings from the model deployment.

{
  "inference_id": "use_existing_deployment",
  "task_type": "sparse_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 2,
    "num_threads": 1,
    "model_id": ".elser_model_2",
    "deployment_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}

Path parameters

Body

Responses