Create an ELSER inference endpoint | Elasticsearch API documentation (v9)

Path parameters

task_type string Required
The type of the inference task that the model will perform.
Value is sparse_embedding.
elser_inference_id string Required
The unique identifier of the inference endpoint.

application/json

Body

chunking_settings object
- max_chunk_size number
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
- overlap number
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
- sentence_overlap number
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
- strategy string
  The chunking strategy: sentence or word.
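
The chunking constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch in Python, not part of any Elasticsearch client: the function name `validate_chunking_settings` is hypothetical, and it assumes `strategy` is given explicitly because this page does not state the defaults. The server remains the authority on what it accepts.

```python
def validate_chunking_settings(settings: dict) -> list:
    """Return a list of problems; an empty list means no documented rule is violated."""
    errors = []
    strategy = settings.get("strategy")
    if strategy not in ("sentence", "word"):
        # Assumption: the caller always names the strategy explicitly.
        return ["strategy must be 'sentence' or 'word'"]

    max_chunk_size = settings.get("max_chunk_size")
    if max_chunk_size is not None:
        # Lower bound depends on the strategy; upper bound is always 300.
        lower = 20 if strategy == "sentence" else 10
        if not lower <= max_chunk_size <= 300:
            errors.append(
                f"max_chunk_size must be between {lower} and 300 for the {strategy} strategy"
            )

    if "overlap" in settings:
        if strategy != "word":
            errors.append("overlap applies only to the word strategy")
        elif max_chunk_size is not None and settings["overlap"] > max_chunk_size / 2:
            errors.append("overlap cannot exceed half of max_chunk_size")

    if "sentence_overlap" in settings:
        if strategy != "sentence":
            errors.append("sentence_overlap applies only to the sentence strategy")
        elif settings["sentence_overlap"] not in (0, 1):
            errors.append("sentence_overlap must be 0 or 1")

    return errors
```

For example, `validate_chunking_settings({"strategy": "word", "max_chunk_size": 100, "overlap": 60})` reports one problem, because the overlap is more than half of the chunk size.
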
service string Required
Value is elser.
service_settings object Required
- adaptive_allocations object
- num_allocations number Required
  The total number of allocations assigned to this model across machine learning nodes. Increasing this value generally increases throughput. If adaptive allocations are enabled, do not set this value; it is set automatically.
- num_threads number Required
  The number of threads used by each model allocation during inference. Increasing this value generally increases the speed per inference request. Inference is compute-bound, so this value must not exceed the number of allocated processors available per node. The value must be a power of 2. The maximum value is 32.
  
  If you want to optimize your ELSER endpoint for ingest, set the number of threads to 1. If you want to optimize your ELSER endpoint for search, set the number of threads to greater than 1.
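
The `num_threads` rules above (a power of 2, at most 32, 1 for ingest and more than 1 for search) can be sketched as a small client-side helper. The function names are hypothetical, and the default of 2 threads for search is only an illustrative assumption; the right value depends on the processors available on your machine learning nodes.

```python
def check_num_threads(num_threads: int) -> None:
    """Raise ValueError if num_threads breaks the documented rules."""
    if num_threads < 1 or num_threads > 32:
        raise ValueError("num_threads must be between 1 and 32")
    # A positive integer is a power of 2 exactly when it has a single set bit.
    if num_threads & (num_threads - 1) != 0:
        raise ValueError("num_threads must be a power of 2")


def service_settings_for(workload: str, num_allocations: int = 1,
                         search_threads: int = 2) -> dict:
    """Build a service_settings object tuned for 'ingest' or 'search'."""
    if workload == "ingest":
        num_threads = 1  # documented recommendation for ingest
    elif workload == "search":
        num_threads = search_threads  # assumption: any valid value greater than 1
        if num_threads <= 1:
            raise ValueError("use more than 1 thread when optimizing for search")
    else:
        raise ValueError("workload must be 'ingest' or 'search'")
    check_num_threads(num_threads)
    return {"num_allocations": num_allocations, "num_threads": num_threads}
```
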

Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task. The request will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

{
    "service": "elser",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1
    }
}
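
As a rough illustration of how the request above could be issued from code, the sketch below builds the `PUT` request with Python's standard library. The base URL `http://localhost:9200` and the helper name `build_create_request` are assumptions for illustration; a real cluster will usually also require authentication and TLS, which are omitted here.

```python
import json
import urllib.request


def build_create_request(base_url: str, inference_id: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates a sparse_embedding endpoint."""
    url = f"{base_url.rstrip('/')}/_inference/sparse_embedding/{inference_id}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_request(
    "http://localhost:9200",  # assumption: local, unsecured cluster
    "my-elser-model",
    {"service": "elser", "service_settings": {"num_allocations": 1, "num_threads": 1}},
)

# To send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the request may trigger a model download and deployment, the call can take noticeably longer than a typical API request.
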

Run `PUT _inference/sparse_embedding/my-elser-model` to create an inference endpoint that performs a `sparse_embedding` task with adaptive allocations. When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.

{
    "service": "elser",
    "service_settings": {
        "adaptive_allocations": {
            "enabled": true,
            "min_number_of_allocations": 3,
            "max_number_of_allocations": 10
        },
        "num_threads": 1
    }
}
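
A request body like the one above can also be assembled programmatically. The following sketch uses a hypothetical helper, `adaptive_elser_body`, that leaves out `num_allocations` because it is set automatically when adaptive allocations are enabled. The check that the minimum does not exceed the maximum is a sanity check assumed here, not a rule stated on this page.

```python
def adaptive_elser_body(min_allocations: int, max_allocations: int,
                        num_threads: int = 1) -> dict:
    """Build a request body for an ELSER endpoint with adaptive allocations."""
    if min_allocations > max_allocations:
        # Assumption: a minimum above the maximum is treated as a caller error.
        raise ValueError("min_allocations cannot exceed max_allocations")
    return {
        "service": "elser",
        "service_settings": {
            "adaptive_allocations": {
                "enabled": True,
                "min_number_of_allocations": min_allocations,
                "max_number_of_allocations": max_allocations,
            },
            # num_allocations is deliberately omitted: it is set automatically.
            "num_threads": num_threads,
        },
    }


body = adaptive_elser_body(3, 10)
```
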

Response examples (200)

A successful response when creating an ELSER inference endpoint.

{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
