From ES|QL to native Pandas dataframes in Python

As of Elasticsearch 8.15, and on Elasticsearch Serverless, ES|QL responses support the Apache Arrow streaming format. This blog post shows how to take advantage of it in Python. In an earlier blog post, I demonstrated how to convert ES|QL queries to Pandas dataframes using CSV as an intermediate representation. Unfortunately, CSV requires explicit type declarations, is slow (especially for larger datasets), and does not handle nested arrays and objects. Apache Arrow lifts all of these limitations.

Importing test data

First, let's import some test data. As before, we will be using the employees sample data and mappings. The easiest way to load this dataset is to run these two Elasticsearch API requests in the Kibana Console.

Converting dataset to a Pandas DataFrame object

OK, with that out of the way, let's convert the full employees dataset to a Pandas DataFrame object using the ES|QL Arrow export:

from elasticsearch import Elasticsearch
import pandas as pd

client = Elasticsearch(
    "https://[host].elastic-cloud.com",
    api_key="...",
)

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | LIMIT 500
    """,
    format="arrow",
)
df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)

Even though this dataset only contains 100 records, we use a LIMIT command to avoid ES|QL warning us about potentially missing records. This prints the following dataframe:

    avg_worked_seconds           birth_date  ...  salary still_hired
0            268728049  1953-09-02 00:00:00  ...   57305        True
1            328922887  1964-06-02 00:00:00  ...   56371        True
2            200296405  1959-12-03 00:00:00  ...   61805       False
3            311267831  1954-05-01 00:00:00  ...   36174        True
4            244294991  1955-01-21 00:00:00  ...   63528        True
..                 ...                  ...  ...     ...         ...
95           204381503  1954-09-16 00:00:00  ...   43889       False
96           206258084  1952-02-27 00:00:00  ...   71165       False
97           272392146  1961-09-23 00:00:00  ...   44817       False
98           377713748  1956-05-25 00:00:00  ...   73578        True
99           223910853  1953-04-21 00:00:00  ...   68431        True

[100 rows x 17 columns]

OK, so what actually happened here?

  • Given format="arrow", Elasticsearch returns binary data in the Arrow streaming format
  • The Elasticsearch Python client looks at the Content-Type header and creates a PyArrow object
  • Finally, PyArrow's Pandas integration converts the PyArrow object to a Pandas dataframe

Note that the types_mapper=pd.ArrowDtype parameter asks Pandas to use a PyArrow backend instead of the default NumPy backend, since the source data is PyArrow. While this backend is not enabled by default for compatibility reasons, it has many advantages: it handles missing values, is faster, is more interoperable, and supports more types. (This is not a zero-copy conversion, however.)

For this example to work, the Pandas and PyArrow optional dependencies need to be installed. If you want to use another dataframe library such as Polars instead, you don't need Pandas and can directly use polars.from_arrow to create a Polars DataFrame from the PyArrow table returned by the Elasticsearch client.

One limitation is that Elasticsearch does not currently handle multi-valued fields, which is why we had to drop the is_rehired, job_positions and salary_change columns. This limitation will be lifted in a future version of Elasticsearch.

Anyway, you now have a Pandas dataframe that you can use to analyze your data further. But you can also continue massaging the data using ES|QL, which is particularly useful when queries return more than 10,000 rows, the current maximum number of rows that ES|QL queries can return.

More complex queries

In the next example, we count how many employees speak a given language using STATS ... BY (not unlike GROUP BY in SQL), and then sort the result by the languages column using SORT:

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
)

df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)

Unlike with CSV, we did not have to specify any types, as Arrow data already includes types. Here's the result:

   count  languages
0     15          1
1     19          2
2     17          3
3     18          4
4     21          5
5     10       <NA>

21 employees speak 5 languages, wow! And 10 employees did not declare any spoken language. The missing value is denoted by <NA>, which is consistently used for missing data with the PyArrow backend. If we had used the NumPy backend instead, this column would have been converted to floats and the missing value would have been a confusing NaN, as NumPy integers don't have any sentinel value for missing data.

Queries with parameters

Finally, suppose that you want to expand the query from the previous section to only consider employees that speak N or more languages, with N being a variable parameter. For this we can use ES|QL's built-in support for parameters, which eliminates the risk of an injection attack associated with manually assembling queries with variable parts:

response = client.esql.query(
    query="""
    FROM employees
    | DROP is_rehired,job_positions,salary_change*
    | STATS count = COUNT(emp_no) BY languages
    | WHERE languages >= (?)
    | SORT languages
    | LIMIT 500
    """,
    format="arrow",
    params=[3],
)

df = response.to_pandas(types_mapper=pd.ArrowDtype)
print(df)

which prints the following:

   count  languages
0     17          3
1     18          4
2     21          5

Conclusion

As we saw, ES|QL's native Arrow support makes working with Pandas and other dataframe libraries even nicer than using CSV, and it will continue to improve over time, with multi-value support coming in a future version of Elasticsearch.

Additional resources

If you want to learn more about ES|QL, the ES|QL documentation is the best place to start. You can also check out this other Python example using Boston Celtics data. To know more about the Python Elasticsearch client itself, you can refer to the documentation, ask a question on Discuss with the language-clients tag or open a new issue if you found a bug or have a feature request. Thank you!
