Daniel Doubrovkine

aka dB., @awscloud, former CTO @artsy, +@vestris, NYC


Indexing and searching the text inside PDFs and other binary documents is a neat feature available in OpenSearch via the optional ingest-attachment plugin. The plugin is installed on AWS domains by default.

Download OpenSearch, install the ingest-attachment plugin, and start it.

wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.10.0/opensearch-2.10.0-linux-x64.tar.gz
tar -xvzf opensearch-2.10.0-linux-x64.tar.gz
cd opensearch-2.10.0/
./bin/opensearch-plugin install ingest-attachment
./opensearch-tar-install.sh
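
To confirm the plugin actually got installed, the plugin list can be checked from a script. This is a minimal sketch, assuming `opensearch-plugin list` prints one plugin name per line; `has_plugin` is a hypothetical helper name, not an OpenSearch command.

```shell
# has_plugin is a hypothetical helper: it succeeds if the plugin name given
# as $1 appears as a whole line on stdin.
# Assumption: the input has one plugin name per line.
has_plugin() {
  grep -qxF -- "$1"
}

# Usage, from the OpenSearch directory:
#   ./bin/opensearch-plugin list | has_plugin ingest-attachment && echo "installed"
```

Once the cluster is up, `curl -k -u admin:admin https://localhost:9200/_cat/plugins` shows the same information per node.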

I’m using OpenSearch 2.10.

$ curl -u admin:admin -k https://localhost:9200 | jq
{
  "name": "ip-172-31-42-1",
  "cluster_name": "opensearch",
  "cluster_uuid": "gm4le40_R1eKzSDukDFWkA",
  "version": {
    "distribution": "opensearch",
    "number": "2.10.0",
    "build_type": "tar",
    "build_hash": "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date": "2023-09-20T23:54:29.889267151Z",
    "build_snapshot": false,
    "lucene_version": "9.7.0",
    "minimum_wire_compatibility_version": "7.10.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}

Create an ingest pipeline.

$ curl -k -u admin:admin -X PUT -H "Content-type:application/json" --data '{"description":"Extract","processors":[{"attachment":{"field":"data","indexed_chars":-1}}]}' https://localhost:9200/_ingest/pipeline/attachment | jq
{
    "acknowledged": true
}
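
The pipeline can be tried out without indexing anything by using the ingest simulate API. The following is a rough sketch, assuming the same local cluster and `admin:admin` credentials as above; `simulate_payload` and `simulate_attachment` are helper names made up for illustration.

```shell
# Hypothetical helpers, not part of OpenSearch.

# Build a _simulate request body with a single document whose "data" field
# holds the base64-encoded contents of the file given as $1.
simulate_payload() {
  local file="$1"
  # tr removes the line wrapping that base64 adds by default.
  printf '{"docs":[{"_source":{"data":"%s"}}]}' "$(base64 < "$file" | tr -d '\n')"
}

# Send the payload through the "attachment" pipeline in simulate mode.
# The body is streamed on stdin, so large files do not hit argument-length limits.
simulate_attachment() {
  simulate_payload "$1" | curl -s -k -u admin:admin -X POST \
    -H "Content-Type: application/json" --data-binary @- \
    "https://localhost:9200/_ingest/pipeline/attachment/_simulate"
}
```

Calling `simulate_attachment some-file.pdf | jq` should return the document as the pipeline would transform it, including the extracted `attachment` fields.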

Download a dummy PDF.

$ wget https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf

Ingest the PDF.

$ curl -k -u admin:admin -X PUT -H "Content-type:application/json" --data '{"filename":"dummy.pdf","title":"Dummy PDF","data":"'"$(base64 -w 0 dummy.pdf)"'"}' 'https://localhost:9200/my_index/_doc/1?pipeline=attachment' | jq
{
    "_index": "my_index",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 0,
    "_primary_term": 1
}
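
The request above stores the whole base64 payload in the document's `data` field next to the extracted text. If only the text is needed, a `remove` processor can be chained after the attachment processor. This is a sketch of that variation, not what the rest of this post uses; the pipeline name `attachment-no-data` and the helper `create_pipeline` are hypothetical.

```shell
# Hypothetical pipeline name; any unused name works.
PIPELINE_NAME="attachment-no-data"

# Same attachment processor as before, followed by a remove processor that
# deletes the base64 "data" field once the text has been extracted.
PIPELINE_BODY='{"description":"Extract and drop raw data","processors":[{"attachment":{"field":"data","indexed_chars":-1}},{"remove":{"field":"data"}}]}'

# Create (or overwrite) the pipeline on the local cluster.
create_pipeline() {
  curl -s -k -u admin:admin -X PUT -H "Content-Type: application/json" \
    --data "$PIPELINE_BODY" \
    "https://localhost:9200/_ingest/pipeline/$PIPELINE_NAME"
}
```

After running `create_pipeline`, documents indexed with `?pipeline=attachment-no-data` would keep `attachment.content` but not the original bytes, so the source file can no longer be recovered from the index.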

Search.

$ curl -k -u admin:admin -X POST -H "Content-type:application/json" --data '{"query":{"match":{"attachment.content":{"query":"dummy"}}}}' https://localhost:9200/my_index/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 0.39556286,
        "_source": {
          "filename": "dummy.pdf",
          "data": "...",
          "attachment": {
            "date": "2007-02-23T15:56:37Z",
            "content_type": "application/pdf",
            "author": "Evangelos Vlachogiannis",
            "language": "mt",
            "content": "Dummy PDF file\n\n\n\tDummy PDF file",
            "content_length": 35
          },
          "title": "Dummy PDF"
        }
      }
    ]
  }
}
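
The hit above carries the full base64 `data` field in `_source`, which gets bulky for real files. A search body can leave it out of the response and ask for highlighted snippets instead. This is a minimal sketch against the same index; `search_body` and `search_attachments` are hypothetical helper names, and the search term is assumed to need no JSON escaping.

```shell
# Hypothetical helpers, not part of OpenSearch.

# Build a search body that matches on the extracted text, excludes the
# base64 "data" field from the returned _source, and requests highlights.
# Assumption: $1 contains no quotes or backslashes that would need escaping.
search_body() {
  local term="$1"
  printf '{"_source":{"excludes":["data"]},"query":{"match":{"attachment.content":{"query":"%s"}}},"highlight":{"fields":{"attachment.content":{}}}}' "$term"
}

# Run the search against my_index on the local cluster.
search_attachments() {
  search_body "$1" | curl -s -k -u admin:admin -X POST \
    -H "Content-Type: application/json" --data-binary @- \
    "https://localhost:9200/my_index/_search"
}
```

For example, `search_attachments dummy | jq '.hits.hits[].highlight'` would print only the matching snippets.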