PGSync Introduction: Real-time Integration Tool For PostgreSQL And Elasticsearch

PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source-of-truth and expose structured denormalized documents in Elasticsearch.

This can be useful for building Backend services for driving text search applications or building real-time dashboard applications. Changes to nested entities are propagated to Elasticsearch. PGSync’s advanced query builder then generates SQL queries based on your schema.

Simply describe your schema in JSON and PGSync will continuously capture changes in your data and load it into Elasticsearch. PGSync provides a self-managed solution for change data capture.

Why?

At a high level, you have data in a Postgres database and you want to expose it in Elasticsearch. This means every change to your data needs to be replicated to Elasticsearch.

At first, this seems easy and then it’s not. Simply add some code to copy the data to Elasticsearch after updating the database or perform the so-called dual writes at your application level. Writing SQL queries spanning multiple tables and involving multiple relationships can be nontrivial.

Detecting changes within nested documents can also be quite hard.
Of course, if your data never changed, then you could just take a snapshot in time and load it into Elasticsearch. Keep in mind, you shouldn’t really store your primary data in Elasticsearch. Elasticsearch is more suited as a secondary denormalized search engine to be used alongside a traditional normalized datastore.

One of the challenges is getting the data out of the source of truth and into the secondary store in a reasonable timeframe. Existing tools such as Apaches’ Kafka, Amazons’ Kinesis or Elastics’ Logstash require a fair amount of engineering and expertise.

How it works

PGSync leverages the logical decoding feature of Postgres introduced in PostgreSQL 9.4 to capture a continuous stream of change events.

PGSync’s query builder is capable of building advanced relational queries dynamically from your schema.

Simply, define a schema (JSON) describing the structure of your data in Elasticsearch, bootstrap the databases and start the PGSync daemon.

It operates both a polling and an event-driven model to capture changes made to date and notification for changes that occur at a point in time. The initial sync polls the database for changes since the last iteration and thereafter reverts to event notifications (based on triggers and handled by pg-notify) for changes to the database.

There is no need to pollute your database with fields such as `updated_at`, `timestamp` or `status` flags to detect and track row-level changes.

When to use PGSync

PGSync reduces the complexity of most application stacks.

Postgres is your read/write source of truth whilst Elasticsearch is your read-only search layer.
You have data that is constantly changing.
You have data in an existing relational database such as Postgres and you need a secondary NoSQL database like Elasticsearch for text-based or auto-complete queries.
You want to avoid the development overhead and complexity mandated by other tools.

Installing PGSync

Prerequisites: Python 3.6+, Redis 3.1.0+, Elasticsearch 5.0+, PostgreSQL 9.4+.

Install PGSync from PyPi.

 $ pip install pgsync

Create a schema configuration file in JSON format

This should mirror the structure of the resulting document in Elasticsearch.

Here is an example schema:

{
    "nodes": [
        {
            "table": "book",
            "schema": "public",
            "columns": [
                "isbn",
                "title",
                "description"
            ]
        }
    ]
}

Usage

$ pgsync --config <absolute path to JSON schema config> --daemon

More details about PGSync can be found on its Github repository.

Market Alternatives

Kafka: This is an Apache project which is an open-source stream-processing software platform.
Kinesis: Amazon managed service similar to Kafka.
Logstash: This is a product of Elastic that collects data from various sources, then parses, transforms and ships to various destinations.
ZomboDB: Postgres extension that provides full-text search via the use of Elasticsearch indices.

Also published at https://medium.com/@toluaina/real-time-integration-of-postgresql-with-elasticsearch-with-pgsync-9425ffa9b4e9