PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source-of-truth and expose structured denormalized documents in Elasticsearch.
This can be useful for building Backend services for driving text search applications or building real-time dashboard applications. Changes to nested entities are propagated to Elasticsearch. PGSync’s advanced query builder then generates SQL queries based on your schema.
Simply describe your schema in JSON and PGSync will continuously capture changes in your data and load it into Elasticsearch. PGSync provides a self-managed solution for change data capture.
At a high level, you have data in a Postgres database and you want to expose it in Elasticsearch. This means every change to your data needs to be replicated to Elasticsearch.
At first, this seems easy and then it’s not. Simply add some code to copy the data to Elasticsearch after updating the database or perform the so-called dual writes at your application level. Writing SQL queries spanning multiple tables and involving multiple relationships can be nontrivial.
Detecting changes within nested documents can also be quite hard.
Of course, if your data never changed, then you could just take a snapshot in time and load it into Elasticsearch. Keep in mind, you shouldn’t really store your primary data in Elasticsearch. Elasticsearch is more suited as a secondary denormalized search engine to be used alongside a traditional normalized datastore.
One of the challenges is getting the data out of the source of truth and into the secondary store in a reasonable timeframe. Existing tools such as Apaches’ Kafka, Amazons’ Kinesis or Elastics’ Logstash require a fair amount of engineering and expertise.
PGSync leverages the logical decoding feature of Postgres introduced in PostgreSQL 9.4 to capture a continuous stream of change events.
PGSync’s query builder is capable of building advanced relational queries dynamically from your schema.
Simply, define a schema (JSON) describing the structure of your data in Elasticsearch, bootstrap the databases and start the PGSync daemon.
It operates both a polling and an event-driven model to capture changes made to date and notification for changes that occur at a point in time. The initial sync polls the database for changes since the last iteration and thereafter reverts to event notifications (based on triggers and handled by pg-notify) for changes to the database.
There is no need to pollute your database with fields such as `updated_at`, `timestamp` or `status` flags to detect and track row-level changes.
PGSync reduces the complexity of most application stacks.
Prerequisites: Python 3.6+, Redis 3.1.0+, Elasticsearch 5.0+, PostgreSQL 9.4+.
$ pip install pgsync
Create a schema configuration file in JSON format
This should mirror the structure of the resulting document in Elasticsearch.
Here is an example schema:
{
"nodes": [
{
"table": "book",
"schema": "public",
"columns": [
"isbn",
"title",
"description"
]
}
]
}
$ pgsync --config <absolute path to JSON schema config> --daemon
More details about PGSync can be found on its Github repository.
Also published at https://medium.com/@toluaina/real-time-integration-of-postgresql-with-elasticsearch-with-pgsync-9425ffa9b4e9