A bit of the history as I've been told by 2nd Quadrant/EDB people (my teammates):
BDR1 [0] came first and was, and is, open source. pgactive is based on BDR1. BDR2 was a closed-source rewrite of BDR1 that was later abandoned.
pglogical v1 and v2 (PGL1, PGL2) were, and are, open-source [1].
pglogical v1, after heavy modification, was eventually merged into Postgres 10.
Based on lessons learned from that logical replication work in Postgres 10, 2nd Quadrant started pglogical v2.
pgEdge is based on pglogical v2.
Then later 2nd Quadrant started pglogical v3 (closed source) and BDR v3 (closed source). They were merged into just BDR v4. At some point the BDR product was renamed to Postgres Distributed (PGD) [2].
2ndQuadrant was acquired by EDB. We (EDB) just released PGD v6.
[0] https://github.com/2ndQuadrant/bdr/tree/bdr-plugin/REL1_0_ST...
[1] https://github.com/2ndquadrant/pglogical
[2] https://www.enterprisedb.com/docs/pgd/latest/
And there's also https://postgrespro.com/docs/enterprise/current/multimaster - with a history behind it.
Right, I'm not so familiar with it but from that page:
> The replication mechanism is based on logical decoding and an earlier version of the pglogical extension provided for community by the 2ndQuadrant team.
PGDv6 is still closed source, yeah?
That's right.
Looks like it uses Postgres logical replication to share changes made on one Postgres instance with another. Conflict resolution is last-write-wins based on timestamp. Conflicting transactions are logged to a special table (pgactive_conflict_history), so you can see the history, resolve, etc.
https://github.com/aws/pgactive/tree/main/docs
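For example, something like this (the table name comes from the docs above; I haven't verified its exact schema, so treat the columns as unknown and this as illustrative only):

    -- Peek at logged conflicts; filter/order once you know the actual columns.
    SELECT *
    FROM pgactive_conflict_history
    LIMIT 50;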
Is this multi-master replication? It will be interesting if it can be accepted into Postgres proper.
Did Postgres ever get a built-in, blessed replication offering? It's been a while since I set it up, but I remember this was always a big missing feature compared to Mysql.
The basic replication mechanisms have been built-in for quite a while. What’s not there is cluster management (replica launching, leader election, load balancing, that sort of thing) that makes it practical in a nontrivial situation. There are several 3rd party solutions to that. [0]
Same situation as, e.g., backups. You can just use pg_dump, but to be serious you need a 3rd party solution that does log shipping and so on.
[0] https://www.postgresql.org/download/products/3-clusteringrep...
Streaming and logical replication is built in: https://www.postgresql.org/docs/current/runtime-config-repli...
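For the logical flavor, the built-in commands look like this (object names here are made up):

    -- On the publisher (requires wal_level = logical):
    CREATE PUBLICATION app_pub FOR TABLE orders;

    -- On the subscriber:
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=primary.example.com dbname=app user=replicator'
        PUBLICATION app_pub;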
Sounds like "yes, with an if" where the "if" is "if you don't really care about data consistency".
"Last write wins" sounds like a recipe for disaster IMO.
This is still one of those things that keeps people on MySQL - there are not one, but two open-source solutions available that provide synchronous cluster replication, allowing for "safe" writes against multiple primaries.
Out of curiosity, what conflict resolution options exist in MySQL and/or MySQL Cluster (never checked; my experience is in PG)? You'll always have to address conflicts, of course - we come back to CAP / PACELC. Hm, [1][2] - it looks like they support more strategies (possibly), but none of them are somehow magical, and timestamp-comparison methods make up the better part of the offered strategy set (it looks like?) - and "latest timestamp wins" at least used to be the default (I did not read thoroughly, mind you, was just curious)?
But I could be totally wrong - (1) curious if someone could link to things / explain, and (2) FYI ('stephenr) last-write-wins based on timestamp is a thing in the MySQL world as well (though again, maybe the set of options / conflict resolution methods available is larger in MySQL?)
[1]: https://dev.mysql.com/doc/refman/8.4/en/mysql-cluster-replic...
[2]: https://dev.mysql.com/blog-archive/enhanced-conflict-resolut... (nice writeup, maybe outdated idk?)
For reference those two pages are both about NDB cluster.
The two "options" I was referring to are MySQL group replication and the Galera replication plugin for MySQL. Both provide synchronous replication, so the write either succeeds to a majority of the cluster or is rejected.
Understood, thanks! I wasn't even sure where to look - thank you
It's all tradeoffs: MySQL multi-master and multi-source models have their own issues, and PG also has other options with their own tradeoffs.
ACID + distributed == tradeoffs that will always keep this a horses-for-courses problem.
Sounds interesting. So how soon does one know whether a write has been accepted or rejected? Is it immediate or eventual?
It took 20 years to acknowledge that pushing eventual consistency to the application layer is not worth it for most applications.
Seems the same is playing out in Postgres with this extension; maybe it will take another 20 years.
The idea of active-active is too seductive compared to how hard learning distributed systems is.
It is so seductive that people don’t read the footnotes that explain that active-active does not do what they think it does.
I'd agree. There's so many footguns involved in multi-master setups, that most organisations should avoid this until they're big enough to hire distributed systems engineers to design a proper solution for the company. I personally don't love any of the Postgres multi-master solutions.
You can scale surprisingly far on a single-master Postgres with read replicas.
I'm curious about what you mean here. It sounds like you're saying that applications shouldn't concern themselves with consistency. Can you elaborate?
It's eventual consistency: Latest-write wins after the dust settles.
As I understand it, this is a wrapper on top of Postgres' native logical replication features. Writes are committed locally and then published via a replication slot to subscriber nodes. You have ACID guarantees locally, but not across the entire distributed system.
https://www.postgresql.org/docs/current/logical-replication....
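You can poke at the slot machinery in vanilla Postgres directly (the slot name below is made up, and I'm using the built-in test_decoding plugin rather than whatever output plugin pgactive actually ships):

    -- Create a logical replication slot by hand:
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- See existing slots and how far consumers have acknowledged:
    SELECT slot_name, plugin, active, confirmed_flush_lsn
    FROM pg_replication_slots;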
So the outcomes are essentially random?
It all feels like they expect developers to sift through the conflict log to resolve things manually or something. If a transaction did not go through on some of the nodes, what are the others doing then? What if they cannot roll it back safely?
Such a rabbit hole.
Typically applications will have some kind of logical separation of the data.
Given this is targeted at replication of postgres nodes, perhaps the nodes are deployed across different regions of the globe.
By using active-active replication, all the participating nodes are capable of accepting writes, which simplifies the deployment and querying of postgres (you can read and write to your region-local postgres node).
Now that doesn't mean that all the reads and writes will be on conflicting data. Take the regional example: perhaps the majority of the writes affecting one region's data are made _in that region_. In this case, the region-local postgres would be performing all the conflict resolution locally and sharing the updates with the other nodes.
The reason this simplifies things is that you can treat all your postgres connections as if they were a single postgres. Writes are fast, because they are accepted in the local region, and reads are served locally without you having to run a dedicated read-replica.
Ofc you're still going to have to design around the conflict resolution (i.e. writes for the same data issued against different instances), and the possibility of stale reads as the data is replicated cross-node. But for some applications, this design might be a significant benefit, even with the extra things you need to do.
I think I understand the use case. Like, we have in fact several regional Postgreses, but we want them to be one physical database for the sake of simplicity. Probably this should be in the motivational part of the README.
There’s no free lunch. The rabbit hole is only worth going down if the benefits are worth the operational pain. I view this as a building block, not a checkbox feature that magically just works all the time.
For someone who has these requirements out of the gate, another datastore might be better. But if someone is already deeply tied to Postgres and perhaps doing their own half assed version of this, this option could be great.
What are good off-the-shelf distributed databases? We looked at MongoDB but it wasn't worth giving up SQL. To reiterate the no free lunch point, no one has figured out how to outsmart the CAP theorem yet, so all you can do is design around it.
I work for them so take with a pinch of salt, but Oracle DB. It gives you a fully multi-master horizontally scalable database with ACID transactions (not sharding), full SQL, elastic scalability, built-in queues, JavaScript stored procs, automatic REST API construction, and many other features. Its pricing is competitive with a cloud-hosted Postgres, believe it or not (the clouds are making a lot of money off customers who are wedded to Postgres). I work through some of the numbers for an extreme case here [1].
Behind the scenes, the way it works is by combining software tricks with special hardware. You rent (part of) a database cluster. The cluster runs on high-end hardware with customized kernels, with a private InfiniBand RDMA-capable interconnect between the nodes, separate from the front-side network that clients connect with. A lock manager coordinates ownership of data blocks, which can be read either from disk nodes or directly out of the RAM of other database nodes. So if one node reads a block then writes to it, the only thing written to disk immediately is the transaction log. If another node then needs to write to that block, it's transferred directly over the interconnect using RDMA to avoid waiting on the remote CPU; the disk is never touched. Dirty blocks are written back to disk asynchronously. The current transaction counter is also poked directly into remote nodes via RDMA.
In the latest versions the storage nodes can also do some parts of query processing using predicate push-down, so the amount of data to be transferred over the interconnect is also lowered. The client drivers understand all the horizontal scalability stuff and can fail over between nodes transparently, so the whole setup is HA. A node can die and the cluster will continue, including open transactions.
If you need to accelerate performance further you can add read-through coherent cache nodes. These act as proxies and integrate with the block ownership system to do processing locally.
Other than financial reasons (I own some stock), I've started making this argument here on HN because it's unintuitive but correct, which is just enjoyable. A lot of people in the startup world don't realize any of the above, thinking that horizontally scalable fully coherent SQL databases either don't exist or have severe caveats. E.g. one reply to you suggests FoundationDB which is great, but it's a KV store and not a SQL database.
[1] https://news.ycombinator.com/item?id=44074506 (last paragraph)
FoundationDB and anything based on that.
Spanner
CockroachDB
One way we dealt with this in the past was assigning an "affinity" to each tenant and basically routing their writes/reads to that host, except when that host was down.
You would still get weird replication issues/conflicts when requests failed over in some conditions, but it worked fairly well the majority of the time.
These days I'd stick to single primary/writer as much as possible though tbh.
My guess is that you want to change your design philosophy a bit with regard to table design: move some entities to a composite GUID+timestamp PK and replace most updates with inserts to avoid conflicts, instead resolving things at query time (basically a CRDT modelling philosophy contained within a relational schema).
Ideal? Not entirely, but it should still give most of the query benefits of regular SQL and allow one to benefit from good indexes (the proper indexes of an SQL database will also help contain the costs of an updated data model).
I think this is more interesting for someone building something social-media-like, perhaps, rather than anything involving accounting.
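A rough sketch of the shape I mean (names invented; resolution here is plain last-writer-wins per entity):

    -- Every change is a new row; nothing is ever UPDATEd.
    CREATE TABLE account_revisions (
        account_id uuid        NOT NULL,
        revised_at timestamptz NOT NULL DEFAULT clock_timestamp(),
        balance    numeric     NOT NULL,
        PRIMARY KEY (account_id, revised_at)
    );

    -- Resolve at query time: the latest revision per account wins.
    SELECT DISTINCT ON (account_id) account_id, revised_at, balance
    FROM account_revisions
    ORDER BY account_id, revised_at DESC;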
Are there any Datomic-like query layers on top of Postgres for approaches like this where you're recording immutable occurrences rather than updating mutable records?
No, this would be more about using different ways of mapping data to the database from the application's point of view. So for a developer it would increase the load a tad in terms of exploring data when debugging simple things.
On the other hand, the increase in exploration costs should be more than offset by having most data changes logged, making it possible to track changes.
> So the outcomes are essentially random?
In principle you could use CRDTs to end up with a "not quite random" outcome that simply takes the conflict into account - it doesn't really attempt to "resolve" it. That's quite good for some cases.
This is a kind of CRDT. CRDT is just some papers defining reasonably clear terminology to cover the kind of eventually consistent replication that has been done for decades, including this kind (timestamp-based last-writer wins).
In our case, we're designing around INSERT-only tables with a composite primary key that includes the site id, so (in theory) there will never be any conflicts that need resolution.
> with a composite primary key that includes the site id
It doesn't look like you'd need multi-master replication in that case? You could simply partition tables by site and rely on logical replication.
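Something like this, say (hypothetical table; each site only ever INSERTs rows carrying its own site_id, so replicated rows can't collide):

    CREATE TABLE readings (
        site_id     int         NOT NULL,
        reading_id  uuid        NOT NULL,
        recorded_at timestamptz NOT NULL,
        payload     jsonb,
        PRIMARY KEY (site_id, reading_id)
    ) PARTITION BY LIST (site_id);

    CREATE TABLE readings_site_1 PARTITION OF readings FOR VALUES IN (1);
    CREATE TABLE readings_site_2 PARTITION OF readings FOR VALUES IN (2);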
I think that's absolutely true in the happy scenario when the internet is up.
There's a requirement that during outages each site continue operating independently and *might* need to make writes to data "outside" its normal partition. By having active-active replication the hope is that the whole thing recovers "automatically" (famous last words) to a consistent state once the network comes back.
But if you drop the assumption that each site only writes rows prefixed with its site ID, then you're right back to the original situation where writes can be silently overwritten.
Do you consider that acceptable, or don't you?
Not silently overwritten: the collision is visible to the application layer once connectivity is restored and you can prompt humans to reconcile it if need be.
Sounds like a recipe for a split brain that requires manual recovery and reconciliation.
You could implement a CRDT and partially automate that "recovery and reconciliation" workflow.
That's correct: when the network comes back up we'll present users with a diff view and they can reconcile manually or decide to drop the revision they don't care about.
We're expecting this to be a rare occurrence (during partition, user at site A needs to modify data sourced from B). It doesn't have to be trivially easy for us to recover from, only possible.
Tangential, but related: is there a way to have a "locally writable" read replica, i.e. a secondary db that reads from a primary, but that can also hold local changes that aren't sent back to the primary?
One of the use cases is to have a development db that can get data from production or staging (and doesn't send local changes back)
What I've usually done is have some script/cron/worker run periodically to get the data (either via a dump or by running some queries), create a snapshot, and store it in S3; then a script in the local dev code gets the snapshot and inserts/restores the data into the local db. This works for many cases, but index building can be a pain (it can take a long time), depending on the data.
Just FYI that most people would recommend against doing this for legal reasons. PII and the like are not usually allowed to land in a staging or dev environment, for various reasons. Doing this, or allowing it, is a huge liability.
Agreed, and there are a few ways to deal with that, like not including certain tables or excluding the data via queries when creating the snapshot.
Having said that, legal exposure and risk will highly depend on what you are working on. Probably for most projects this isn’t a big deal. IANAL, this is not legal advice
You can use Postgres logical replication with a filter to create a one-way replica, then just disable the replication slot when you want to make local changes without affecting the primary.
> then just disable the replication slot when you want to make local changes without affecting the primary
This would be all the time, i.e. in my local dev I'd like an up-to-date copy of the source db (daily is ok) that I can modify anytime, but that syncs with the source every day without having to fully rebuild indices.
How would you go about managing/coordinating that?
Load the snapshot into a "pristine" local database that you never modify. Whenever you need a "reset" of your dev database, drop it, then copy the pristine database using `createdb --template`. This copies prebuilt indexes rather than rebuilding them, which is much faster.
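In SQL that's roughly (database names are illustrative; note the template database must have no active connections while copying):

    -- Equivalent to: createdb --template=pristine dev
    DROP DATABASE IF EXISTS dev;
    CREATE DATABASE dev TEMPLATE pristine;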
But when loading that pristine local db from the original source, it would still create the indices and thus take a long time?
The goal is not necessarily having an easy way to reset, but rather an easy/quick way to load real data
Curious about this - how would local writes conflicting with remote updates be handled? I can't think of a merge strategy that works in all scenarios (or even most of the time).
Great question, I don't know. However, at least in my case, I wouldn't mind the source data always overwriting the local data. In fact, that's the way it works now: when loading a newer snapshot, the local db is dropped and then rebuilt from it.
Thinking about the developer experience though, when loading a snapshot manually, the dev knows they are overwriting their local db. However, if replication happened automatically/continuously on the background, it could lead to some really confusing/annoying behaviors
AFAIK that is the standard behavior with a Postgres logical replication setup. There is nothing preventing you doing writes on the replica, they just won't get sent back anywhere else.
I never tire of reminding everyone that "conflict resolution" is no more than a euphemism for "breaking durability by dropping already committed and acknowledged data".
Either architect for no data overlap on writes across all the "actives" (in which case software like pgactive could be a good deal) or use a purely distributed database (like Yugabyte).
I saw in the docs that they recommend a scenario like this: each master is the only writer for a given schema, to avoid conflicts, but the replication gives them all a copy of all the schemas to read from.
And I was wondering what other ways, besides schemas, of dividing up 'writer responsibility' would also work? Partitions?
For inserts it's "easy" if there's no possibility of writing the same PK value twice (e.g. PKs are random values like UUIDs, or include as part of a natural key an identifier unique to the writer - shard id, region id, you name it).
Once you have done that, for updates and deletes you need to keep the same rule (i.e. don't update "foreign" rows).
If you do this, no other technique is needed. Partitions, however, are potentially a good way to enforce some of these invariants, and they give a quick understanding of where data originated just from the table name. The same could apply to schemas.
RLS may also help enforce these invariants.
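For instance, something along these lines (assuming a hypothetical table with a site_id column and a custom app.site_id session setting):

    ALTER TABLE readings ENABLE ROW LEVEL SECURITY;
    -- Table owners bypass RLS unless it is forced:
    ALTER TABLE readings FORCE ROW LEVEL SECURITY;

    -- Anyone may read any row; writes must carry the local site's id.
    CREATE POLICY local_writes_only ON readings
        USING (true)
        WITH CHECK (site_id = current_setting('app.site_id')::int);

    -- Each node sets its identity per session:
    SET app.site_id = '1';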
I'm scratching my head trying to think why AWS would have worked on this? I can't think of it being used in any of their products.
RDS uses block replication. Aurora uses its own SAN replication layer.
DMS maybe?
probably [Aurora DSQL](https://aws.amazon.com/rds/aurora/dsql/), which was released a couple of weeks ago
DSQL uses an internal service, Journal [1], for its multi-region writes.
[1]: https://www.allthingsdistributed.com/2025/05/just-make-it-sc...
It's definitely DSQL with the multi-region active-active feature [1].
[1] https://aws.amazon.com/rds/aurora/dsql/features/#topic-1
DSQL only uses Postgres for the query processor layer, so it doesn't require a replication library within postgres itself. Definitely NOT from DSQL.
> We’re not using any of the storage or transaction processing parts of PostgreSQL, but are using the SQL engine, an adapted version of the planner and optimizer, and the client protocol implementation. [1]
Rather, DSQL seems to do its region replication using the distributed journal abstraction [2].
[1] https://brooker.co.za/blog/2024/12/04/inside-dsql.html [2] https://brooker.co.za/blog/2024/12/06/inside-dsql-cap.html
I thought dsql was distributed transactions. :thinking:
DSQL does use distributed transactions to offer active-active alongside strong consistency and isolation guarantees. Here's how: https://brooker.co.za/blog/2024/12/05/inside-dsql-writes.htm...
Yeah, and this doesn't seem to be that useful. At least I don't understand why one should do this on a strong ACID relational database.
In my experience multi-writer is for latency or HADR stuff - have all your data in all regions at the same time - but the method here (via the tlog) seems like it sort of defeats what those sorts of systems have historically been able to do (write to multiple places from the app at the same time so as to have the lowest possible chance of data loss).
Yes, I call it spoke-and-hub. The sharded spokes accept the writes and replicate back to the hub, where all shards coexist.
Useful for metric ingestion. Not useful for bank ledgers or whatever.
Yeah, some instances also have all the shards maintain all the state and just accept writes in your partition of values; merge replication in SQL Server works like this.
> Aurora uses it's own SAN replication layer
I don't think that is used for cross region replication
Apparently they've offered it as a feature in RDS Postgres for a couple years now https://aws.amazon.com/about-aws/whats-new/2023/10/pgactive-...
But only last month did they officially release it as open source to the community https://aws-news.com/article/2025-06-09-announcing-open-sour...
A lot of theorising above, but this seems like the earliest reference
Thanks, that seems to match.
From the readme of the repo: "Use cases for this include running a Multi-Region high availability database cluster"
I think AWS is using pgactive to sell "Aurora Postgres Global" https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
After setting up numerous clusters with repmgr and patroni, along with running them in zero-downtime production... this is the very last plugin I would ever install. I like to sleep at night.
Speaking of databases, has anyone run Bloomberg's db (https://github.com/bloomberg/comdb2) recently?
Related. Others?
Pgactive: Active-Active Replication Extension for PostgreSQL on Amazon RDS - https://news.ycombinator.com/item?id=37838223 - Oct 2023 (1 comment)
Coincidentally I've been trying to figure out a nice no-nonsense way to set up an HA postgres cluster with automatic failover, restoration of nodes, and point-in-time recovery.
I see a lot of patroni with etcd and haproxy being advised. It must work well for people to be so excited about it, but it feels a bit overwhelming to me when I look at the docker compose files.
At the same time there is pgpool, which looks like mostly a single thing to deploy in front of each postgres server.
Any tips from the pg-interested people here?
I'd like a docker-compose-like experience for setting up a cluster that is highly available, with point-in-time recovery or at least no data loss.
Are you looking for a tool like Barman?
I don't know. Barman would only be part of the HA puzzle no?
For failover handling repmgr is probably better (they can work together though)
Thanks. What role does repmgr play?
I've been using cloudnativepg and it is pretty turnkey for all of that.
Needs Kubernetes, though, which might throw a complexity wrench in if you're not familiar.
If you are familiar, then it's about as turn-key as you can get.
There's also a different PG operator based on Patroni that's supposed to work pretty well iirc
It seems async? That's a major problem for transaction isolation
Pick your poison.
Don't use this unless you know exactly what you are doing.
This is not a way to get better performance or scalability in general.
Seems rather niche. You can already improve async write availability with CQRS and a (durable) queue. Systems like Kafka implement this out of the box
Seems sort of like a CQRS implementation on top of PG (you're using PG replication as the change queue to loosely separate writes/reads, losing transaction guarantees in the process)