I’ve set up my website (app service) and database (SQL Server) in Azure. Right now it works like this:
prod website -> prod database
The prod website has connection strings to the prod database and all is good!
What I want to do now is create a dev website (slot), so I can create and test new features.
dev site (slot) -> dev database
I can create a new app service or just create a slot off my prod site, but I’m not sure whether I want my dev database to be a mirror, replica, or copy of the production database. If I make changes (to test new features), can I manually or on a schedule reset the dev database back to whatever prod is? For example, every night at midnight re-mirror prod to dev, so any changes made to dev during the day get wiped out.
Questions – What’s the best solution for a dev/test db?
- Mirroring or replication or copying or copy with sync, …?
- Is there something else, as I’m new to all the different Azure features?
2 Answers
I will try to give you a solution here that covers the concerns raised by myself and others in the comments, but which still approaches what you want to do.
But we have to agree on a premise that provides the logical basis for the entire methodology: The development schema is not identical to the production schema. By definition, it is normally expected to be a newer version of the schema that has yet to be deployed to production. The development schema might have different tables, different columns, different SQL module definitions, different constraints, and so on and so forth.
Given that premise, you can’t realistically use any technology that tries to keep the data in development "synched" to the data in production in near-real-time. You can’t use mirroring or an availability group replica anyway, since those would result in a development database that is either unavailable or read only, preventing you from doing any development.
You could, theoretically, try to set up replication. You would have to author custom replication procedures to do this, that "account for" potential differences in the schema. I leave it as an exercise to the reader to imagine all the difficulties you would face trying to do this.
So, on the basis that there is not (and, really, cannot be) any technology that keeps the environments "in synch in near real time" due to potential schema differences, we move to not-near-real-time synch options.
The obvious choice is backup and restore. The production system is being backed up anyway, and it’s good practice to make sure those backups are actually working by making sure they can be restored. So, we set up a scheduled job (nightly, weekly, whatever you choose) to restore the production backup to the development environment.
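If this is Azure SQL Database rather than SQL Server on a VM, there is no direct RESTORE ... FROM DISK; the closest equivalent of this refresh is a database copy (or a point-in-time restore). A minimal sketch, assuming hypothetical database names ProdDb and DevDb on the same logical server:

```sql
-- Run against the master database of the logical server.
-- Throw away yesterday's dev copy...
DROP DATABASE IF EXISTS DevDb;
GO
-- ...and take a transactionally consistent copy of production.
-- If the source lives on a different logical server, qualify it
-- as [prod-server-name].ProdDb.
CREATE DATABASE DevDb AS COPY OF ProdDb;
GO
```

The copy runs asynchronously, so whatever scheduler you use (Azure Automation, an Elastic Job, a pipeline task) should wait for it to complete before running the post-restore steps described below. On a full SQL Server instance, this step is just a plain RESTORE DATABASE ... WITH REPLACE of the latest production backup.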
But this is not sufficient; we now have to address a few problems this creates.
First of all, any work that developers had done in development that day is now gone. When they come in the next morning expecting to be able to continue development, they will need to reapply any changes that had not yet made it to production. Trying to do this manually will very quickly become a frustrating nightmare.
So this restore-from-prod strategy really requires that you have some kind of automated deployment for SQL changes. After the production database is automatically restored you would also automate the deployment of any migration scripts that had been executed only on the development environment. This means you need a way of identifying which migrations those are. There are any number of ways to do this, and the decisions on how any particular org should do it are going to involve contingencies and opinions specific to the org. I will not attempt to provide a full SQL deployment automation solution in this answer.
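One low-tech way to make those migrations identifiable is a journal table that the deployment pipeline maintains in every environment; most migration tools (Flyway, DbUp, EF Core migrations and so on) keep something equivalent for you. A rough sketch, with hypothetical names:

```sql
-- Journal table, created once per environment by the deployment pipeline.
CREATE TABLE dbo.SchemaMigration
(
    MigrationId  varchar(100) NOT NULL PRIMARY KEY,  -- e.g. '2024-06-01_add-order-flag'
    AppliedAtUtc datetime2    NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- After the restore from prod, the journal only lists migrations that have
-- reached production. The pipeline re-runs, in order, every script whose id
-- is missing and records it:
IF NOT EXISTS (SELECT 1 FROM dbo.SchemaMigration WHERE MigrationId = '2024-06-01_add-order-flag')
BEGIN
    ALTER TABLE dbo.[Order] ADD IsRush bit NOT NULL DEFAULT 0;  -- the migration itself

    INSERT dbo.SchemaMigration (MigrationId)
    VALUES ('2024-06-01_add-order-flag');
END;
```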
These "migrations" should also include the creation of any data necessary for the new features to function, and any data required for devs to be able to do testing. For example, suppose my prod database has table `T (i int primary key, c char)`. A developer has altered this table for the next version, so it is now `T (i int primary key, c char, d date)`. The developer is obviously going to want to run code that depends on the `d` column, so we need data in that column. This should be part of a test preparation script.

In addition, the data in production is "whatever". You have no idea what rows might have been updated, deleted, or inserted. Therefore, if your development team is running integration tests that check for a specific value or result set when running their code, those tests are now very likely to fail.
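For the hypothetical `T` example, that preparation might be as small as backfilling the new column once the migration has run:

```sql
-- Dev-only test preparation: give the new T.d column values so that
-- feature code which reads it has something to work with.
UPDATE dbo.T
SET    d = DATEADD(DAY, i % 30, '2024-01-01')  -- arbitrary but deterministic
WHERE  d IS NULL;
```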
You must therefore also insert specific, known test data, defined by the implementation team, for use in the tests so that you can check a result against a known expected result. This should be done for every table. It doesn’t have to be a lot of data. A row per reference table, one or two rows per transactional table. These should be created following the usual code paths wherever possible to ensure that any downstream changes are applied.
For example, suppose you are testing an ERP. You want invoice data to test with. But the ERP also manages the general ledger, and changes to invoices should result in changes to the general ledger. If you simply run an insert against the invoice table, you will probably not get the expected changes in the general ledger.
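To make the contrast concrete (all object names here are hypothetical):

```sql
-- Naive: creates an invoice row but bypasses the posting logic,
-- so no general ledger entries appear.
INSERT dbo.Invoice (InvoiceId, CustomerId, Amount, InvoiceDate)
VALUES (1000000, 42, 500.00, '2024-01-15');

-- Better: go through the same code path the application uses (assumed
-- here to be a stored procedure), so downstream effects such as the
-- GL postings happen as they would in real use.
EXEC dbo.usp_CreateInvoice
     @CustomerId  = 42,
     @Amount      = 500.00,
     @InvoiceDate = '2024-01-15';
```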
What if those processes have changed in the next version of the system? You need to insert this data prior to applying the schema migration, and then populate any new/changed dev-only data after the schema migration. What if those processes require calling across application-level code spread around a dozen microservices? Then you’re going to have a hard time setting this up.
OK, let’s say you now have automated the restore of the production database to development, and you have an automatic migration deployment pipeline that brings dev back to what it was before it was replaced by the production restore. Are we done? Not by a long shot.
The next problem is that the database being restored from production will bring across the production users. Suppose I have some application service account that connects to production using either a domain account or a SQL Server auth account created for that application. That account is now able to connect to the dev environment. If you simply leave things as they are and use that account to connect to the dev environment, you run the very serious risk of accidentally connecting to the prod environment from the dev application. The likelihood of this may be low, but the possible consequence of doing so could be Very Bad. It is therefore a better idea to use separate accounts for different environments in your SDLC.
That means we also need to run an environment specific post-restore script that will drop or disable the production users, and add the dev users in their place. The easiest way to manage this is to use a database role for the application. You grant all required permissions to the role. Then, when you finish your restore-from-prod-to-dev, you drop the production user out of the role, and put the dev user in the role.
You’re also probably going to want to grant developers their permissions in this script. Devs typically don’t have elevated rights on prod environments, but do on dev environments.
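Put together, the environment-specific part of the post-restore script might look something like this (role, user, and login names are all assumptions):

```sql
-- Swap the application role membership from the prod user to the dev user.
ALTER ROLE app_role DROP MEMBER app_user_prod;

IF NOT EXISTS (SELECT 1 FROM sys.database_principals WHERE name = 'app_user_dev')
    CREATE USER app_user_dev FOR LOGIN app_login_dev;

ALTER ROLE app_role ADD MEMBER app_user_dev;

-- Optionally drop the prod user entirely so it cannot be used by mistake.
DROP USER IF EXISTS app_user_prod;

-- Developers get elevated rights in dev that they do not have in prod
-- (dev_team_user is assumed to already exist as a database user).
ALTER ROLE db_owner ADD MEMBER dev_team_user;
```

How the dev users themselves get created depends on whether you use Windows, Azure AD, or SQL Server authentication, so that part is left out of the sketch.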
Technical detail: If you are using SQL Server authentication for any logins, you will need to remap the users in the database to those logins using `sp_change_users_login` (deprecated) or `ALTER USER`.
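On a full SQL Server instance, the `ALTER USER` form is a one-liner per orphaned user (names assumed):

```sql
-- The restored app_user is still mapped to the prod login's SID;
-- re-point it at the dev server's login of the same name.
ALTER USER app_user WITH LOGIN = app_user;
```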
We’re still not done. Production databases contain production data. That might mean sensitive data. It could be personal information (which should be encrypted at rest anyway) that development staff shouldn’t be able to see. It could mean financial information that development staff shouldn’t be able to see.
You can use Dynamic Data Masking (a SQL Server feature) to handle this. Alternatively you can just run a DML script that goes through and updates data in the database, obfuscating any sensitive information. You might replace all email addresses with "[email protected]". You might incorporate the value of a surrogate identity column into the replacement data if you need these emails to not all be the same for some kind of testing purpose.
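A sketch of both approaches, assuming a hypothetical dbo.Customer table with Email and CustomerId columns:

```sql
-- Option 1: Dynamic Data Masking - the real value stays in the database
-- but is masked for principals that lack the UNMASK permission.
ALTER TABLE dbo.Customer
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Option 2: physically overwrite the data after the refresh, folding in
-- the surrogate key so the values stay unique.
UPDATE dbo.Customer
SET    Email = CONCAT('customer', CustomerId, '@example.com');
```

Bear in mind that masking is bypassed by anyone who holds (or can grant themselves) UNMASK, so if developers have elevated rights in dev, the physical overwrite is the safer of the two.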
Note that running a script like this can result in changes to the statistics histogram, which will change the performance characteristics of queries. Your development environment won’t be a good indication of likely production performance if you do this. But to be fair, it probably wasn’t anyway. Development environments are rarely a good indicator of production performance for a whole host of reasons already.
If you feel like all of the data-obfuscation tasks here are too onerous, or if your production database is too enormous for your dev resources, there is another option entirely.
Instead of restoring the production database to dev, you drop the dev database and execute the DDL of the production schema to create a new, empty environment that matches the prod schema. You can use lots of different tools to accomplish this (Redgate, for example).
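In its simplest form, assuming the production DDL has been scripted out to a file by whichever tool you use, and keeping the hypothetical DevDb name:

```sql
-- Run from master: recreate an empty dev database (in Azure SQL Database,
-- CREATE DATABASE must be the only statement in its batch).
DROP DATABASE IF EXISTS DevDb;
GO
CREATE DATABASE DevDb;
GO
-- Then deploy the scripted production schema into it, for example:
--   sqlcmd -S dev-server -d DevDb -i prod_schema.sql
```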
But this schema is, of course, empty. So now you run your data population scripts. You must already have some of these. For example, if a developer adds a new table which contains "system defined" (not user defined) information, that must already have been part of some migration script at some point. So you run all of those.
Then, to populate the business data, you can either execute scripts that you manually authored, or use a tool that tries to automatically populate a schema with "random" data. Again, Redgate has tooling for this, but there are several options available. This also "solves" your testing issue if you set the rules for how that data is generated such that you end up with some "known" rows which can feed the expected results of tests.
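For the "known" rows in particular, it helps to make the seed scripts idempotent so they can be re-run after every refresh regardless of what is already there; for example (names hypothetical):

```sql
-- Idempotent seed of a reference row that tests can always rely on.
MERGE dbo.OrderStatus AS target
USING (VALUES (999, 'TEST-STATUS')) AS src (OrderStatusId, Name)
    ON target.OrderStatusId = src.OrderStatusId
WHEN MATCHED THEN
    UPDATE SET Name = src.Name
WHEN NOT MATCHED THEN
    INSERT (OrderStatusId, Name) VALUES (src.OrderStatusId, src.Name);
```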
If you take this alternative, you still need to do the user management.
There are a few topics to consider: