Monday, July 2, 2012

Retention Tags Fiasco

An Innocent Beginning
A user asked for a personal 'Delete' Retention Tag with a 14-day limit. I created one and added it to the policy that is applied to about 1600 users. A personal tag is meant for folders created by the user and can be applied to any folder. We also have a "Default - All other folders 14 Day Delete" tag, which deletes anything over 14 days old in any folder that has no tag of its own. Essentially, the default tag.
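
For context, here's roughly how those two tag types get created in the Exchange Management Shell. This is a sketch only; the first tag name and the retention action are my stand-ins, not necessarily our exact settings:

# A hedged sketch of the two tag types (Exchange 2010 cmdlets).
# A personal tag the user applies to folders of his choosing:
New-RetentionPolicyTag "14 Day Delete (Personal)" -Type Personal `
    -RetentionAction DeleteAndAllowRecovery -AgeLimitForRetention 14
# The default tag, which sweeps every folder that has no tag of its own:
New-RetentionPolicyTag "Default - All other folders 14 Day Delete" -Type All `
    -RetentionAction DeleteAndAllowRecovery -AgeLimitForRetention 14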

I hate to admit this, but I did a pretty dumb thing: I chose the wrong one (the default tag instead of the personal one) and applied it to the Policy for the 1600+ users.
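
In shell terms, the fatal step was a one-liner shaped something like this (the policy name here is a stand-in; @{Add=} is the standard multivalued-property syntax):

# A hedged sketch of the mistake: linking the default tag, not the personal
# one, into the policy that covers the 1600+ mailboxes.
Set-RetentionPolicy "Corporate Retention Policy" `
    -RetentionPolicyTagLinks @{Add="Default - All other folders 14 Day Delete"}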

So the person who requested this new policy calls me and says, "Hey, this was supposed to just delete stuff in my single folder, but it's deleting everything older than 14 days!" Of course I think he is the typical crazy user. Then I look at my mail and notice that some old stuff I need is gone. I looked at the policy applied to me and lo and behold, there it was, big as life: "14 day delete (2 weeks)" stamped all over my mailbox folders. This is where I panicked.

I went to all 32 mailbox servers and stopped the service for the "Exchange Mailbox Assistants." I prayed that this did not get too many people... (Yes, I know I can run a command to do this remotely, but our servers are not set up for that as yet. I just never got around to it. I'll get to that as soon as I finish ...)
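
For anyone better prepared than I was, the remote version would look something like this, assuming PowerShell remoting were enabled and you're in the Exchange Management Shell:

# A hedged sketch: stop the Mailbox Assistants on every mailbox server at once.
# (This needs PowerShell remoting, which our servers don't have enabled yet.)
Get-MailboxServer | ForEach-Object {
    Invoke-Command -ComputerName $_.Name -ScriptBlock {
        Stop-Service -Name MSExchangeMailboxAssistants
    }
}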

Then I had to humbly take myself to the boss and say, "I screwed the pooch." Not only did I not have a Change Management ticket, but I made the change during the day. Two major infractions. I guess I was thinking this was only a minor maintenance thing, or that I was not capable of screwing up. Probably a lot of both. Adding this tag was a very simple thing, for sure, but it caused a very big problem.

The Powers That Be wanted to know:
  1. Who did this affect?
  2. How can we get their mail back?
  3. How do we erase the Tag?
A call went out to Microsoft. I wanted to find the quickest way to reverse what I had done.

Who did this affect?
We bumped the logging level up to Expert so that, when we started the Mailbox Assistants back up, we might find some indication of which mailboxes had been touched. Then we started looking through the event logs to see if there was any sign of mailbox identities already there...
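
(The cmdlet for bumping the level is roughly this; I'd verify the exact category name with Get-EventLogLevel first, since I'm going from memory:)

# A hedged sketch: raise Mailbox Assistants diagnostic logging to Expert.
# List the categories first to confirm the identity, e.g.:
#   Get-EventLogLevel "MSExchange Assistants\*"
Set-EventLogLevel "MSExchange Assistants\Assistants" -Level Expert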

Event 9001 - the service is starting on a database.
Event 9017 - a Mailbox Assistant is starting work ("there are 600 mailboxes in here," etc.), but no indication of which mailboxes it is working on.
Event 9018 - there was a failure for a mailbox, without saying which one or why it failed. Just a count.
Event 9021 - a special request started.
Event 9022 - a special request finished.
Event 9025 - a mailbox was skipped, with its GUID; many of these seemed to be disconnected mailboxes.
Event 9037 - there was an error on something.
Event 9112 - a list of stale mailboxes with decayed watermarks; not sure what this is yet.

Bottom line: there is no way to tell who this affected. You have to look at the mailboxes themselves, and there were 1600 of them; we can't open each one. So we had to bite the bullet and send a message to all 1600 users saying, "Uh, you may not have been hit ... but, uh, look at a message. If you see <this> you might be one of the people we want to help. Oh, and you can also tell because you may or may not have lost all your mail older than 14 days." -- Boy, that's gonna hurt.
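
In hindsight, a crude heuristic might have narrowed the list: if the oldest item left in a mailbox's Inbox is now younger than 14 days, the tag probably swept it. This is a sketch I never actually ran, and it assumes Exchange 2010 SP1 for the -IncludeOldestAndNewestItems switch:

# A hedged sketch, not something we ran at the time.
$cutoff = (Get-Date).AddDays(-14)
Get-Mailbox -ResultSize Unlimited | ForEach-Object {
    $inbox = Get-MailboxFolderStatistics $_.Identity -FolderScope Inbox `
        -IncludeOldestAndNewestItems | Where-Object { $_.FolderType -eq 'Inbox' }
    # a swept Inbox has nothing older than the cutoff left in it
    if ($inbox.OldestItemReceivedDate -gt $cutoff) { $_.DisplayName }
}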

How Do We Get Their Mail Back?
Restoring from backup is the only way, unless you can live with recovering deleted items into a newly created folder. We gave users that as a temporary solution, but did not expect them to want to refile all that data. So, again... restore from backup. To top it all off, we didn't have much room left on any of the servers.

These people are scattered all over the place, with primary mailboxes in one database and secondary mailboxes in others. There were 72 databases involved. Sixteen house secondary mailboxes and weigh in at 325 GB or more, and there was hardly a place anywhere to restore databases that size. Luckily, two servers had enough space to handle two of those databases each. The other 56 databases we got over two nights, putting one database on each server where we could. (Did you remember that you can only have one Recovery Database mounted on a server at one time? I had to learn that one again!)
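
Each restore target was a Recovery Database, created along these lines (the server and path names here are made up):

# A hedged sketch: create and mount a Recovery Database on a server with space.
New-MailboxDatabase -Recovery -Name 'DB01Recovery' -Server 'MBX01' `
    -EdbFilePath 'E:\Recovery\DB01Recovery\DB01.edb' `
    -LogFolderPath 'E:\Recovery\DB01Recovery'
# ...restore the backup files into that path, bring the database to a clean
# shutdown (eseutil /R if needed), then:
Mount-Database 'DB01Recovery'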

How Do We Erase The Tag?
I was afraid this was going to be the hardest part, and it turned out to be the easiest.
"If thy policy offends thee, pluck it out!"

So the sequence was:
  1. Set the Retention Policy on all mailboxes to $Null.
  2. Remove the offending Policy Tag from the Retention Policy.
  3. Delete the Policy Tag.
  4. Delete the Retention Policy.
  5. Restart the Mailbox Assistants service.
Since there is no longer a Policy Tag to reference, the Assistant removes the Tag from each message. One could just as easily have applied a new policy, and it would have overwritten any existing Tags, but we never tested that option. We removed all the Policy Tags and started the Mailbox Assistants service on the server with my mailbox first. All the scary tags disappeared.
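
In shell terms, the whole sequence was roughly this (the policy and tag names are stand-ins for ours):

# A hedged sketch of the cleanup (Exchange 2010 cmdlets).
# 1. Detach the policy from every mailbox:
Get-Mailbox -ResultSize Unlimited | Set-Mailbox -RetentionPolicy $Null
# 2. Unlink the offending tag from the policy:
Set-RetentionPolicy "Corporate Retention Policy" `
    -RetentionPolicyTagLinks @{Remove="Default - All other folders 14 Day Delete"}
# 3. Delete the tag:
Remove-RetentionPolicyTag "Default - All other folders 14 Day Delete"
# 4. Delete the policy:
Remove-RetentionPolicy "Corporate Retention Policy"
# 5. Restart the assistants on each mailbox server:
Restart-Service MSExchangeMailboxAssistants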

Fixing The Wrong
First I wanted all the mailboxes in the database that had a secondary Archive mailbox. I sorted them so I would have some kind of easy indicator of how far along the process had gotten. The alphabet was good enough for me.

# find only the people with Archive mailboxes, sorted so progress is easy to track
$MBX = Get-Mailbox -Database 'DB01' -Archive -ResultSize Unlimited | Sort-Object DisplayName

Next it was simply a matter of piping the results to the restore command:

# pull each mailbox's content back out of the mounted Recovery Database
$MBX | Restore-Mailbox -RecoveryDatabase 'DB01Recovery' -MaxThreads 25
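
From there it was rinse and repeat per database. A loop like this would cover the rest, as long as each recovery database is already mounted on its server (database names made up):

# A hedged sketch: the same restore, once per affected database.
foreach ($db in 'DB01','DB02','DB03') {
    Get-Mailbox -Database $db -Archive -ResultSize Unlimited |
        Sort-Object DisplayName |
        Restore-Mailbox -RecoveryDatabase "${db}Recovery" -MaxThreads 25 -Confirm:$false
}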


Lessons Learned
Don't be too lazy: I even thought about creating a test Retention Policy and applying it to the requestor, but I never did. I remember talking myself out of it -- "It's a personal tag, just do it and move on to the next task on that long list of things to do today."

You're never too busy to do something right the first time: I am reminded of this yet again. A few years from now, I'll get in a hurry again and Murphy's Law will hit me over the head. It can actually save you time to take the time to do it right. Screwing the pooch like this cost me many hours. And let's not forget the backup team, who had to jump through hoops to get me all these restores.







