Monday, July 16, 2012

Retention Tags Fiasco - Part II

In my last post I talked about having to eat humble pie by going to the Powers-That-Be and admitting to a mistake. A two-week default retention tag was applied to a group of mailboxes and lots of mail was deleted.

As it turned out we are having to restore all the databases containing affected mailboxes. Many of those databases held secondary (archive) mailboxes. 94 mailbox databases in all I believe...

When we were restoring the primary mailboxes the command Restore-Mailbox worked flawlessly. It was speedy. It was accurate. Lovely.

Then we got down to recovering the data in the Archive mailbox, and things started to get sketchy.
The Restore-Mailbox command does not understand archive mailboxes. Before I did this I called Microsoft Support with a simple question.

"I don't see a -Archive option for Restore-Mailbox, so just how does it work with archive mailboxes?"
"Ummm, I see what you mean. Very possible it just knows the mailboxes in the databases are archive mailboxes and will handle them accordingly."
"So, very possible, or you're certain?"
"Yes, I'm certain."

Well he might have been certain, but when 9PM rolled around and he was watching his T.V., I was finding out the hard way he had no clue what he was talking about.

But fortunately for me, this guy did :
http://eightwone.com/2011/01/07/restoring-personal-archives/

So -- for instance -- I have a mailbox database, called Arch01. In the mailbox database there are only Archive Mailboxes.

So to get a reference to all the mailboxes that have archive mailboxes in that database:
(We only have 1600 users with archive mailboxes.)

$ArchMBX = Get-mailbox -archive -Resultsize 2000 |
           ?{$_.ArchiveDataBase -eq 'Arch01'}

Then use the New-MailboxRestoreRequest command:

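# Queue one suspended restore request per archive mailbox.
# The archive is looked up in the recovery database by its ArchiveGuid.
$ArchMBX | %{
   New-MailboxRestoreRequest -SourceDatabase Arch01Recovery `
   -SourceStoreMailbox $_.ArchiveGuid -TargetMailbox $_.Identity `
   -TargetIsArchive -BatchName Arch01 -Suspend
}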
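
Since the requests are created suspended, nothing moves until they are resumed. A rough sketch of how that might look, assuming the Arch01 batch name from above (the property names in the last line are the ones I'd expect on the statistics object, so adjust if yours differ):

```powershell
# Resume every suspended request in the Arch01 batch
Get-MailboxRestoreRequest -BatchName Arch01 -Status Suspended |
    Resume-MailboxRestoreRequest

# Check how far along each request is
Get-MailboxRestoreRequest -BatchName Arch01 |
    Get-MailboxRestoreRequestStatistics |
    Sort-Object PercentComplete |
    Format-Table TargetAlias, Status, PercentComplete -AutoSize
```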

I am doing this on 16 very large databases and 16 small ones. The New-MailboxRestoreRequest command can be very slow. I am not sure if it is the amount of data in some of these mailboxes (many are over 8 GB and a few are up in the 20 GB range) or if there is something else going on here. Some mailboxes were taking up to 18 hours to complete.

Seemed to me the Mailbox Restore Request was bogging down and getting slower and slower and moving less and less data. I did have about 450 requests queued up, but I've had more than that in some mail migrations. I experimented with restarting the Mailbox Replication Service on all the CAS servers in this AD Site. For me, every 4 hours seemed to be the magic time frame. I would just suspend all the jobs, then restart that service on all the CAS servers and resume the jobs.
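
The suspend / restart / resume routine could be scripted roughly like this. It is only a sketch: it assumes all the requests are in one batch (Arch01 here), that the Mailbox Replication Service runs on every CAS server returned by Get-ClientAccessServer, and that the services can be reached remotely from wherever this runs.

```powershell
# Grab everything that is queued or actively moving data
$active = Get-MailboxRestoreRequest -BatchName Arch01 |
    Where-Object { 'Queued', 'InProgress' -contains "$($_.Status)" }

# Park the jobs before touching the service
$active | Suspend-MailboxRestoreRequest -Confirm:$false

# Restart the Mailbox Replication Service on each CAS server
Get-ClientAccessServer | ForEach-Object {
    Get-Service -ComputerName $_.Name -Name MSExchangeMailboxReplication |
        Restart-Service
}

# Pick up where we left off
$active | Resume-MailboxRestoreRequest
```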

Another odd thing about the New-MailboxRestoreRequest command is that it was crashing store.exe on some occasions, as if it had run into corruption in a mailbox. Initially, when a New-MailboxRestoreRequest failed, I'd set -BadItemLimit to something very high and resume it. Now I just make a note of that mailbox and move on. In my mind there is only a 25% chance they were hit anyway. Why sweat it? If they need to be restored, we would restore the mailbox database again, but from a different time frame.
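
For what it's worth, my original approach to failed requests looked something like the sketch below. The limit of 100 is just an example value, and as far as I know anything above 50 also needs -AcceptLargeDataLoss.

```powershell
# Raise the bad item limit on each failed request, then try it again
Get-MailboxRestoreRequest -BatchName Arch01 -Status Failed | ForEach-Object {
    Set-MailboxRestoreRequest -Identity $_.Identity -BadItemLimit 100 -AcceptLargeDataLoss
    Resume-MailboxRestoreRequest -Identity $_.Identity
}
```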

Finally, I found that throwing a lot of recoveries at a mailbox database seems to overtax the log file space as well as indexing. We use circular logging, but the data was being pumped in too fast. As items were being recovered, they were added to the indexing queue and were filling it up, too.

So now I start all my new requests with -Suspend to be on the safe side. I also choked the Mailbox Replication Service to allow only 5 mailbox restores max per target database. This seemed to help with those log file directories and eased the indexing some.
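
The throttle lives in the Mailbox Replication Service config file on each CAS server. The sketch below only reads the current values. The file location and the MaxActiveMovesPerTargetMDB attribute name are from memory, so treat both as assumptions and check them on your own build before editing anything. The service has to be restarted after a change.

```powershell
# Assumed location of the MRS config file on a CAS server
$path = Join-Path $env:ExchangeInstallPath 'Bin\MSExchangeMailboxReplication.exe.config'

# Load it as XML and show the per-database limits
[xml]$config = Get-Content $path
$config.configuration.MRSConfiguration |
    Format-List MaxActiveMovesPerSourceMDB, MaxActiveMovesPerTargetMDB
```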




Monday, July 2, 2012

Retention Tags Fiasco

An Innocent Beginning
A user requested to have a Personal 'Delete' Retention Tag for 14 days. I created one and added it to the policy that is applied to about 1600 users.  A personal folders tag is meant for folders created by the user and can be applied to any folder. We also have a "Default -All other folders 14 Day Delete" which deletes anything over 14 days old in any normal folder without a policy applied. Essentially, the default policy.

I hate to admit this, but I did a pretty dumb thing. I chose the wrong one and applied it to the Policy for the 1600+ users.

So the person who requested this new policy calls me and says, "Hey, this was supposed to just delete stuff in my single folder but it's deleting everything older than 14 days!" Of course I think he is the typical crazy user. Then I look at my mail and notice that some old stuff I need is gone. I looked at the policy applied to me and lo and behold, there it was, big as life. "14 day delete (2 weeks)" stamped all over my mailbox folders. This is where I panicked.

I went to all 32 mailbox servers and stopped the service for the "Exchange Mailbox Assistants." I prayed that this did not get too many people... (Yes, I know I can run a command to do this remotely, but our servers are not set up for that as yet. I just never got around to it. I'll get to that as soon as I finish ...)
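
For the record, the remote version I should have had ready would look something like this. It is a sketch that assumes the machine running it can reach the service control manager on every mailbox server, which ours could not at the time.

```powershell
# Stop the Mailbox Assistants service on every mailbox server in the org
Get-MailboxServer | ForEach-Object {
    Get-Service -ComputerName $_.Name -Name MSExchangeMailboxAssistants |
        Stop-Service
}
```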

Then I had to humbly take myself to the boss and say, I screwed the pooch. Not only did I not have a Change Management ticket, but I made the change during the day. Two major infractions. I guess I was thinking this was only a minor maintenance thing, or I was not capable of screwing up. Probably a lot of both. So I had added this policy in, a very simple thing for sure, but it caused a very big problem.

The Powers That Be wanted to know:
  1. Who did this affect?
  2. How can we get their mail back?
  3. How do we erase the Tag?
A call went out to Microsoft. I wanted to find the quickest way to reverse what I had done.

Who did this affect?
We bumped the logging level up to expert so when we started the Mailbox Assistants back up again we might find some indicators as to which mailboxes had been touched. Then we started looking through the event logs to see if there was any indication of mailbox identities already there...

Event 9001 - the service is starting on a database
Event 9017 - Mailbox Assistant starting work, there are 600 mailboxes in here, etc -- but no information of which ones we're working on.
Event 9018 - there was a failure for a mailbox. Not saying which or why it failed. Just a count.
Event 9021 - special request started
Event 9022 - special request finished
Event 9025 - a mailbox was skipped - here is the GUID - seems many of these were disconnected mailboxes
Event 9037 - there was an error on something
Event 9112 - list of stale mailboxes with decayed watermarks - not sure what this is yet
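
Pulling those events out of the Application log can be done along these lines. MBX01 is a made-up server name and the one-day window is arbitrary.

```powershell
# Event IDs we were watching for
$ids = 9001, 9017, 9018, 9021, 9022, 9025, 9037, 9112

# MBX01 is a hypothetical mailbox server name
Get-EventLog -LogName Application -ComputerName MBX01 -After (Get-Date).AddDays(-1) |
    Where-Object { $ids -contains $_.EventID } |
    Select-Object TimeGenerated, EventID, Message
```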

Bottom line: There is no way to tell who this affected. You have to look at the mailboxes themselves. But there were 1600 of them. We can't open each one. So we had to bite the bullet and send a message to all the 1600 users saying "uh, you may not have been hit ... but, uh, look at a message. If you see <this> you might be one of the people we want to help. Oh, and you can also tell because you may or may not have lost all your mail older than 14 days." -- Boy, that's gonna hurt.

How Do We Get Their Mail Back?
Restoring from backup is the only way. Unless you can live with the solution of recovering deleted items into a newly created folder. We gave that to users as a temporary solution, but did not expect them to want to refile all that data. So, again.. restore from backup.  To top it all off we don't have that much room left on any servers.

These people are scattered all over the place with primary mailboxes in one database and secondary mailboxes in others. There are 72 databases involved. 16 are housing secondary mailboxes with a size of 325G or more. There is hardly a place anywhere to restore those databases. Luckily there were two servers with enough space to handle two of these databases each. The other 56 databases we got over two nights, putting one database on each server, where we could. (Did you remember that you can only have one Recovery Database mounted on a server at one time? I had to learn that one again!)
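
Setting up each recovery database goes roughly like this. The server name and paths are invented for the example, and the backup team restores the .edb and log files into that location before the mount.

```powershell
# MBX01 and the R:\ paths are hypothetical
New-MailboxDatabase -Recovery -Name DB01Recovery -Server MBX01 `
    -EdbFilePath 'R:\Recovery\DB01\DB01.edb' `
    -LogFolderPath 'R:\Recovery\DB01'

# Once the restored files are in place and the database is in a clean state
Mount-Database DB01Recovery

# When the restores are finished, free the slot for the next recovery database
Dismount-Database DB01Recovery -Confirm:$false
Remove-MailboxDatabase DB01Recovery -Confirm:$false
```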

How Do We Erase The Tag?
I was afraid this was going to be the hardest part, and it turned out to be the easiest.
"If thy policy offends thee, pluck it out!"

So we set the Retention Policy on all mailboxes to $Null
Then we removed the offending Policy Tag from the Retention Policy
Then we deleted the Policy Tag
Then we deleted the Retention Policy
Finally, we restarted the Mailbox Assistants service
Since there is no Policy Tag to reference, it removes the Tag from the message. One could have just as easily applied a new policy and it would have overwritten any existing Tags. We never tested that option though. We removed all the Policy Tags and started the Mailbox assistant service on the server with my mailbox. All the scary tags disappeared.
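
In shell terms, the steps above come out something like the sketch below. The policy name is made up. The tag name is borrowed from what I saw stamped on my folders, so substitute whatever yours are actually called.

```powershell
# Hypothetical names for illustration
$policyName = 'Corp Retention Policy'
$tagName    = '14 day delete (2 weeks)'

# 1. Clear the retention policy from every mailbox that has one
Get-Mailbox -ResultSize Unlimited |
    Where-Object { $_.RetentionPolicy -ne $null } |
    Set-Mailbox -RetentionPolicy $null

# 2. Unlink the offending tag from the policy, keeping the other tags
$policy = Get-RetentionPolicy $policyName
$keep   = $policy.RetentionPolicyTagLinks | Where-Object { $_.Name -ne $tagName }
Set-RetentionPolicy $policyName -RetentionPolicyTagLinks $keep

# 3. Delete the tag, then 4. delete the policy
Remove-RetentionPolicyTag $tagName -Confirm:$false
Remove-RetentionPolicy $policyName -Confirm:$false

# 5. Restart the Mailbox Assistants service (run locally on each mailbox server)
Restart-Service MSExchangeMailboxAssistants
```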

Fixing The Wrong:
First I wanted all the mailboxes in the database that had a secondary Archive mailbox. I sorted them because I wanted some kind of easy indicator of how far along the process had gotten. The alphabet was good enough for me.

# find only the people with Archive Mailboxes
$MBX = Get-Mailboxdatabase DB01 | get-mailbox -archive | Sort DisplayName

Next it was simply piping the results to the restore command:

$MBX | Restore-Mailbox -RecoveryDatabase 'DB01Recovery' -MaxThreads 25


Lesson Learned
Don't be too lazy: I even thought about creating a test Retention Policy and applying it to the requestor but I never did. I remember talking myself out of it -- "It's a personal tag, just do it and move on to the next task on that long list of things to do today."

You're never too busy to do something right the first time: I am reminded of this yet again. A few years from now, I'll get in a hurry again and Murphy's Law will hit me over the head. It can actually save you time to take the time to do it right. This screwing the pooch cost me many hours. Let's not forget the backup team, who had to jump through hoops to get me all these restores.