How would you back up a 300 TB database?

I got stumped by a question I was asked while on the road this week, and I thought I’d throw it out.  A user came up to me who has a 300 TB Oracle database that generates 3-4 TB of transaction logs a day.  They are currently unable to back it up.  At all.  When I suggested my usual approach, she shocked me by saying that they had already tried it.  Ouch.

The database in question resides on a V-Max.  First let’s talk about what we would do if we treated it like other databases.  To do a typical full backup to tape or disk in, say, a 12-hour window, we would need to back it up at 25 TB/hr.  That’s simply not going to happen.  While there are device configurations I can think of that would make a backup target fast enough to handle that, we couldn’t get the data to it in time.  We certainly can’t do it over IP, and I highly doubt whatever host they have the database on can handle 25 TB/hr (~7000 MB/s, BTW) through its backplane and CPU.
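
If you want to check my arithmetic, here’s a quick back-of-the-envelope sketch (the 12-hour window is an assumption I’m making for illustration, not something the customer specified):

```python
# Back-of-the-envelope math for a full backup of the database.
# Assumptions: 300 TB to move, a 12-hour window, decimal units (1 TB = 1,000,000 MB).

DB_SIZE_TB = 300
WINDOW_HOURS = 12

tb_per_hour = DB_SIZE_TB / WINDOW_HOURS
mb_per_second = DB_SIZE_TB * 1_000_000 / (WINDOW_HOURS * 3600)

print(f"Required rate: {tb_per_hour:.0f} TB/hr (~{mb_per_second:,.0f} MB/s)")
# -> Required rate: 25 TB/hr (~6,944 MB/s), i.e. the ~7000 MB/s figure above
```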

What if we did a server-free backup using the SCSI third-party copy (extended copy) command?  Then you’d only have to copy the data via the SAN to the target device.  The first problem is that the target device would still have to be monstrous.  The second problem is that I doubt the V-Max (although I could be wrong) could generate 7000 MB/s either.  But the final problem is actually a V-Max problem.  The user in question says that their transaction rate is so high that the cache on the V-Max fills up if they try to take a snapshot.  (I was surprised by that, as 3-4 TB a day is only about a 1% change rate.)  No snapshot, no server-free backup.  It also means no near-CDP either, since that approach is built on snapshots and replication.

“CDP!” I shouted.  She told me that EMC’s RecoverPoint maxes out at 150 TB.  So much for that.

I told her that there are other CDP solutions out there and that they should look at them, but oddly enough “startup” products are really hard to get into this company.  (I did give her a really hard time at that point.  A product from a startup that solves the problem is still better than nothing, which is what they have now.)

The really sad thing is that this database couldn’t be more mission critical to their business.  If it ever died, they’d be in a serious world of hurt.

If you think you know how to solve this problem, shout out.  As for me, I’m a bit stymied at the moment.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

26 comments
  • It sounds like the problem is not finding a protection solution, but with the design of the thing in the first place.

    I find it very hard to believe that a DB that large cannot be factored into several smaller DBs.

    Once they do that, then protecting it comes back under the umbrella of standard solutions.

  • To back it up (not DR), a maxed-out STK/Sun/Oracle SL8500 will do about 190 TB/hour. Once it is backed up, I guess you could go old school and ship tapes. The logs could be carried by a single OC-768 line (39,813 Mb/sec) in a 24-hour period. The 100 Gbps (and 40 Gbps) Ethernet standard should be out in June too. That would get you more than double the OC-768 speed. Like OC-768, those interfaces will probably be expensive too.

    It would be interesting to see what breaks as you push all this stuff to the performance edge. OC-768 is probably not cheap, and may not even be available.
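
    To put rough numbers on the link speeds (simple arithmetic using the 39,813 Mb/sec figure above, treating line rate as raw bandwidth with no protocol overhead):

```python
# Raw data a link can move in 24 hours at a given line rate.
# Assumptions: decimal units and zero protocol overhead.

def tb_per_day(line_rate_mbps: float) -> float:
    """Raw capacity over 24 hours, in TB (1 TB = 8,000,000 Mb)."""
    return line_rate_mbps * 86_400 / 8_000_000

print(f"OC-768 : ~{tb_per_day(39_813):.0f} TB/day")   # ~430 TB/day
print(f"100 GbE: ~{tb_per_day(100_000):.0f} TB/day")  # ~1,080 TB/day
# Either link easily carries 3-4 TB/day of logs; even a full 300 TB copy
# would fit in a day at OC-768 speeds, overhead aside.
```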

    Maybe it would be better to look at the db design and operation.

    A USP-V can push more than 8000MB/sec based on SPC-2 results so I’d suspect a huge V-Max could do at least as well.

  • Most people concentrate on space, not throughput. In this case, I/O is the real problem. I think you can replicate it somewhere else; because it’s a critical database, that applies here. For the backups I’d choose a real VTL, preferably a VTS, so you send all the data to disk first and then to tape, perhaps in parallel.
    Software like Hitachi Data Protection Suite or Tivoli Storage Manager can do “incremental forever,” and you can retain the last 2 days in the VTL so you can restore from disk.
    For replication, I always recommend using the simplest mechanism possible, not RAC or things like that.

  • The 150TB limit is true for RecoverPoint/SE, but the normal RecoverPoint product (which they would need for any environment that is not 100% Windows or 100% Clariion) has a 600TB limit. They can scale the RP cluster up to 8 nodes to deal with IO/BW issues if any and consistency groups can exist across multiple appliances in the cluster. The customer should work with their EMC resources to find out if RP will work or not. The biggest concern I would have is not the size of the DB but the # and size of LUNs involved since they’d have to be part of a single consistency group to maintain filesystem and DB consistency.

    If Timefinder/snaps are causing cache issues due to the database size/IO performance, RecoverPoint may be the best way to get backups while keeping the load off the VMax itself. I assume they have a fully populated VMax (maximum cache size), otherwise they could look at adding VMax engines to increase cache and CPU resources.

  • Something that large is simply not going to be moved from one place to another efficiently. As Curtis says, 7 GB/s is just not practical unless you design for it from the get-go. Because this customer is where they are, it’s clear they didn’t design for this. But don’t forget that this is just the backup goal. What about the recovery? You have to design for recovery first if the data is the lifeblood of the company. If you can’t get 7 GB/s of backup throughput, I doubt very much that you will get even close to that on the recovery side. I suspect that they might be OK with an 8-hour backup window, but not with an 8-hour recovery window (RTO/RPO, anyone?).

    The only practical way to protect this database is not to move data, but rather to make safe point-in-time copies. Those would be your primary recovery data sets. Those same copies could also be used to move the data off-site via array-based replication. The off-site copies would be your "backup" copies, and could be used as the source for tape copies if longer-term retention is required.

    Unfortunately, they have hit a design limit with the vMax architecture and snapshots. A number of other companies have licked this problem: NetApp, Compellent, 3PAR, etc. Of those, I would very much not call NetApp a "startup," and the others are probably well beyond that stage as well. They all have substantially better snapshot architectures and dramatically simpler replication models. To make it even more fun, since this is an Oracle database, it might be wise to consider NFS players in this model: Oracle’s own Sun 7000, or BlueArc’s solutions (also available through HDS as HNAS). Oracle natively supports NFS, which in many cases makes it easier to scale than block protocols. Each of those two products also has a very efficient snapshot model, easy scalability, and robust replication technologies. In the case of BlueArc, I’d venture to say it might even outperform the vMax if architected properly.

    The net is that there’s a ton of ways of skinning this cat. You just can’t call it a backup problem any more and you have to look outside the EMC comfort zone. Not saying anything bad about EMC. They have great products, but they don’t have the only products. And in many cases they don’t have the right product.

  • It’s funny, I always mean “backup and recover” when I say backup, but I know not everyone does. As to the vendors you named, I wasn’t thinking of going down that route yet. I was wondering if there was a way to solve it in place, without forcing them to throw away the V-Max and buy something else. The one CDP vendor that I think MIGHT have the chops to tackle this is InMage, but they’re definitely a startup. That’s who I was thinking about.

    It’s also come to light that the 150 TB limit is no more and that RecoverPoint might be a solution as well.

    But I agree that this should have been a design consideration from the get-go.

  • So… I was going to draft a NetApp-focused response. (I’m an engineer for a VAR that carries multiple storage lines, including EMC, ironically, and NetApp seemed like the best fit, especially as the customer was already going directly to EMC and is presumably getting more resources/expertise than I could bring to bear on the EMC side.)

    But… I found a NetApp blogger beat me to it.

    http://blogs.netapp.com/dropzone/2010/05/large-scale-data-protection.html

    In short, zero-performance-penalty snapshots plus replication would be the solution here (kind of along the lines of a recent blog post here, actually). If there’s a need to preserve the V-Max investment, a NetApp V-Series would make sense.

  • Juan Orlandini’s comment pretty much covers it. The answer is to use a storage array with better snapshot and remote-copy capabilities. That isn’t what this customer wants to hear, considering that they just implemented their vMax, but they should have thought through this “little” problem before making a decision that is going to make them nervous for the next few years. Yikes!

    I work for 3PAR and we have customers with enormous databases protecting them using our Virtual Copy and Remote Copy Software.

  • Curtis,

    HDS is using InMage, and CommVault Simpana can do backup/CDP/CDR. But remember, with CDP you need additional storage (in some cases even two additional devices) to process the collected data and then to replicate it (if you are in that scenario).

  • None of that is news to me, but it might be to some readers. Thanks for your input.

  • I can’t tell for sure, but the symptoms sound very much like an under-cached system. Perhaps the change rate is larger than expected, or perhaps the database has grown larger than planned – in either case, the solution may be as simple as adding more memory to the array.

    If you’d like, ask the customer to drop me an email, and I’ll get someone to look into her system and determine whether or not this could be the solution.

  • Juan Orlandini made a good point: think outside the box, the one you are currently using. If you still don’t feel comfortable just protecting the data inside the same storage using snapshots, there are options. For example, with NetApp serving the Oracle DB over NFS, the DBA can script putting the DB in hot backup mode (or take a cold backup) and then take a snapshot; it takes very little time. You can even SnapMirror the DB to another site for DR. There is no need to do a restore in case of data loss, since the DR copy can be cloned as a read/write volume in less than 2 minutes and brought online right away.
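
    A minimal sketch of what such a script might look like, assuming an ONTAP 7-mode filer reachable over SSH; the filer name, volume name, and snapshot name are placeholders, and a real script would also handle archived logs and error cases:

```python
# Hypothetical sketch: quiesce Oracle with hot backup mode, then take a
# NetApp snapshot of the NFS volume holding the datafiles.
import subprocess

FILER = "filer1"        # placeholder NetApp controller
VOLUME = "oradata"      # placeholder volume with the Oracle datafiles
SNAP_NAME = "nightly"   # placeholder snapshot name

def sqlplus(statement: str) -> None:
    """Run a single SQL statement as SYSDBA via sqlplus."""
    subprocess.run(
        ["sqlplus", "-s", "/ as sysdba"],
        input=f"{statement};\nexit;\n",
        text=True,
        check=True,
    )

sqlplus("ALTER DATABASE BEGIN BACKUP")          # freeze datafile headers
try:
    # ONTAP 7-mode style snapshot; adjust for your filer/OS version
    subprocess.run(
        ["ssh", FILER, "snap", "create", VOLUME, SNAP_NAME],
        check=True,
    )
finally:
    sqlplus("ALTER DATABASE END BACKUP")        # resume normal operation
```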

  • Having investigated this, the system is running a larger workload than it was originally sized for; Barry Burke’s initial read of the situation was accurate.

    At the time this was posted, the customer and EMC were already working on resizing the system to deal with the increased transaction-processing requirements and to account for the higher rate of growth.

  • Before even trying to talk ‘solution’, don’t you have to ask what problem you are trying to solve? What are you trying to protect yourself from (Full-scale disaster? Munched table? Storage failure? Oops-factor?). Once you have established an answer to this, start looking at recovery requirements. You say the DB is vital to the company, but can they survive an hour outage? Two days? A month? Finally, what restore services do they need? Should a DBA be able to ask for a restored copy to run tests? Or is this disaster-only?

  • Obviously every solution design must start with requirements. I was merely posing the question to see what kind of ideas people had about how to do it at all. Their current backup method (when I talked to them) was “none.” There isn’t a requirement in the world that “none” would meet.

  • As Barry mentioned, it could be as simple to resolve as adding additional memory. In regards to RecoverPoint, utilizing RecoverPoint CDP (which doesn’t require a separate storage array) or CRR would provide the best data protection: RecoverPoint provides continuous data protection without impacting the Oracle database, unlike most other CDP solutions, which require an agent to be installed on the host where the database resides.

    With RecoverPoint you could configure it to hold, say, 5 days’ worth of journals, thus removing the requirement to perform daily backups, and each week I would take the CDP or CRR copy of the data and mount it to your favorite de-dupe/backup solution.
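
    As a rough sizing illustration (assuming the 3-4 TB/day of change lands in the journal roughly 1:1, ignoring RecoverPoint’s compression and metadata overhead):

```python
# Rough journal sizing for a few days of CDP history.
CHANGE_TB_PER_DAY_LOW, CHANGE_TB_PER_DAY_HIGH = 3, 4   # estimate from the post
DAYS_OF_JOURNAL = 5

low = CHANGE_TB_PER_DAY_LOW * DAYS_OF_JOURNAL
high = CHANGE_TB_PER_DAY_HIGH * DAYS_OF_JOURNAL
print(f"Journal space for {DAYS_OF_JOURNAL} days: ~{low}-{high} TB")
# -> ~15-20 TB of journal, small relative to the 300 TB production copy
```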

    As you mentioned, you must start with the requirements in order to design the correct solution. As operating systems continue to evolve and virtualization continues to grow (physical server consolidation and SAN consolidation), customers need to be educated on how they will recover, move data off-site, and test disaster recovery for IT audits without impacting production.

  • Hi Curtis,

    I’m a little surprised no storage specialist/architect has yet come up with the answer I think is correct.

    Having a single environment sized at 300 TB screams for an HPC solution. As a Storage Architect specialized in IBM hardware/software, GPFS is what pops up in my mind.

    First, have the Oracle database reside on a GPFS filesystem. You can throw whatever hardware you want at GPFS until you reach the 300+ TB size and the GB/s performance. You can choose to scale out (e.g. using DS3400s or DS5300s) or scale up (XIV comes to mind). Of course, you could use comparable EMC storage products too. Either way you would need something like 32 or 64 GPFS nodes, although an experienced GPFS architect could compute the minimum needs much more thoroughly.

    I think the scale-out option could be cheapest, using DS3400s each with one expansion enclosure. I’d use RAID 10, although RAID 5 is defensible and cheaper (never mind my English writing skills). Each DS3400 would provide 2.8 TB in 4U using twenty-four 300 GB SAS disks, two of them as hot spares and the rest in RAID 10. In total you would need a whopping 110 DS3400s in 11 T42 racks. You’d need almost no SAN switches if you connect each DS3400 directly to two GPFS nodes (for availability). Each GPFS node can connect to four different DS3400s using two QLE2464 (4-port 4 Gbps) adapters. This way 32 GPFS nodes can provide the customer with both the capacity and the performance.

    You could also scale up using IBM XIV, of which you’d still need four or five complete systems to get 300 TB of usable capacity. You’d probably also need SAN switches to connect the GPFS nodes to the XIVs, likely two Brocade DCXs populated with FC8-32 blades.

    For backup, each node would need a QLE2462 (2-port 4 Gbps HBA), totaling 64 ports. You could use one Brocade DCX with four FC8-32 blades, providing 64 ports for the GPFS nodes and 64 ports for LTO5 tape drives. The 64 LTO5 tape drives would reside in a TS3500 library of at least 6 frames (each frame maxes out at 12 drives).
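
    A quick sanity check on that drive count (assuming LTO5 native throughput of roughly 140 MB/s, no compression, and no bottleneck upstream of the drives):

```python
# Can 64 LTO5 drives move 300 TB in a reasonable window?
DRIVES = 64
MB_PER_SEC_PER_DRIVE = 140      # assumed LTO5 native rate
DB_SIZE_TB = 300

aggregate_mb_s = DRIVES * MB_PER_SEC_PER_DRIVE
hours = DB_SIZE_TB * 1_000_000 / aggregate_mb_s / 3600
print(f"Aggregate: {aggregate_mb_s:,} MB/s -> full backup in ~{hours:.1f} hours")
# -> 8,960 MB/s, roughly 9.3 hours for a 300 TB full
```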

    How you would arrange for a nice setup of Oracle servers is a different question. Probably you’d use a fourth HBA (QLE2462/QLE2562) in each GPFS node connected to another DCX (or two for availability). The Oracle servers would also connect to this DCX.

    To me this is a complete hardware architecture on which the Oracle database may reside, perform and also might be backed up in a timely manner.

  • @Richard Anderson:
    RP supports consistency groups across multiple clusters.

    As to the snapshot problem:
    Using snapshots while having such a change rate is insane. Use BCVs or Clones.

  • @SebastianR

    Can’t say I agree with your statement about snapshots. First, the change rate is only 1-2% (3-4 TB daily on a 300 TB database). Second, it sounds like you’re assuming that you couldn’t take snapshots if you have a very high change rate, or that the change rate would require too much storage for the delta blocks. I’d say the former is implementation-specific (some products could do it; others couldn’t), and I’d say the latter is a call for the customer to make. Those 3-4 TB a day of changes have to go somewhere, and a delta-level snapshot is one of the most efficient ways to store them. You’d have to compare the cost of doing that to putting daily backups on tape or deduped storage, not to mention the infeasibility of doing daily backups (which was the point of the post).
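
    To illustrate the space comparison (assuming the deltas consume roughly the logical change amount, which is optimistic for track-based implementations, as a later comment points out):

```python
# Space to keep one week of recovery points: snapshot deltas vs. daily fulls.
DB_SIZE_TB = 300
CHANGE_TB_PER_DAY = 4        # upper end of the 3-4 TB/day estimate
DAYS = 7

snapshot_space = CHANGE_TB_PER_DAY * DAYS   # delta blocks only
daily_fulls = DB_SIZE_TB * DAYS             # one full copy per day

print(f"7 days of snapshot deltas: ~{snapshot_space} TB")   # ~28 TB
print(f"7 daily full backups:      ~{daily_fulls:,} TB")    # ~2,100 TB
```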

  • @Alex Sons

    So you think that a proper backup solution involves a bunch of tape drives? Can’t say I agree with that either. You might get the daily backup done, but a restore would take a significant amount of time that is most likely not going to meet the customer’s RTO/RPO requirements. IMO, a system of this size should ONLY be backed up by a CDP or near-CDP type solution.

  • 3-4 TB of changes does not necessarily translate to 3-4 TB of used disk space on SAVE devices.

    It is surprising that the cache would fill up; maybe they are talking about the snap cache, i.e. the SAVE devices. Again, a 4K change on the source volume could easily translate into a 64K change on the SAVE device, because the copy is track based and not block based.
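
    To put a number on that worst case (assuming 4 KB changes, 64 KB tracks, and every changed block landing on a previously untouched track):

```python
# Worst-case copy-on-first-write amplification: 64 KB tracks vs. 4 KB changes.
TRACK_KB = 64
CHANGE_KB = 4
CHANGED_TB_PER_DAY = 4       # logical changes, upper estimate

amplification = TRACK_KB / CHANGE_KB
save_tb_per_day = CHANGED_TB_PER_DAY * amplification
print(f"Up to {amplification:.0f}x amplification -> ~{save_tb_per_day:.0f} TB/day of SAVE space")
# Locality of changes keeps real numbers well below this worst case.
```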

    And while I respect your knowledge about backup, which is impressive, your pondering whether the VMax can push 7000 MB/s raises eyebrows, to say the least.

    While you would have to allocate a significant amount of resources, it is far from saturating the array. 2 directors should be able to push this easily.

    However, backing up 300 TB of data off of the source volumes and snap cache and expecting that kind of performance borders on the naive. Are they writing to the virtual image, too? Integrity checks or rolling logs, perhaps?

    If you want that sort of throughput you need dedicated resources, i.e. dedicated spindles and directors.

  • My experience has been that 3-4 TB of Oracle redo logs actually results in much less than 3-4 TB of changes at the block level. Your experience may be different.

    I’m not a V-Max expert (I’m not even a novice), but if it can do 7000 MB/s you can color me impressed. I know previous generations of large disk arrays have consistently disappointed me with their throughput for large backups, and their respective vendors always seem surprised, but I’ll take your comment to mean that you’re sure that it can.

    I’m not sure what you mean by the other comments.

    Please remember that this problem started with a real customer who was completely unable to meet their backup and recovery requirements using every technique EMC was throwing at them, and I was simply asking how others would do it. That was it.

  • SebastianR

    Push those eyebrows back down, my friend. 😉

    The problem is not so much the VMax and its ability to push 7000 MB/s. It’s a combination of LUN size, disk distribution, and Oracle’s ability to get the data to the backup software via RMAN. I was talking this afternoon with someone working on a very similar config.

    In their case they have a 330 TB Oracle database spread across THREE V-Maxes. They did this because they believed the processing load of the database required it. They’ve been working on a tape solution for it; they’ve connected 10 tape drives to each of the V-Maxes and are able to get about 1500 MB/s out of each V-Max to those 10 tape drives, using an Oracle multiplexing setting of 32 and no multiplexing on the NetBackup side. (I’ve asked them to enable multiplexing on the NBU side, because at 150 MB/s per drive they’re not quite pushing the LTO5 drives to their limit, but there appears to be some resistance to that.)
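
    Working those numbers through (straight arithmetic from the figures above, decimal units):

```python
# Aggregate tape throughput across the three arrays in that config.
DRIVES_PER_ARRAY = 10
MB_S_PER_DRIVE = 150         # observed per-drive rate in this setup
ARRAYS = 3
DB_SIZE_TB = 330

aggregate_mb_s = DRIVES_PER_ARRAY * MB_S_PER_DRIVE * ARRAYS
hours = DB_SIZE_TB * 1_000_000 / aggregate_mb_s / 3600
print(f"{aggregate_mb_s:,} MB/s aggregate -> ~{hours:.0f} hours for a 330 TB full")
# -> 4,500 MB/s, roughly 20 hours; well short of the ~7,000 MB/s ideal
```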

    I just wanted to post this to show that I was not being naive. It’s a real problem trying to generate that kind of throughput for a backup.

  • Another possible solution is to move the DB to Exadata, then use HCC (Hybrid Columnar Compression) and Active Data Guard to replicate to another Exadata. This also solves the problem of how in the world they are going to restore that 300 TB DB, if needed, in a reasonable amount of time.

  • Curtis,

    As long as “money is no object,” a very robust solution exists for your situation. There is a Texas company, Texas Memory Systems (www.ramsan.com), that has a product called the RamSan 6300, which stores 140 TB of SLC flash in a single 42U rack. You buy three of them and you can back up a 300 TB database with room to spare.

    Their published performance numbers are 14 million sustained IOPS and 140 GB/sec of sustained “random bandwidth” per 140 TB array.

    Ultimately, you will want to host the application on this equipment and then buy a second 3-cabinet setup for a backup/mirror copy, but for now you should be good to go!

    If you do go this route I would be interested to see what your real-world performance numbers are using this solution.

    Best of luck,

    John