CDP (and friends) are back!

CDP kind of came and went like SSPs.  Some said it stood for “Customers Didn’t Purchase.”  Others saw it as the Star Trek of the backup world – a great idea before its time.
Storage Magazine asked me to write a feature piece on it, and this is that piece.  I’m back to pushing the ideas of CDP (and near-CDP — or, if you must, snapshots and replication) as what the world needs to do to make backup all better.

We also talked about this on this week’s Storage Monkey’s Podcast, #62, with Chris Poelker of FalconStor.



Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

2 comments
  • When does “continuous” not mean continuous?

    Way back in high school, in the dark ages when programs were on punch cards and paper tape, I learned that the way to find the area underneath a curve – a “continuous function” – was to create a lot of small rectangles and fit them under the curve. The more rectangles you created, the closer you got to the correct answer.
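
    To make the analogy concrete, here is a minimal sketch of that rectangle approximation (plain Python; the curve, interval, and rectangle counts are arbitrary examples): as the number of rectangles grows, the estimate converges on the true area.

    ```python
    # Rectangle (Riemann-sum) approximation of the area under a curve.
    # The curve, interval, and rectangle counts below are arbitrary examples.

    def area_under_curve(f, a, b, rectangles):
        """Approximate the integral of f over [a, b] with equal-width rectangles."""
        width = (b - a) / rectangles
        # Sample f at the midpoint of each rectangle and add up the areas.
        return sum(f(a + (i + 0.5) * width) * width for i in range(rectangles))

    curve = lambda x: x * x   # true area over [0, 1] is 1/3
    for n in (4, 16, 64, 256, 1024):
        print(f"{n:5d} rectangles -> {area_under_curve(curve, 0.0, 1.0, n):.6f}")
    # Output approaches 0.333333 as the rectangle count grows.
    ```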

    My view of “continuous DP” is similar to that method, known as “Simpson’s Rule.” The question becomes one of granularity – “What is the granularity NEEDED for the restore or clone view of data?”

    I submit to you that the needed granularity is different in each case, so complete CDP requires a number of tools in the toolset; otherwise, you are left with point solutions that work well.

    Allow me to illustrate this point:

    For databases, transaction-based CDP works well. If a database can be “rolled back” to the granularity of an individual transaction, this accomplishes business objectives, as long as any referenced external objects are concurrently available.

    Email systems are specialized databases – the transaction becomes the email. Would any increment other than “an email” be relevant?

    Workstation users of file services expect the file “as it was last saved,” so the saved file becomes the unit of CDP granularity.

    Perhaps the best illustration of where True CDP is needed is the storage of telemetry or video data – data collected in “real time” is written to disk as it is collected. There is value in collecting data right up until the moment of system failure, usually to help determine the cause of the failure.

    There are point solutions for all of these scenarios – many of them work well for the intended purpose, but have severe limitations for other purposes that make them undesirable.

    The ONLY way that “True CDP” can be done is exactly as you describe – simultaneous writes to disparate systems, with change logs. This would also require a file system that doesn’t overwrite existing data as it writes new data, a tool to “roll back” the state of the data to a specific date and time without affecting the data written after that time, and a host-based tool to make sure that the host-based file system is “fixed” to look like valid data.
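
    To make that concrete, here is a minimal, hypothetical sketch (plain Python, not any shipping product) of the journaling idea: writes go to an append-only, time-stamped log instead of overwriting in place, so the state of any block at a chosen moment can be presented without touching the data written after that moment. A real implementation would also mirror those writes to a separate system and repair the host file system’s view, as described above.

    ```python
    import time
    from collections import defaultdict

    class JournaledStore:
        """Toy 'true CDP' store: writes are appended to a journal, never overwritten."""

        def __init__(self):
            # For each block address, keep (timestamp, data) entries in write order.
            self._journal = defaultdict(list)

        def write(self, block, data, timestamp=None):
            """Record a new version of a block; all prior versions are retained."""
            ts = time.time() if timestamp is None else timestamp
            self._journal[block].append((ts, data))

        def read_as_of(self, block, timestamp):
            """Return the block as it existed at `timestamp`.

            Writes made after `timestamp` are ignored, not deleted, so presenting
            a rolled-back view never disturbs the data written afterward.
            """
            result = None
            for ts, data in self._journal[block]:
                if ts <= timestamp:
                    result = data
                else:
                    break
            return result

    # Example: block 7 is rewritten at t=200, yet its t=100 state is still recoverable.
    store = JournaledStore()
    store.write(7, b"first version", timestamp=50)
    store.write(7, b"second version", timestamp=200)
    print(store.read_as_of(7, 100))   # b'first version'
    print(store.read_as_of(7, 300))   # b'second version'
    ```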

    There are a few of these on the market that use such an underlying file system – ZFS and WAFL come to mind – and they are primarily NAS-based. They’re supported by Very Large companies and have many person-years of thought and development behind them, but their vendors provide only snapshot support – true generalized CDP hasn’t worked commercially because there are “good enough” solutions out there that are much less expensive.

    The lesson of Simpson’s Rule – that you can get VERY close to the correct answer by sampling a function’s values at intervals instead of evaluating the function itself continuously – is being proven every day in IT shops all over the world. The winning solutions in the marketplace get those intervals close enough to address recovery issues for the underlying applications today. Except in outlying cases, there is little value in storage-based, truly continuous CDP until the applications (and operating systems and file systems) become aware of how to treat the multiple copies of data.

    Disclosure: I work for NetApp.

    Mickey

  • Great article, Curtis. I like Mickey’s perspective on the various tools of CDP and their relevance to individual applications or business recovery requirements. You mentioned the array of tools, or functions, used by the current CDP product offerings: mirror, snapshots, journal, application awareness, replication, and some type of manual or automated recovery process. It’s this suite of functionality in today’s CDP offerings that makes them relevant and applicable in this mixed, physical and virtual, server world. Perhaps now, like Star Trek TNG, CDP will get the attention it deserves. Also appreciate Mickey