Written by W. Curtis Preston
Friday, 20 March 2009 14:46
As you’re probably aware, my blog post “The real deal on the 3D4000” drew some rather harsh criticism from Mark Twomey of EMC. He didn't give a title, but he said that he is the "owner of every DL in EMEA marked engineering sample since product introduction. Setter upper of systems from the cardboard box to production." The only thing I got when I asked EMC was that he was "in the sales organization." Apparently I'm the last person to figure this out, but Mark Twomey is Storagezilla. I have read his blog before, but did not connect the two. His blog says that he is the "Information Protection Subject Matter Expert for Ireland."
Mark and I have never spoken or met before, but he claims to be “the voice of authority,” and he made several statements that can be summarized as “Preston has no idea what he’s talking about.” Since my credibility is the only thing I have going for me, I felt it was important to make a second post that proves that what I said was true.
I finished this response several days ago, but let it stew for a while. The first reason is that I bear no ill will to EMC for Mark’s rather personal comments, and I don't want to come across as anti-EMC. I've got no axe to grind. I’ve also been contacted by EMC representatives and we will continue (offline) the dialogue that we started before I made that post. I am not an “anybody but EMC” person (as Mark suggested). On the contrary, I like a number of products that they put out and think they’ve done very well in this industry.
BUT…I don’t believe that I have to like every product that they put out.
I also don’t have a major problem with the 3D line, or even with buying a 3D 4000 to go behind an existing EDL 4000. What I don’t understand (for reasons I have explained) is why anyone would buy a NEW EDL/3D 4000 combo. It looks like it costs more and offers far less value than any of the alternatives, so I’m simply at a loss as to why someone would buy it over a competing solution.
I even bear no ill will to Mark. He’s just a passionate evangelist for his company’s product.
BUT… I don’t think I should sit idly by while somebody says I have no idea what I’m talking about.
SO… I’m going to do my best to present Mark’s statements and my rebuttal to them as emotionlessly as possible. I’ll say what he said; then I’ll prove my point with quotes from the EMC 3D 4000 manual that someone sent me after they saw what Mark was saying.
My response
The first thing that Mark said in his post was that “this idea of Front End/Back End you have is wrong.”
The following is a quote from the 3D 4000 manual:
“The 3D 4000 appliance contains ... VTLs [that] are presented to the EDL as back-end physical libraries.”
So the manual actually uses my term “back-end 3D 4000.” I would argue a “back-end” implies a “front-end.” So my terminology seems OK. The other problem that Mark seemed to have with that description is how I said the data moved from the front-end to the back-end.
The manual says:
“tapes with matching barcodes are created across the 3D 4000 back-end appliance and linked to those 3D/3DR enabled virtual tapes [in the EDL]. ... Data migration is the process of transferring data written on a 3D/3DR enabled virtual tape on the EDL to its corresponding virtual tape on the back-end 3D 4000.”
Anyone familiar with the EDL (or any VTL with back-end tape) knows that the behavior described in those sentences exactly matches how an EDL (or any VTL with back-end tape) copies virtual tapes to physical tapes. Typically the VTL inventories the physical library, and it creates virtual tapes that have the same bar codes as the physical tapes it finds in the physical library. After you back up to virtual tape ABC123, the VTL copies virtual tape ABC123 to physical tape ABC123 when you tell it to. The only thing different in this scenario is that both sets of “tapes” are virtual, so they can be created at the same time with matching bar codes.
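The barcode-matching behavior described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the two dictionaries stand in for the front-end EDL and the back-end library, and the function names `link_virtual_tapes` and `migrate` are hypothetical, not anything from the EMC manual.

```python
# Toy model: each "library" is a dict mapping a barcode to that tape's contents.

def link_virtual_tapes(back_end):
    """Create one empty front-end virtual tape per barcode found in the back end."""
    return {barcode: bytearray() for barcode in back_end}


def migrate(front_end, back_end, barcode):
    """Copy a front-end virtual tape to the back-end tape with the same barcode."""
    back_end[barcode] = bytes(front_end[barcode])


if __name__ == "__main__":
    back_end = {"ABC123": b""}                 # inventory of the back-end library
    front_end = link_virtual_tapes(back_end)   # matching barcodes on the front end
    front_end["ABC123"].extend(b"backup data") # the backup lands on the front end
    migrate(front_end, back_end, "ABC123")     # copied only when told to
    print(back_end["ABC123"])
```

After the migration both dictionaries hold the data for ABC123, which is the "data resides on both" state that comes up later in this post.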
Let’s Review. It seems that there is a front-end and back-end VTL, and it seems that the way that data is moved between them exactly matches how I described it.
Mark then said that “data is not moved “multiple times” as you incorrectly state it is moved once from the native pool to the block pool.”
I stand by my original comment about data being moved “multiple times,” but I’ll agree that my wording of the original post made it seem like I was saying it’s worse than it really is. Let’s review the different dedupe architectures to see why I still say it is moved multiple times. (More specifically that it is moved around more times with the 3D 4000 than any other dedupe system.)
Full Inline: Native data is never written to disk. Only deduped data is written to disk. Assuming a 20:1 dedupe ratio, only 5% of the data is ever written to disk; the rest is discarded. In other words, for every 100 TB that comes into the box, 5 TB will actually be written/read to/from disk.
Full Post-Process: Native data is always written to a cache disk. All data is then read from that disk, deduped data is written to disk, and native data is then deleted from disk. So, 100% is written to disk, 100% is read from disk, 5% is written to disk (deduped data), and 100% is then deleted. (I don’t know how much actual I/O deletion from disk costs. It’s not 100%, but it’s not 0% either. I’m going to go with 10%.) From a disk I/O perspective, that’s 215% -- 210 percentage points more I/O for post-process over inline. In other words, for every 100 TB that comes into the box, 215 TB will be written/read to/from disk.
Adaptive: Quantum (and by extension the 3D line) is a bit different. It does write 100% of the native data to disk. (We’ll examine later quotes from the manual that back this up.) The dedupe process, though, tries to read as much data from RAM as it can. Since data is coming in faster than the dedupe process can dedupe, though, it can’t get it all from RAM. How much data has to be read from disk is anybody’s guess, but the 3D 4000 manual says that “The ingest rate (write throughput) of the DL3D appliance, as a rule is faster than the rate at which the written data is de-duplicated … [so] it is possible to exhaust all available space before sufficient space is reclaimed by de-duplication and truncation.” That would suggest that a significant portion of the data to be deduped will be read from disk. I’m guessing at least 75%. Then 5% (assuming 20:1) of the data is written to the block pool, followed by an eventual delete of the cached data. I’ll stick with my 10% math from the post-process statement above. So that’s 100% write (of native data), 75% read (read of native data from disk), 5% written to disk (dedupe data), and 10% write (delete of native data). From a disk I/O perspective that’s 190% -- 185 percentage points more I/O for adaptive over inline. In other words, for every 100 TB of data backed up to a Quantum box, 190 TB is written/read to/from disk.
3D 4000: In addition to the 190% I/O listed above, all data is initially written to the EDL and then read from the EDL so it can be written to the back-end 3D 4000. That’s 100% write (to the EDL), 100% read (from the EDL), all of which is added to the 190% described above. That makes a total of 390%: 385 percentage points more than a true inline system, 175 more than a typical post-process system and 200 more than a pure Quantum box (or 1500/3000 from EMC). In other words, for every 100 TB of data that comes into the 3D 4000, 390 TB will be written/read to/from disk.
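For anyone who wants to check the arithmetic, here is a minimal Python sketch that adds up the disk I/O components listed above for each architecture. Every input is an estimate from the text (a 20:1 dedupe ratio, a delete costing 10%, and 75% of the adaptive reads coming from disk), so the output is only as good as those guesses. The names `io_percent`, `DELETE_COST` and so on are made up for illustration.

```python
# Disk I/O per architecture, as a percentage of the native data ingested.
# Every input below is an estimate taken from the text, not a measurement.
DEDUPE_RATIO = 20          # assumed 20:1 dedupe ratio
DELETE_COST = 10.0         # guessed I/O cost of deleting native data, in percent
ADAPTIVE_DISK_READ = 75.0  # guessed share of native data re-read from disk, in percent

DEDUPED_WRITE = 100.0 / DEDUPE_RATIO  # unique data written to the block pool


def io_percent(architecture: str) -> float:
    """Return total disk I/O as a percentage of ingested data (hypothetical helper)."""
    if architecture == "inline":
        # Only the deduped data ever touches disk.
        return DEDUPED_WRITE
    if architecture == "post_process":
        # Native write + native read + deduped write + delete.
        return 100.0 + 100.0 + DEDUPED_WRITE + DELETE_COST
    if architecture == "adaptive":
        # Native write + partial read from disk + deduped write + delete.
        return 100.0 + ADAPTIVE_DISK_READ + DEDUPED_WRITE + DELETE_COST
    if architecture == "3d_4000":
        # Write to the EDL + read from the EDL, then the adaptive path.
        return 100.0 + 100.0 + io_percent("adaptive")
    raise ValueError(f"unknown architecture: {architecture}")


if __name__ == "__main__":
    for name in ("inline", "post_process", "adaptive", "3d_4000"):
        print(f"{name}: {io_percent(name):.0f}% of ingested data")
```

With these inputs the sketch prints 5%, 215%, 190% and 390%. Change the 75% or 10% guesses and the gaps between the architectures move, but the ordering does not.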
All I’m saying is that the additional I/O required by the 3D 4000 architecture isn’t free. It requires additional I/O paths and additional CPU and RAM resources to make it happen.
Let’s review. All data that will be deduped is first written to the front-end EDL. Then it’s read from the EDL and copied to the back-end 3DL and written (a second time) in its entirety to disk. Much of the data to be deduped will then be read (again) from disk (and not from RAM). This represents another disk read process. Finally, data that is determined to be unique will then be written (a third time) to the dedupe pool. So it appears that data is moved multiple times (175 percentage points more I/O than a traditional post-processing system).
Mark also said that “There is no cached copy.”
The 3D 4000 manual states the following:
“The DL3D appliance also reduces the amount of file system capacity by truncating the de-duplicated data. Once the de-duplicated data is truncated, only the metadata is available on the file system. This reduces the amount of capacity required in the file system. Once truncated, the file must be reconstructed using its tag before you are able to access the file. ... Truncation is deferred until the file system used capacity reaches 70%. This aids in the optimal use of available capacity on the DL3D appliance by using truncation to reclaim space only when necessary. The deferred truncation is intended for helping to improve the performance of applications that read the backed up data.”
So what data is this that it is truncating (AKA deleting)? It’s obviously not the deduped backed up data (the wording in the manual confuses the question), as you wouldn’t be able to restore anything after that. It says that after data is truncated, restore performance decreases because you must rebuild the data from the tags. (IOW, you must “rehydrate” or “redupe” the data in order to restore it.) Therefore the data that is being truncated must be the ORIGINAL/NATIVE data, which is being deleted “only when necessary” to “improve the performance of applications that read the backed up data.” (Like restore.)
So, the DL3D 4000 is storing a non-deduped (native) copy of the data for the purposes of increasing restore performance. If that’s not a cache, I don’t know what is.
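The deferred truncation the manual describes behaves like a cache with a high-water mark, which a short Python sketch can illustrate. This is a toy model, not the DL3D implementation: the `NativePool` class is hypothetical, it assumes the oldest native copies are truncated first (the manual does not say), and it assumes every file has already been deduped by the time it becomes a truncation candidate.

```python
from collections import OrderedDict


class NativePool:
    """Toy model of deferred truncation; names and eviction order are assumptions."""

    def __init__(self, capacity_tb, threshold=0.70):
        self.capacity_tb = capacity_tb
        self.threshold = threshold    # the manual's 70% used-capacity trigger
        self.native = OrderedDict()   # file name -> native size in TB, oldest first
        self.truncated = set()        # files that now exist only as deduped blocks

    def used_fraction(self):
        return sum(self.native.values()) / self.capacity_tb

    def ingest(self, name, size_tb):
        """Store the native copy, then truncate old copies only if space demands it."""
        self.native[name] = size_tb
        while self.native and self.used_fraction() >= self.threshold:
            oldest, _ = self.native.popitem(last=False)
            self.truncated.add(oldest)

    def restore_source(self, name):
        """Say where a restore would read from."""
        if name in self.native:
            return "native"           # fast path: the cached native copy
        if name in self.truncated:
            return "block_pool"       # slow path: rebuild from deduped blocks
        raise KeyError(name)


if __name__ == "__main__":
    pool = NativePool(capacity_tb=100)
    for tape in ("a", "b", "c"):
        pool.ingest(tape, 30)
    for tape in ("a", "b", "c"):
        print(tape, pool.restore_source(tape))
```

In this run the third ingest pushes usage to 90%, so the oldest native copy is truncated and only the two newest files restore from the native copy.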
Let’s Review: The DL 3D 4000 (and the 1500 & 3000) have a cached copy of the native data.
Mark then said “What part of there’s only one copy of data in the system is unclear? ... If the backup data is in the native pool and you go to restore it’ll be read from the native pool. If it’s in the block pool it’ll be read from the block pool. It’ll either be one place or the other. Never both at the same time.” (emphasis mine)
The 3D 4000 manual states:
“the data is copied concurrently to the corresponding tapes on the 3D 4000 through data migration process. At this point, the data resides on both the EDL and the 3D 4000.” (emphasis mine)
The manual goes on to explain a process on the EDL that is very similar to truncation on the back-end 3D 4000. It calls this process “reclamation” and it says that it will “remove that data from the EDL, leaving the data to reside exclusively on the 3D 4000, thereby creating more available space on the EDL. ... The virtual tape on the EDL is then replaced with a tape stub. This tape stub acts as a pointer to the corresponding tape residing on the 3D 4000.”
It says that the reclamation process can be triggered as soon as the migration is successful (immediate), when a given disk space threshold is reached (Used space threshold), after a pre-determined period of time (Retention Period), or never. That’s right; one of the options is to never delete the original data on the EDL, leaving all data always on both systems.
Let’s review: There appears to be more than “one copy of data in the system” and the possibility of having data in both places “always” (if you choose the option to never delete the copy on the EDL) is a whole lot more than “never both at the same time.”
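The four reclamation triggers can be written down as a small decision function. The sketch below is illustrative Python, not the EDL's logic: the enum, the function name `should_reclaim`, and the default threshold and retention values are all assumptions. Only the four trigger types come from the manual.

```python
from enum import Enum


class ReclamationTrigger(Enum):
    IMMEDIATE = "immediate"
    USED_SPACE_THRESHOLD = "used space threshold"
    RETENTION_PERIOD = "retention period"
    NEVER = "never"


def should_reclaim(trigger, migrated, used_fraction=0.0, space_threshold=0.80,
                   age_days=0, retention_days=30):
    """Decide whether the EDL copy of a tape may be replaced by a stub.

    The default threshold and retention values are placeholders, not EMC defaults.
    """
    if not migrated:
        return False  # never reclaim a tape that has no copy on the back end
    if trigger is ReclamationTrigger.IMMEDIATE:
        return True
    if trigger is ReclamationTrigger.USED_SPACE_THRESHOLD:
        return used_fraction >= space_threshold
    if trigger is ReclamationTrigger.RETENTION_PERIOD:
        return age_days >= retention_days
    return False      # NEVER: the data stays on both systems indefinitely


if __name__ == "__main__":
    for trigger in ReclamationTrigger:
        decision = should_reclaim(trigger, migrated=True,
                                  used_fraction=0.5, age_days=45)
        print(trigger.value, decision)
```

Under every trigger except immediate there is a window, possibly an unbounded one, during which the same tape exists on both the EDL and the 3D 4000.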
He then went on to say that my “opinion on consolidated media management is out of touch as not only is it the most frequently sold option with the Disk Library it was the very first request for enhancement submitted to me by customers.” Actually, I think that if you look at this post, you’ll find I’m trying to drive the idea of media management forward, not backward.
As to the 3D 4000 being more advanced than me, consider the following three quotes from the manual:
“3D 4000 appliance does not support Path to Tape feature.”
“3D 4000 does not support either support [sic] NAS backup or NAS sharing.”
“You must not add, edit, or delete virtual tape libraries in 3D 4000 using remote management pages as it may result in data loss. The features are available only for diagnostic purposes.”
Note to readers: The remote management pages to which the manual refers are the web pages that you can use to manage the 3D/Quantum products via their IP interface. In the 3D 4000, they’re still on, but you’re not supposed to use them.
Note to EMC: Seriously? You couldn’t at least turn off the service that advertises the remote management pages in the 3D box when you plug into the back of an EDL? They’re still available, but the customer isn’t supposed to use them? And if they use them it causes data loss? Seriously? How hard would it be to deactivate this service so that the customer doesn’t have the possibility of causing data loss via a tool provided by your product?
So unlike the 3D 1500 and 3000, the 3D 4000 does not support the Path to Tape feature, the NAS feature, or remote management pages. In fact, the remote management pages are actually available, but can result in data loss if they’re used. It seems that the 3D 4000 is a lot more “out of touch” than I am.
Finally, Mark asks “Why would a system with a drive count split between native and block pools be priced differently than any of the other post processing solutions structured the same way?”
The first cost issue is that the different “pools” are forever dedicated to the front end or back end and cannot expand or contract to meet current needs. Suppose you bought a 100 TB front end EDL and a 100 TB back-end 3DL. If you decided to dedupe everything, you cannot move any of the 100 TB of front end disk to the block pool. If you decide to not dedupe more than 100 TB of data, you cannot move any of the disk from the 3DL to the EDL for this purpose.
This is actually similar to the way the FalconStor dedupe VTL is priced/configured, as the “landing zone” (where native data is written before it’s deduped) is one pool of disk and the deduped data is written to another pool of disk; they cannot be shared. So in this respect, the 3D 4000 isn’t that much different than FalconStor.
But other post-process dedupe systems (Exagrid, Quantum, SEPATON) are different. You can change your mind any time you want to as to how much data is stored in deduped format and how much is stored in native format. The different pools can shrink or expand as you need them to. Not being able to do this will inevitably result in wasted disk and make the 3D 4000 (and its sister the FalconStor box) cost more than other systems.
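A quick way to see the wasted-disk point is to compare fixed pools with one shared pool for the same demand. The Python sketch below is a back-of-the-envelope model: the function names are hypothetical, and the 20 TB native / 150 TB deduped demand in the example is invented purely for illustration. Only the 100 TB + 100 TB configuration comes from the example above.

```python
def fixed_pools(native_cap, block_cap, native_need, block_need):
    """Return (unmet demand, idle disk) in TB when the two pools cannot share disk."""
    unmet = max(0, native_need - native_cap) + max(0, block_need - block_cap)
    idle = max(0, native_cap - native_need) + max(0, block_cap - block_need)
    return unmet, idle


def shared_pool(total_cap, native_need, block_need):
    """Return (unmet demand, idle disk) in TB when disk can move between uses."""
    need = native_need + block_need
    return max(0, need - total_cap), max(0, total_cap - need)


if __name__ == "__main__":
    # 100 TB front end plus 100 TB back end, with a made-up demand of
    # 20 TB native and 150 TB deduped data.
    print("fixed :", fixed_pools(100, 100, 20, 150))
    print("shared:", shared_pool(200, 20, 150))
```

With these made-up numbers the fixed configuration is 50 TB short in one pool while 80 TB sits idle in the other, whereas the shared pool meets all demand with 30 TB to spare.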
The second cost issue is that the extra movements of data mentioned earlier require extra CPU and RAM resources that other systems don’t have.
In theory, since a lot of the cost of a dedupe system comes from software, it’s possible that EMC can make up these differences. Only a full pricing comparison can figure that out.
One final bit of news from the manual is that the 3D 4000 is really a 3D 3000. The hardware and software are exactly the same, but it has decreased functionality (no NAS, no Path to Tape). The only difference between the two is that a 3D 4000 is plugged into the back of an EDL.
Final Summary: I believe I've demonstrated that the 3D 4000 documentation supports everything I said about it in my first post.
Can we at least agree that Mr. Preston is very well versed on this subject? In this thread I see Mark admitting a lot of "correct" and "yes" and even an "I defer to the manual on that one and was incorrect." Since credibility is how Curtis makes his living, can we agree that Curtis has that in spades and you were way off the mark in your previous post where you stated "I'd appreciate it if you made all the relevant corrections and I realise how difficult it can be to pick out facts when you're not building these things from the screws up." Curtis seems to have his facts well in order to the point where you needed to issue your own list of agreements and corrections. Maybe it was a poor attempt at humor but if you want to be demeaning in your post, that's your choice. I'd suggest a robust debate is better when both sides hold the facts and the people in high regard.
As an aside, can we also agree that we should by default disable diagnostic services that can cause data loss? I can field test that question but I guarantee the overwhelming response from Administrators will be, "What?! No, turn it off. I'll enable it if it gets to that point."
You can be the SME in Ireland until I return to my island. I personally 'love' EMC storage products and have installed many (20+) C/EDLs into TSM environments. The only problem I ever had with EMC was the 'people' that worked for them. While I don't believe Curtis is always correct, he provides insights and gives potential pitfalls that 'may' be encountered in the black art of data protection. But for (Skid)Mark, you really need to get out more. I can see why you've lasted so long at the Evil Machine Company, because you drank their Kool-Aid long ago. I also seriously doubt that you think in exabytes. The largest shops I've been to can't even begin to go to that level, and they are all primarily Fortune 50.
Analysts make their money from vendors. I make my money by independent writing (i.e. articles -- NOT vendor whitepapers) and by direct work with end users.
I had no idea you were Storagezilla. That's why I never read your About page. I did google you when you put your comment in. Although the first entry is Storagezilla, I didn't see your name in the following text, so I thought that an anomaly. Now I figured out who you are, so I corrected my post. (I do make corrections when they're warranted.)
I never said I don't like the Quantum immediate model. I merely needed to point out that it is closer in I/O pattern to the post-process model than the inline model, so that I could THEN add the additional 200% of I/O that the 4000 creates. That was the point of that whole exercise. If you ever hear me speak or write about post-process vs inline, I give a good number of advantages and disadvantages to both, so I still say it doesn't matter. What matters is what you get for what you pay.
Two definitions of cache? I don't think so. Every technical definition of cache I could find can be summarized as "a copy of recent data stored to increase data transfer rates." You said repeatedly in your previous comments that "there is no cached copy," and that I should correct that. Both the EMC and the Quantum documentation back up the claim that there is indeed a cached copy, so I don't see anything to correct.
The 256 MB to which you refer is the element of work in the Quantum dedupe engine, not a single 256 MB area. Once 256 MB comes into the box, they create what I'll call a chunk of data and write it to disk. While this is happening, the dedupe engine is trying to dedupe data. It prefers to read the data it will dedupe from RAM, but it will get it from disk if it is no longer in RAM. But ALL data in its original format (that was written in these 256 MB chunks) is left on disk until it is truncated. That data is used to increase restore speeds of deduped data, and I suggest that any reasonable techy would call that a cache.
Whether you count on this cache being there or not is not the point. The point is that the cache that you said didn't exist exists. (I'm sorry to put such a blunt point on it, but the whole point of your initial comment to my first post was that I didn't understand the architecture and was describing it wrong. I am maintaining that I understand it just fine.)
As to the whole discussion about trusting the admins, etc.... I suggest that it has nothing to do with trust and that if you put an interface out there that someone can access, they might eventually use it. Staff continually changes and not everyone reads the manual (shocking, I know), and the fact that you have an interface available that can cause data loss bothers me. It reminds me of the "vmquery -deassignbyid" command that Veritas always told people not to use. They found it useful anyway and kept using it -- to their detriment. Eventually Veritas "fixed" the command so it no longer did the thing that caused data loss.
I am not convinced that the ESN is more advanced. I've discussed that in other posts, but suffice it to say that I'm not sold on it. I'm not saying that I know it's not better; I'm just saying I'm not convinced it's better.
You then go on to say (my summary) that the best use case for the 3D 4000 is to dedupe an existing 4000. I said in my original post that I was OK with that use case, although I do think it's prudent to compare the cost of adding a 3000 to an existing EDL with the cost of buying a "regular" dedupe VTL and throwing away the 4000. If your product is as price competitive as EMC claims, then this recommendation should not bother you.
As to the costing issue, I've asked for pricing from all the main vendors for a separate post on that topic. So far EMC is the only vendor who hasn't replied to any of my emails on the subject. It'll be difficult to prove EMC's point that the 3D4000 is cost competitive if they won't tell me how much it costs.
Not sure who you spoke to at EMC (Oh that's a lie, I know exactly who you started speaking to at EMC and when because I checked), but they got my title wrong.
Or you could have checked the About page on my blog. No one checks About pages anymore and I don't know why.
Would have been back to you sooner but end of quarter and had real customers to deal with, not analysts to be playing with. And you shouldn't take any comments as "rather personal"; this is nothing more than robust debate. You'll never mistake rather personal from me for anything else should I ever get rather personal.
But that's not the case here.
Down to it but it's late so this could occur in multiple comments.
Okay, so to begin with you're not a big fan of the Quantum immediate de-dup process due to the fact it writes and reads from disk. That's the process for immediate de-dup with the 1500, 3000 and DL3D 4000 de-dup option as well as all of Quantum's native offerings.
Fair enough. You don't have to like it, just accept it as an architectural choice and move on.
"So, the DL3D 4000 is storing a non-deduped (native) copy of the data for the purposes of increasing restore performance. If that's not a cache, I don't know what is."
We have two different definitions of cache; in your case it's another copy. That's not my definition of cache in this case, as I see cache as that 256MB immediate de-dup cache area. This other copy is transient so I personally don't factor it in when it comes to restore performance as you can't count on it being there. It's dependent on the free space in the system. If you have unused space it'll be there. If you don't it'll be truncated and won't.
But there is a case here for free space being wasted space. If you have it, there is no reason *not* to use it.
And again with all these moves you point to, that's true of the 1500/3000 as it's an internal operation to the unit. It's clear you don't like it, but it doesn't matter that you don't like it.
"There appears to be more than "one copy of data in the system" and the possibility of having data in both places "always" (if you choose the option to never delete the copy on the EDL) is a whole lot more than "never both at the same time."
I defer to the manual on that one and was incorrect in assuming you couldn't disable the grooming process. This was an error on my part and I accept that.
"3D 4000 appliance does not support Path to Tape feature."
Correct because it supports the more advanced Embedded Storage Node (ESN) or Embedded Media Server (EMS). Let's make no bones about it, ESN/EMS do more when it comes to moving/stacking/replicating data sets in a catalogue consistent fashion.
"3D 4000 does not support either support [sic] NAS backup or NAS sharing."
Yes because all resources are dedicated to immediate de-dup and we're clear about that. It's not written anywhere that the 3D 4000 supports NAS. It's not a supported configuration and should not be used as such.
I'm not even going to say at your own risk. Don't do it.
"Seriously? You couldn't at least turn off the service that advertises the remote management pages in the 3D box when you plug into the back of an EDL? They're still available, but the customer isn't supposed to use them? And if they use them it causes data loss? Seriously? How hard would it be to deactivate this service so that the customer doesn't have the possibility of causing data loss via a tool provided by your product?"
Seriously? How about you just not use it as it's a support only thing?
Seriously? When is someone an Admin and when are they an amateur who'll login and muck around with something when they've been told not to? Is your Admin 8 years old?
Seriously? If you're capable of not logging into your backup application and relabeling all the tapes or deleting the backup catalogue you're capable of leaving well alone.
It's there for support, not for anyone else. If we were to take it to what you're looking for we'd weld the arrays shut so no one could pull out all the disk drives just because they can remove the front panels.
Let's elevate the discussion above that and treat the Admin as a grown-up and a professional, shall we?
"The first cost issue is that the different "pools" are forever dedicated to the front end or back end and cannot expand or contract to meet current needs. Suppose you bought a 100 TB front end EDL and a 100 TB back-end 3DL. If you decided to dedupe everything, you cannot move any of the 100 TB of front end disk to the block pool. If you decide to not dedupe more than 100 TB of data, you cannot move any of the disk from the 3DL to the EDL for this purpose."
The thing being that most 3D 4000 buyers are existing 4000 series owners who are adding De-Dup. They're not looking to rebalance storage they're looking to either keep more data on spinning disk and ditch tape entirely or replicate de-duplicated data off site so they can take the bandwidth reduction it gives them.
The use cases for this are specific. I know you just love hammering on this. How many words and how much time have you devoted to this? It probably dwarfs all the coverage of any other VTL product from anyone else, but it fits where it fits and where it doesn't fit the portfolio is broad enough so that something else does.
The EDL offers VTL *and* De-Dup with software from Quantum. VTL with software from FalconStor *with* optional De-Dup from Quantum and Mainframe VTL with software from BusTech. There's something for every use case.
As for CPU and RAM resources that reminds me, as we're talking about costs what's the CPU to Disk ratio of the solutions you mention in this post?
At what number of spindles or amount of capacity do I end up buying another node/engine/head and what does that do to the cost?
There's a lot of cost we can take out because we own hardware manufacturing end to end and operate at huge volumes in comparison to competitors. That is a competitive advantage.
I'll probably have more later. Take your time with any responses as I'm not going anywhere.
Sounds like old school unix filesystem DMAPI extensions or stubs to me.
Cache would imply there was a portion of the data retained for seek and access time benefit. Also, the data being retained in place and in "Prime" form would suggest a fragmentation potential, unless idle VTL cycles are built into the design?
If you needed to read the data the user would be left waiting for the reconstituted file to be "located, read and streamed to the storage"
Maybe for Deep archive but not a good choice for "useful" information.
Here is what the Quantum DXi manual says about cache. Although they never use the word cache, that is exactly what it is. The language in this manual is almost word for word identical to the 3D4000 stuff you posted about cache.
Another way the DXi system reduces the amount of file system capacity used is to truncate the de-duplicated data. Once the de-duplicated data is truncated, only the metadata is available on the file system. This reduces the amount of capacity required in the file system. Once truncated, the file must be reconstituted using its tag before you are able to access the file.
So in not so many words, your backup data will be preserved in its native form for such a period of time until it becomes necessary to remove it due to storage requirements. After that, if you need the data restored, copied to tape, or read for the purposes of synthetic fulls it needs to be re-duped.
That's a really good point. Unfortunately, it weighs even further against the EMC and Quantum solutions. Unlike some other solutions that use hardware compression, the Quantum box turns it off for native data that will be deduped. That's why you see the native ingest rate double if the data isn't going to be deduped. This would decrease the I/O needed by those who DO perform hardware compression of the native data.
Also, I don't want these posts about I/O to suggest that I'm saying that post-processing systems and the 3D 4000 are necessarily bad because they have extra I/O. I'm just saying that this I/O comes at a cost that must be factored into the system. For a fuller comparison of the differences between inline and post-processing systems, check this article out: http://www.backupcentral.com/content/view/134/47/
I've enjoyed reading the last few blogs and it has given me more things to think about with regards to dedupe solutions available in the market. Reading the above however I did notice that the I/O estimates don't mention that some VTLs are equipped with hardware compression which should reduce the amount of data read and written from disk. It's true the use of a hardware compression adapter isn't "free" either, but it is one more wrinkle you could include in your comparisons.
Curtis, I am on very long terms with backup deduplication buzz, and I rarely guess that EDL stands for something Electronic Data Library, but I want to tell you I like to read your posts to learn new info. What is more you do sound very convincing.
Comments
As an aside, can we also agree that we should by default disable diagnostic services that can cause data loss? I can field test that question but I guarantee the overwhelming response from Administrators will be, "What?! No, turn it off. I'll enable it if it gets to that point."
I personllay 'love' EMC storage products and have installed many (20+) C/EDLs into TSM environments. The only problem I ever had with EMC was the 'people' that worked for them. While I dont' believe Curtis is always correct, he provides insights and give potential pitfalls that 'may' be encountered in the black art of data protection. But for (Skid)Mark, you really need to get out more.
I can see why you've lasted so long at the Evil Machine Company, becasue you drank their cool-aid long ago. I also seriously doubt that you thnk in Exabites. The largest shops I've been to can't even begin to go to that level, and the are all primarily fortune 50.
Have a nice day!
I had no idea you were Storagezilla. That's why I never read your About page. I did google you when you put your comment in. Although the first entry is Storagezilla, I didn't see your name in the following text, so I thought that an anomaly. Now I figured out who you are, so I corrected my post. (I do make corrections when they're warranted.)
I never said I don't like the Quantum immediate model. I merely needed to point out that it is closer in I/O pattern to the post-process model than the inline model, so that I could THEN add the additional 200% of I/O that the 4000 creates. That was the point of that whole exercise. If you ever hear me speak or write about post-process vs inline, I give a good number of advantages and disadvantages to both, so I still say it doesn't matter. What matters is what you get for what you pay.
Two definitions of cache? I don't think so. Every technical definition of cache I could find can be summarized as "a copy of recent data stored to increase data transfer rates." You said repeatedly in your previous comments that "there is no cached copy," and that I should correct that. Both the EMC and the Quantum documentation back up that there is indeed a cached copy, so I don't see anything to correct.
The 256 MB to which you refer is the element of work in the Quantum dedupe engine, not a single 256 MB area. Once 256 MB comes into the box, they create what I'll call a chunk of data and write it to disk. While this is happening, the dedupe engine is trying to dedupe data. It prefers to read the data it will dedupe from RAM, but it will get it from disk if it is no longer in RAM. But ALL data in its original format (that was written in these 256 MB chunks) is left on disk until it is truncated. That data is used to increase restore speeds of deduped data, and I suggest that any reasonable techy would call that a cache.
Whether you count on this cache being there or not is not the point. The point is that the cache that you said didn't exist exists. (I'm sorry to put such a blunt point on it, but the whole point of your initial comment to my first post was that I didn't understand the architecture and was describing it wrong. I am maintaining that I understand it just fine.)
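To make the behavior described above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not Quantum's or EMC's implementation: the class name, the dedupe ratio, and the oldest-first truncation order are all assumptions made for illustration. The only things taken from the discussion are the 256 MB unit of work, the native copy that stays on disk, and the fact that the native copy is truncated when space is needed.

```python
from collections import OrderedDict

CHUNK_MB = 256  # unit of work described in the comment above


class NativeCopyStore:
    """Toy model (hypothetical): deduped blocks are always kept, while
    native chunks stay on disk only as long as free space allows.
    Oldest-first truncation is an assumption, not documented behavior."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.native = OrderedDict()  # chunk_id -> size in MB of the native copy
        self.deduped_mb = 0.0        # space consumed by deduped blocks

    def used_mb(self):
        return self.deduped_mb + sum(self.native.values())

    def ingest(self, chunk_id, dedupe_ratio):
        # Write the native chunk, then account for its deduped form.
        self.native[chunk_id] = CHUNK_MB
        self.deduped_mb += CHUNK_MB / dedupe_ratio
        # Under space pressure, truncate native copies (oldest first).
        while self.used_mb() > self.capacity_mb and self.native:
            self.native.popitem(last=False)

    def restore_source(self, chunk_id):
        # A restore is served from the native copy if it still exists.
        if chunk_id in self.native:
            return "native copy"
        return "reconstituted from deduped blocks"


store = NativeCopyStore(capacity_mb=1024)
for chunk_id in range(4):
    store.ingest(chunk_id, dedupe_ratio=4)  # ratio is a made-up figure

print(store.restore_source(3))  # most recent chunk
print(store.restore_source(0))  # oldest chunk, truncated under space pressure
```

In this model a restore of recent data is fast because the native copy is still there, and a restore of older data falls back to reconstitution. That is the same trade-off a cache makes, whether or not you can count on a given chunk still being present.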
As to the whole discussion about trusting the admins, etc.... I suggest that it has nothing to do with trust and that if you put an interface out there that someone can access, they might eventually use it. Staff continually changes and not everyone reads the manual (shocking, I know), and the fact that you have an interface available that can cause data loss bothers me. It reminds me of the "vmquery -deassignbyid" command that Veritas always told people not to use. They found it useful anyway and kept using it -- to their detriment. Eventually Veritas "fixed" the command so it no longer did the thing that caused data loss.
I am not convinced that the ESN is more advanced. I'd discussed that on other posts, but suffice it to say that I'm not sold on it. I'm not saying that I know it's not better; I'm just saying I'm not convinced it's better.
You then go on to say (my summary) that the best use case for the 3D 4000 is to dedupe an existing 4000. I said in my original post that I was OK with that use case, although I do think it's prudent to compare the cost of adding a 3000 to an existing EDL with the cost of buying a "regular" dedupe VTL and throwing away the 4000. If your product is as price competitive as EMC claims, then this recommendation should not bother you.
As to the costing issue, I've asked for pricing from all the main vendors for a separate post on that topic. So far EMC is the only vendor who hasn't replied to any of my emails on the subject. It'll be difficult to prove EMC's point that the 3D4000 is cost competitive if they won't tell me how much it costs.
Not sure who you spoke to at EMC (Oh that's a lie, I know exactly who you started speaking to at EMC and when because I checked), but they got my title wrong.
Or you could have checked the About page on my blog. No one checks About pages anymore and I don't know why.
Would have been back to you sooner but it's end of quarter and I had real customers to deal with, not analysts to be playing with. And you shouldn't take any comments as "rather personal"; this is nothing more than robust debate. You'll never mistake rather personal from me for anything else should I ever get rather personal.
But that's not the case here.
Down to it but it's late so this could occur in multiple comments.
Okay, so to begin with you're not a big fan of the Quantum immediate de-dup process due to the fact it writes and reads from disk. That's the process for immediate de-dup with the 1500, 3000 and DL3D 4000 de-dup option as well as all of Quantum's native offerings.
Fair enough. You don't have to like it, just accept it as an architectural choice and move on.
"So, the DL3D 4000 is storing a non-deduped (native) copy of the data for the purposes of increasing restore performance. If that's not a cache, I don't know what is."
We have two different definitions of cache. In your case it's another copy. That's not my definition of cache in this case, as I see cache as that 256MB immediate de-dup cache area. This other copy is transient, so I personally don't factor it in when it comes to restore performance, as you can't count on it being there. It's dependent on the free space in the system. If you have unused space it'll be there. If you don't, it'll be truncated and won't.
But there is a case here for free space being wasted space. If you have it, there is no reason *not* to use it.
And again, all these moves you point to are true of the 1500/3000 as well, as it's an internal operation to the unit. It's clear you don't like it, but it doesn't matter that you don't like it.
"There appears to be more than "one copy of data in the system" and the possibility of having data in both places "always" (if you choose the option to never delete the copy on the EDL) is a whole lot more than "never both at the same time."
I defer to the manual on that one and was incorrect in assuming you couldn't disable the grooming process. This was an error on my part and I accept that.
"3D 4000 appliance does not support Path to Tape feature."
Correct, because it supports the more advanced Embedded Storage Node (ESN) or Embedded Media Server (EMS). Let's make no bones about it: ESN/EMS do more when it comes to moving/stacking/replicating data sets in a catalogue-consistent fashion.
"3D 4000 does not support either support [sic] NAS backup or NAS sharing."
Yes because all resources are dedicated to immediate de-dup and we're clear about that. It's not written anywhere that the 3D 4000 supports NAS. It's not a supported configuration and should not be used as such.
I'm not even going to say at your own risk. Don't do it.
"Seriously? You couldn't at least turn off the service that advertises the remote management pages in the 3D box when you plug into the back of an EDL? They're still available, but the customer isn't supposed to use them? And if they use them it causes data loss? Seriously? How hard would it be to deactivate this service so that the customer doesn't have the possibility of causing data loss via a tool provided by your product?"
Seriously? How about you just not use it as it's a support only thing?
Seriously? When is someone an Admin and when are they an amateur who'll log in and muck around with something when they've been told not to? Is your Admin 8 years old?
Seriously? If you're capable of not logging into your backup application and relabeling all the tapes or deleting the backup catalogue you're capable of leaving well alone.
It's there for support, not for anyone else. If we were to take it to what you're looking for we'd weld the arrays shut so no one could pull out all the disk drives just because they can remove the front panels.
Let's elevate the discussion above that, to the fact that we treat the Admin as a grown-up and a professional, shall we?
"The first cost issue is that the different "pools" are forever dedicated to the front end or back end and cannot expand or contract to meet current needs. Suppose you bought a 100 TB front end EDL and a 100 TB back-end 3DL. If you decided to dedupe everything, you cannot move any of the 100 TB of front end disk to the block pool. If you decide to not dedupe more than 100 TB of data, you cannot move any of the disk from the 3DL to the EDL for this purpose."
The thing being that most 3D 4000 buyers are existing 4000 series owners who are adding De-Dup. They're not looking to rebalance storage; they're looking to either keep more data on spinning disk and ditch tape entirely, or replicate de-duplicated data off site so they can take the bandwidth reduction it gives them.
The use cases for this are specific. I know you just love hammering on this. How many words and how much time have you devoted to this? It probably dwarfs all the coverage of any other VTL product from anyone else, but it fits where it fits, and where it doesn't fit the portfolio is broad enough so that something else does.
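For readers who want numbers on the dedicated-pool scenario quoted above, here is a minimal Python sketch. It is only arithmetic on the 100 TB + 100 TB example from the quote; the function names, the "shared pool" alternative, and the 150 TB / 20 TB workload are hypothetical and do not describe any particular vendor's product.

```python
def fixed_pools_fit(front_tb, back_tb, native_tb, deduped_tb):
    """Dedicated pools: native data must fit in the front end and
    deduped data (its stored, post-dedupe size) in the back end."""
    return native_tb <= front_tb and deduped_tb <= back_tb


def shared_pool_fit(total_tb, native_tb, deduped_tb):
    """Hypothetical single pool: only the combined footprint matters."""
    return native_tb + deduped_tb <= total_tb


# The quoted example: 100 TB front end (EDL) and 100 TB back end (3D).
# Made-up workload: 150 TB kept native, 20 TB of stored deduped data.
print(fixed_pools_fit(100, 100, native_tb=150, deduped_tb=20))
print(shared_pool_fit(200, native_tb=150, deduped_tb=20))
```

With the same 200 TB of raw disk, the workload fits a shared pool but not the dedicated pools, which is the rebalancing limitation the quote describes. Whether that matters depends on the use case being argued here.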
The EDL offers VTL *and* De-Dup with software from Quantum. VTL with software from FalconStor *with* optional De-Dup from Quantum and Mainframe VTL with software from BusTech. There's something for every use case.
As for CPU and RAM resources that reminds me, as we're talking about costs what's the CPU to Disk ratio of the solutions you mention in this post?
At what number of spindles or amount of capacity do I end up buying another node/engine/head and what does that do to the cost?
There's a lot of cost we can take out because we own hardware manufacturing end to end and operate at huge volumes in comparison to competitors. That is a competitive advantage.
I'll probably have more later. Take your time with any responses as I'm not going anywhere.
Cache would imply there was a portion of the data retained for seek and access time benefit. Also, retaining the data in place and in "Prime" form would suggest a potential for fragmentation, unless idle VTL cycles are built into the design?
If you needed to read the data, the user would be left waiting for the reconstituted file to be "located, read and streamed to the storage."
Maybe for Deep archive but not a good choice for "useful" information.
IMHO -
Solutionsarchitect.com
TAJ
Another way the DXi system reduces the amount of file system capacity used is to truncate the de-duplicated data. Once the de-duplicated data is truncated, only the metadata is available on the file system. This reduces the amount of capacity required in the file system. Once truncated, the file must be reconstituted using its tag before you are able to access the file.
So in not so many words, your backup data will be preserved in its native form until it becomes necessary to remove it due to storage requirements. After that, if you need the data restored, copied to tape, or read for the purposes of synthetic fulls, it needs to be reconstituted.
Sounds like cache to me.
Also, I don't want these posts about I/O to suggest that I'm saying that post-processing systems and the 3D 4000 are necessarily bad because they have extra I/O. I'm just saying that this I/O comes at a cost that must be factored into the system. For a fuller comparison of the differences between inline and post-processing systems, check this article out: http://www.backupcentral.com/content/view/134/47/