Microsoft StorSimple 8000 series review
Today I attended a Microsoft StorSimple 8000 series presentation at Microsoft’s offices here in Adelaide, South Australia. It’s a 2RU / 4RU device that provides primary storage through iSCSI, with inline block-level automatic tiering, de-duplication and compression, where the final tier is Azure cloud storage. It incorporates volume-level snapshots “for backup” and supports thin provisioning and VAAI (and ODX) capabilities. The device must be purchased in conjunction with Azure capacity of 200 TB or 500 TB, while local capacity is 15 TB or 40 TB, with 5% of that capacity being SSD and the rest SAS.
Features and capabilities
The official marketing page from Microsoft is here, but the technical specs are listed on a Seagate website, which states the device is made and distributed by Xyratex, a subsidiary of Seagate. This ownership appears to be a bit fuzzy; however, I was intrigued to notice that the product does not sport Microsoft/Azure colours or branding.
The paper spec sheet is fine – it ticks all the right boxes for an on-premises iSCSI Active/Passive array, offering the features and capabilities that are often included as standard on most modern low to mid-range rack-mountable arrays. Everything is N+1 redundant, it has two 10GbE uplinks per controller, and it appears to have SAS uplinks for expansion (although all indications are that you cannot actually use them).
The differentiator, though, is that the third tier of storage is provided by Azure. The automatic tiering of 64 KB data blocks includes a final step that moves de-duplicated, compressed and encrypted data to an Azure virtual appliance. The entire process is automatic and fully integrated – what Microsoft calls cloud-integrated storage.
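To make the mechanics concrete, here is a minimal Python sketch of what an inline de-duplicate / compress (then encrypt and upload) pipeline over 64 KB blocks looks like in general. This is my own illustration of the technique – the hashing, compression and in-memory structures are assumptions, not StorSimple’s actual implementation.

```python
import hashlib
import os
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described in the presentation

# Hypothetical in-memory structures standing in for the appliance's metadata store
block_store = {}   # fingerprint -> compressed (and, on a real appliance, encrypted) block
volume_map = []    # logical block index -> fingerprint

def ingest(data: bytes) -> None:
    """Split incoming writes into 64 KB blocks, de-duplicate by fingerprint,
    then compress before the block is eventually tiered (ultimately to Azure)."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in block_store:       # de-duplication
            payload = zlib.compress(block)        # compression
            # encryption would happen here, before any upload to the cloud tier
            block_store[fingerprint] = payload
        volume_map.append(fingerprint)            # metadata always stays local

if __name__ == "__main__":
    ingest(os.urandom(BLOCK_SIZE) * 4)   # 4 identical blocks -> stored once
    print(f"{len(volume_map)} logical blocks, {len(block_store)} unique stored")
```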
A key feature is that the metadata for each snapshot, and for the data within each tier, is stored on the appliance. Mounting a snapshot as a new volume on a server is therefore almost instant, and when data sitting on a lower tier (including the Azure-hosted tier) is accessed, the blocks are recalled seamlessly as they are read.
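The read path can be sketched the same way: because the block map lives locally, a mounted snapshot is available immediately, and only blocks that have been tiered away incur a recall cost on first read. Again this is an illustration only – the tier names, the placeholder URL and the ~50 ms figure are assumptions.

```python
import time

# Hypothetical metadata: every block's location is known locally,
# so "mounting" a snapshot is just exposing this map - near instant.
snapshot_map = {
    0: ("ssd",   b"hot block"),
    1: ("sas",   b"warm block"),
    2: ("azure", "https://example.blob.core.windows.net/tier/blk2"),  # placeholder URL
}

def read_block(index: int) -> bytes:
    """Reads are seamless, but latency depends entirely on where the block lives."""
    tier, location = snapshot_map[index]
    if tier == "azure":
        time.sleep(0.05)          # stand-in for a ~50 ms round trip to the cloud tier
        return b"recalled from Azure"
    return location

for i in snapshot_map:
    start = time.perf_counter()
    read_block(i)
    print(f"block {i}: {1000 * (time.perf_counter() - start):.1f} ms")
```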
All good and seamless – a drop-in-and-go appliance that manages itself pretty well.
Microsoft StorSimple 8000 series Pricing
Then the shock. As I said before, the features and capabilities (except the cloud tier and the integrated “backup” features) are available in most SANs (don’t confuse this with bottom-end NAS devices, or real SANs with Active/Active controllers).
Microsoft want to charge $100,000 for the 15 TB device alone, plus $22,230 a year for support, plus the Azure charges for 200 TB of capacity.
The 40 TB device is 4RU in size (it appears to be two units) and costs $170,000 for the device, plus $37,780 for the first year of support, plus further charges for the 500 TB of Azure storage space.
So, it appears that not only are Microsoft charging a premium for an Active/Passive iSCSI storage appliance that you could buy elsewhere for half or even a third of the cost, but they are also slugging you to be tied into a large chunk of Azure storage – which you pay for whether or not you use it all.
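A rough back-of-envelope helps put that in context. The device and support figures below are the ones quoted above; the Azure per-GB rate is a placeholder assumption for illustration, not a quoted price – plug in current Azure pricing yourself.

```python
# Indicative three-year cost for the 15 TB unit, using the figures quoted above.
# AZURE_RATE_PER_GB_MONTH is an assumption for illustration - not a quoted price.
DEVICE = 100_000
SUPPORT_PER_YEAR = 22_230
AZURE_COMMIT_TB = 200                   # you pay for the commitment whether you fill it or not
AZURE_RATE_PER_GB_MONTH = 0.03          # placeholder - check current Azure blob pricing
YEARS = 3

azure_per_year = AZURE_COMMIT_TB * 1024 * AZURE_RATE_PER_GB_MONTH * 12
total = DEVICE + YEARS * (SUPPORT_PER_YEAR + azure_per_year)
print(f"Indicative {YEARS}-year cost: ${total:,.0f}")
print(f"  of which the Azure commitment: ${YEARS * azure_per_year:,.0f}")
```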
No, disk snapshots are NOT backup
In the presentation I attended, Microsoft attempted to assert that snapshots are backups, even while acknowledging that people commonly point out that snapshots are not real backups. A question was asked about application-consistent backups, consistency groups across multiple volumes/disks, and single file recovery – to which Microsoft still maintained that a disk-level snapshot was a valid backup.
No Microsoft, it is not.
Applications store in-flight transactions in memory – these need to be committed to disk for the backup to be consistent. Admittedly, VSS can help with some (particularly Microsoft) applications, but as soon as you have any application that needs more than VSS, you need a separate backup tool.
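For what it’s worth, the ordering a proper backup tool has to enforce is simple to sketch. The quiesce/thaw hooks below are hypothetical stand-ins for a VSS writer, a database “begin backup” command, or an in-guest agent – the point is that a snapshot only becomes application-consistent because something flushed and paused the application first.

```python
from contextlib import contextmanager

@contextmanager
def quiesced(app):
    """Hypothetical quiesce hook - in practice this is a VSS writer,
    a database 'begin backup' command, or an agent inside the guest."""
    app.flush_transactions_to_disk()   # commit in-flight transactions
    app.pause_writes()                 # hold new writes briefly
    try:
        yield
    finally:
        app.resume_writes()            # thaw as soon as the snapshot exists

def consistent_backup(app, storage):
    with quiesced(app):
        snap = storage.create_snapshot()   # the snapshot itself is near-instant
    return snap                            # only now is the snapshot app-consistent
```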
Microsoft demonstrated that recovering a single file requires a whole new disk to be mounted from the snapshot. The demo showed that the recovered disk could be mounted to any server that can address the iSCSI storage, but it was still a 30 GB disk mounted to recover a 25 KB file – which then had to be drag-and-drop copied off the recovered disk.
Of course, this assumes that:
- Your backup or server administrator is also the person who initiates the StorSimple recovery process (or you have good processes between teams). This process includes rescanning the disks in Computer Management, bringing the disk online and assigning a new drive letter (a scripted sketch of these steps follows this list).
- This administrator will recover the file/s in a timely manner from the newly mounted disk, not access the files directly from the recovered snapshot, and will release the newly mounted disk back to the StorSimple appliance. Otherwise, all the drive letters get consumed by mounted snapshots, or mounted snapshots hang around for ages – or even worse, the recovered snapshot becomes the new “live” location of the data.
- That it’s a file that can be copied!
- That it’s not a VMFS formatted volume with a VMDK – otherwise there are additional steps required to mount the VMDK to a running VM, and then you need more administrators involved!
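To give a feel for what the “rescan, online, assign a letter, copy, release” dance looks like if you script it on the recovering server, here is a sketch using diskpart. The disk number, volume number, drive letter and file paths are placeholders for whatever your environment presents – this is illustrative, not a StorSimple tool.

```python
import shutil
import subprocess

# Placeholder values - the disk/volume numbers, letter and paths depend on your environment.
DISKPART_SCRIPT = """\
rescan
select disk 2
online disk
attributes disk clear readonly
select volume 5
assign letter=R
"""

def mount_recovered_snapshot() -> None:
    # diskpart reads its commands from a script file via /s
    with open("mount_snapshot.txt", "w") as f:
        f.write(DISKPART_SCRIPT)
    subprocess.run(["diskpart", "/s", "mount_snapshot.txt"], check=True)

def recover_file() -> None:
    mount_recovered_snapshot()
    # the actual recovery: copy one 25 KB file off a 30 GB mounted clone
    shutil.copy(r"R:\finance\report.xlsx", r"D:\restores\report.xlsx")
    # ...then remember to offline the disk and release it back on the appliance

if __name__ == "__main__":
    recover_file()
```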
Note: Snapshots are a technology used by backup software to perform a backup of data without being affected by running systems, and to off-load the impact of backup. Snapshots are a tool to use in a backup process, and are not the backup itself.
Tier 3 in the cloud
I’ve got a personal experience that has burnt me on block-level tiering to very slow / high-latency storage – let me explain.
In a SAN implementation (in 2011), the capabilities of the automatic block-level tiering system were sold as a way to put a large capacity of cheap storage behind a much smaller proportion of fast disk (something that is still a sales technique today). We purchased 10% of the capacity as 15K RPM FC disks, with the other 90% on 7.2K RPM SATA disks – connected in a shelf with a 1.5 Gbps fibre backplane interconnect. This worked fine, until after 14 days blocks were moved from the fast disks down to SATA. These blocks included operating system files that were only accessed at boot time and, importantly, other data that was not frequently accessed. This was by design.
Until we needed to do monthly patching. Not only was this very slow (as all OS file data was on the 1.5 Gbps backplane and on slow disks), it also meant that the data for 400 servers was being read and written multiple times as it was patched, so those blocks were promoted back up to the fast disks – filling them. The SAN responded by performing an early background migration of data off the fast FC disks down to SATA, which again saturated the 1.5 Gbps backplane. For the next week, data was still migrating up and down between tiers, causing incredible performance issues for everything.
So we changed the threshold for block tiering to every 28 days instead of 14 days.
Until we did month-end reporting on a large database.
A large amount of data was again moved up from the slower tier to the faster tier (blocks were being accessed multiple times for reports), again filling the faster disks and saturating the 1.5 Gbps connection between Tier 2 and Tier 1.
Yes, this was a design problem, and yes, we did resolve the issue by replacing the 1.5 Gbps SATA interconnect with 6 Gbps SAS technology and significantly increasing the Tier 1 capacity. However, the experience I suffered through is very applicable here.
If you have blocks sitting behind a very slow, high-latency connection, it will slow your entire storage infrastructure.
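The maths behind that statement is straightforward. The latencies and hit ratios below are assumptions for illustration, but they show how even a small percentage of reads landing on a cloud tier dominates the average.

```python
# Back-of-envelope effective read latency when a slice of blocks lives in the cloud.
# All latencies and hit ratios below are assumptions for illustration.
tiers = {
    "ssd":   (0.25,  0.50),   # (latency in ms, fraction of reads)
    "sas":   (8.0,   0.45),
    "azure": (60.0,  0.05),   # even a small fraction of cloud reads dominates
}

effective = sum(latency * fraction for latency, fraction in tiers.values())
print(f"Effective average read latency: {effective:.2f} ms")
# roughly 0.13 + 3.6 + 3.0 = ~6.7 ms; the 5% of reads hitting Azure contribute
# almost as much latency as the 45% hitting local SAS disks.
```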
I shudder to think of the impact on block-level tiered data from a background defragmentation task, a full file-level virus scan (don’t do it! use vShield Endpoint), or other tasks that access / index / touch files in the background – and you may have your own systems or tools too.
Outages?
I’m scared. I like the idea of backup to cloud, and cloud based archive – but block level tiering to cloud? That scares me.
If a backup system is unavailable for an hour, you reschedule your backups or even just skip one. It’s only a problem if it’s unavailable when you need to recover. This is different – individual blocks of a production system are stored in the cloud. If the cloud is slow, so is your server. If the cloud is unavailable (or any point in the 15 hops to get there: any router / switch / firewall / ISP / under-street cable), will your server bluescreen? Will your application crash? Will it freeze?
That’s too big of a risk for me.