SOD: Sidebar Diversion
I couldn’t get the idea out of my head that the Avatar rendering cluster required 1 petabyte of storage. However, this slide show of the facilities used for filming the actors opened my eyes. [eye opening slide show]
The petabyte is required not just for the finished product. It’s needed to store all the sensor and camera data as well. Okay. I accept that Weta needed 1PB. How does one go about creating a petabyte storage facility? What are the tradeoffs? How much does it cost to build and then to maintain?
I need to get this out of my head and free up some brain cycles to continue with my Seeds of Discontent series. This article is a sidebar.
Disclaimer: I’m using a server build derived from the Seeds of Discontent serial (see also, Seeds of Discontent: System Benchmarking Hardware). It isn’t a traditional server room server but it serves as a baseline talking point. Besides, I can get budgetary pricing off newegg.com. It’s good enough for the purposes of this discussion. I’m sure the staff at Weta (or ILM or Dreamworks or Pixar) would have more insight into what *really* works.
I’ll choose the 4U chassis for this exercise since it better approximates the airflow of a desktop chassis. The 50U rack is taller than usual, but it allows for twelve 4U servers plus a switch.
In the Seeds of Discontent serial, the HDD array uses the cheapest drives available; the final size is not important. In this sidebar discussion, final HDD capacity *is* important. This is not a single desktop box but a compute and storage cluster.
Each server functions as a compute unit as well as a storage unit.
A commodity multi-core CPU together with two commodity PCI Express GPU subsystems comprise the compute unit. Primary storage and swap for the compute unit is a two-disk RAID 0 SSD. The GPU cards have their own RAM while the CPU has two triple-channel banks of DDR3.
The server also hosts an HDD array which is not private storage but part of a larger storage cluster. For this exercise, I arbitrarily choose GlusterFS.
I’ve budgeted $2,500 USD per server (less the HDD array).
(circa mid-November 2010)
case+PSU     $  200
mainboard       230
SSD             135
SSD             135
DDR3 24GB       550
CPU             850
GPU             200
GPU             200
--------------------
             $2,500 USD
In addition, I’ve budgeted $1,000 per rack and $1,000 per switch. A fully constructed rack sans HDD is $32,000. YMMV.
rack        $  1,000
switch         1,000
servers       30,000  (2,500 * 12 servers/rack)
--------------------
            $ 32,000 USD
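For anyone who wants to poke at these numbers, here is the roll-up as a quick Python sketch. The part prices are the budgetary newegg figures from the list above, nothing more authoritative than that.

# Back-of-envelope roll-up; prices are the rough newegg figures above.
server_parts = {
    "case+PSU": 200, "mainboard": 230, "SSD x2": 270,
    "DDR3 24GB": 550, "CPU": 850, "GPU x2": 400,
}
server_cost = sum(server_parts.values())                  # $2,500 sans HDD

servers_per_rack = 12
rack_cost = 1000 + 1000 + servers_per_rack * server_cost  # rack + switch + 12 servers

print(f"server (sans HDD): ${server_cost:,}")             # $2,500
print(f"rack   (sans HDD): ${rack_cost:,}")               # $32,000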
Let’s add the HDD to build out a petabyte cluster. Since I’m using the Asus Rampage III (admittedly not a server mainboard), the two GPUs fully consume the PCI Express lanes. There aren’t any lanes left for a RAID card. The two SSD drives occupy the two SATA III channels, leaving the seven SATA II channels for the HDD array. Each server then adds seven HDD (and each rack adds 84 HDD) to the cluster.
As seen from this list (circa mid-November 2010), the lower capacity drives are not the cheapest drives per gigabyte.
SKU               $USD  cnt    GB   $/GB  $K/PB  HDD/PB
---------------   ----  ---  ----  -----  -----  ------
WD5002ABYS-20PK   1750   20   500  0.175    175    2000
WD7502ABYS-20PK   2500   20   750  0.167    167    1333
0A39289-20PK      2700   20  1000  0.135    135    1000
WD7501AALS-20PK   1550   20   750  0.103    103    1333
WD6401AALS-20PK   1300   20   640  0.102    102    1563
WD5000AADS-20PK    900   20   500  0.090     90    2000
0F10381-20PK       900   20   500  0.090     90    2000
WD1001FALS-20PK   1800   20  1000  0.090     90    1000
WD5000AAKS-20PK    880   20   500  0.088     88    2000
WD6400AARS          55    1   640  0.086     86    1563
WD7500AADS-20PK   1250   20   750  0.083     83    1333
ST3500418AS         40    1   500  0.080     80    2000
WD7500AADS          55    1   750  0.073     73    1333
WD20EVDS-20PK     2800   20  2000  0.070     70     500
0F10383-20PK      1400   20  1000  0.070     70    1000
WD10EALS-20PK     1350   20  1000  0.068     68    1000
WD10EARS            65    1  1000  0.065     65    1000
WD10EARS-20PK     1200   20  1000  0.060     60    1000
WD15EARS-20PK     1700   20  1500  0.057     57     667
ST31000528AS        50    1  1000  0.050     50    1000
ST31500341AS        60    1  1500  0.040     40     667
Downselect the cheapest drive at each capacity point greater than or equal to 1TB.
SKU               $USD  cnt    GB   $/GB  $K/PB  HDD/PB
---------------   ----  ---  ----  -----  -----  ------
WD20EVDS-20PK     2800   20  2000  0.070     70     500
ST31000528AS        50    1  1000  0.050     50    1000
ST31500341AS        60    1  1500  0.040     40     667
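The derived columns in these tables are simple ratios. Here is a small Python sketch of how I compute $/GB, $K/PB, and HDD/PB from a pack price (capacities are decimal gigabytes throughout):

def drive_metrics(pack_price_usd, pack_count, drive_gb):
    """Per-drive metrics used in the tables above."""
    price_per_drive = pack_price_usd / pack_count
    usd_per_gb = price_per_drive / drive_gb
    drives_per_pb = round(1_000_000 / drive_gb)      # 1 PB = 1,000,000 GB (decimal)
    usd_k_per_pb = usd_per_gb * 1_000_000 / 1000     # $K of drives to fill 1 PB
    return usd_per_gb, usd_k_per_pb, drives_per_pb

# The three downselected drives:
for sku, price, cnt, gb in [("WD20EVDS-20PK", 2800, 20, 2000),
                            ("ST31000528AS",    50,  1, 1000),
                            ("ST31500341AS",    60,  1, 1500)]:
    per_gb, k_per_pb, per_pb = drive_metrics(price, cnt, gb)
    print(f"{sku:15s}  ${per_gb:.3f}/GB  ${k_per_pb:.0f}K/PB  {per_pb} HDD/PB")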
I want to minimize the number of drives while minimizing costs. The Seagate 1TB drive is both more expensive per GB and requires more drives per PB than the Seagate 1.5TB drive. It is immediately eliminated. The competition is between the Western Digital 2TB drive and the Seagate 1.5TB drive.
If cost were the only issue, then the Seagate drive would win. If drive count were the only issue then the Western Digital drive would win. To get closer to an answer, let’s build out the storage cluster.
Petabyte Cluster with Just a Bunch of Disks

        ======= count ======   ========= cost ==========
Drive     HDD  Servers  Racks      HDD     Rack    Total
-----   -----  -------  -----   -------  -------  -------
2.0TB     500       72      6    70,000  192,000  262,000
1.5TB     667       96      8    40,000  256,000  296,000
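Here is the build-out arithmetic as a Python sketch, using my own rounding choices (round drives, servers, and racks up to whole units). It reproduces the JBOD rows above; the redundancy variants further down reuse the same function with a usable-capacity fraction.

import math

DRIVES_PER_SERVER = 7      # the seven leftover SATA II channels per server
SERVERS_PER_RACK  = 12
SERVER_COST       = 2500   # sans HDD
RACK_OVERHEAD     = 2000   # rack + switch

def build_out(drive_gb, drive_price, usable_fraction=1.0, target_pb=1.0):
    """Drives, servers, racks, and cost for target_pb of usable storage."""
    raw_gb  = target_pb * 1_000_000 / usable_fraction
    drives  = math.ceil(raw_gb / drive_gb)
    servers = math.ceil(drives / DRIVES_PER_SERVER)
    racks   = math.ceil(servers / SERVERS_PER_RACK)
    hdd_cost  = drives * drive_price
    rack_cost = servers * SERVER_COST + racks * RACK_OVERHEAD
    return drives, servers, racks, hdd_cost, rack_cost

# Just a bunch of disks (no redundancy):
print(build_out(2000, 140))   # -> (500, 72, 6, 70000, 192000)
print(build_out(1500,  60))   # -> (667, 96, 8, 40020, 256000)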
Even though the 2TB drives cost $30,000 more than the 1.5TB drives, the total cluster cost is $34,000 more for the 1.5TB drive choice. Furthermore, there is no redundancy to protect against drive failure. How likely is a drive to fail? It’s not just likely to happen. It will happen. If a single drive on average fails once in five years, then a pool of 500 drives will on average see a failure every 1/100 of a year (roughly two drives a week).
Without debate, I posit that it is not possible to make nightly backups of a petabyte storage cluster. Firstly, we’d need a second petabyte. Secondly, that’s a lot of data to move and the cluster needs to continuously run compute jobs (rendering). The solution is redundancy, either on the local machine (e.g., RAID 6) or through GlusterFS replication (analogous to RAID 10 but at the cluster level).
Petabyte Cluster with RAID-6

        ======= count ======   ========= cost ==========
Drive     HDD  Servers  Racks      HDD     Rack    Total
-----   -----  -------  -----   -------  -------  -------
2.0TB     700      100    8.3    98,000  268,000  366,000
1.5TB   1,087      156     13    56,000  416,000  472,000
Petabyte Cluster with RAID-10

        ======= count ======   ========= cost ==========
Drive     HDD  Servers  Racks      HDD     Rack    Total
-----   -----  -------  -----   -------  -------  -------
2.0TB   1,000      144     12   140,000  384,000  524,000
1.5TB   1,334      192     16    80,000  512,000  592,000
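The redundancy rows come from the same sketch with a usable-capacity fraction applied: RAID 6 across the seven-drive array keeps 5/7 of the raw space, and RAID 10 (or two-way GlusterFS replication) keeps half. My rounding does not reproduce every figure in the tables exactly, but it lands close enough for budgetary purposes.

# (continues the build_out() sketch above)
print(build_out(2000, 140, usable_fraction=5/7))   # RAID 6, 2.0TB drives
print(build_out(1500,  60, usable_fraction=5/7))   # RAID 6, 1.5TB drives
print(build_out(2000, 140, usable_fraction=1/2))   # RAID 10 / replica 2, 2.0TB
print(build_out(1500,  60, usable_fraction=1/2))   # RAID 10 / replica 2, 1.5TB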
The drive count really starts to add up, leading to more frequent drive failures. With a RAID 10 (or the equivalent GlusterFS replication scheme), the operations team can expect to replace five to six drives per week.
But the larger question is, how many servers are needed for the compute cluster? What if rendering needed no more than 60 compute nodes? If we fixed the compute node count at 60, we would need to add more drives per server. For the sake of discussion, assume we could load 24 drives per server but that doubles the cost per server before including drive costs (i.e., 2 * $2,500 = $5,000 per server sans HDD).
Furthermore, assume we’re using GlusterFS replication for the storage cluster redundancy. This pushes the drive count up but avoids the complexity of building and maintaining local RAID systems of 24 drives each.
Petabyte 60 Server Cluster with RAID-10

        ======= count ======   ========= cost ==========
Drive     HDD  Servers  Racks      HDD     Rack    Total
-----   -----  -------  -----   -------  -------  -------
2.0TB   1,000       60      5   140,000  160,000  300,000
1.5TB   1,334       60      5    80,000  160,000  240,000
There is a $60,000 capital cost difference between the two clusters. Drive failure rates run a third higher for the 1.5TB drive cluster (21.9 versus 16.4 expected failures per month), but the larger drive costs more per GB.
Drive Failure Rate (5-year time to fail)
Petabyte 60 Server Cluster with RAID-10

        == failure cost ==    HDD   failrate  === month ===
Drive    HDD  labor  subtot  units   (days)   fails    cost
-----   ----  -----  ------  -----  --------  -----  ------
2.0TB    140     70  $  210  1,000     1.825   16.4  $3,518
1.5TB     60     70     130  1,334     1.368   21.9   2,851
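The recurring-cost columns are the same back-of-envelope math as the two-drives-a-week estimate earlier: one expected failure per drive per five years, spread across the whole pool. A Python sketch (my table rounds the monthly cost slightly differently):

def monthly_failure_cost(drive_price, pool_size, labor_usd=70,
                         drive_life_years=5.0, days_per_month=30.0):
    """Expected drive failures per month and what they cost to replace."""
    days_between_failures = drive_life_years * 365 / pool_size
    fails_per_month = days_per_month / days_between_failures
    monthly_cost = fails_per_month * (drive_price + labor_usd)
    return days_between_failures, fails_per_month, monthly_cost

print(monthly_failure_cost(140, 1000))   # ~ (1.825 days, 16.4 fails, ~$3,450/month)
print(monthly_failure_cost( 60, 1334))   # ~ (1.368 days, 21.9 fails, ~$2,850/month)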
These numbers presume that both drives have the same MTBF. Digging a bit further we find that the WD20EVDS claims 1 million hours MTBF and the ST31500341AS claims 750,000 hours MTBF. That is, the operations staff can expect the Seagate drives to fail at a rate 1/3 greater than that of the Western Digital drives.
Sidenote: The two drives are in different classes. The Seagate drive spins at a faster rate (7,200 RPM) and claims performance. The slower Western Digital drive (5,400 RPM) claims consistency and lower power. However, the slower rate is fine for the storage cluster which acts as secondary storage.
I will not attempt to sort out the apples-to-oranges comparisons between the two drive manufacturers. I shall take my previous calculations and adjust the 1.5TB drive failure rate (and hence its recurring cost) up by a third.
Drive Failure Rate (Adjusted Fail Rate)
Petabyte 60 Server Cluster with RAID-10

                        == original ==    == adjusted ==
         fail    HDD    === month ===     === month ===
Drive    cost   units   fails    cost     fails    cost
-----   -----   -----   -----  ------     -----  ------
2.0TB   $ 210   1,000    16.4  $3,518      16.4  $3,518
1.5TB     130   1,334    21.9   2,851      29.2  $3,801
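The adjustment is just the MTBF ratio applied as a multiplier to the 1.5TB failure rate, continuing the sketch above:

# (continues the monthly_failure_cost() sketch above)
mtbf_penalty = 1_000_000 / 750_000                  # Seagate fails ~1/3 more often
days, fails, cost = monthly_failure_cost(60, 1334)
print(fails * mtbf_penalty, cost * mtbf_penalty)    # ~29.2 fails, ~$3,800/month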
A reversal of recurring costs. Does it matter? No. Not really. In my opinion, it’s more important to minimize the day-to-day operation hassles. The three hundred bucks a month (one way or the other) is noise. The $60K difference in initial capital costs is significant but not as significant as reliable operations.
Rumor has it that both Seagate and Western Digital will soon release a 3TB drive.
Update 2010-12-17: Xbit Labs reports on Hitachi’s new sixth-generation perpendicular magnetic recording (PMR) which “enable 3.5″ hard drives with 4TB or even 5TB capacities.”
My final ponderings on this fantasy cluster look at the impact of future HDD capacities. For this, I simply speculate on the cost per gigabyte. If you have pricing information more closely tied to reality, please let me know. 🙂
Drive Failure Rate (unadjusted 5-year rates)
Petabyte 60 Server Cluster with RAID-10 (or equivalent)

                == failure cost ==    HDD   failrate  === month ===
Drive    $/GB    HDD  labor  subtot  units   (days)   fails    cost
-----   -----   ----  -----  ------  -----  --------  -----  ------
5.0TB   0.100    500     70     570    400     4.564    6.6  $3,762
4.0TB   0.090    360     70     430    500     3.651    8.2   3,526
3.0TB   0.080    240     70     310    667     2.737   11.0   3,410
2.0TB   0.070    140     70     210  1,000     1.825   16.4   3,518
1.5TB   0.040     60     70     130  1,334     1.368   21.9   2,851
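Those speculative rows are the same function swept over guessed $/GB figures; remember, the prices here are pure invention on my part.

# (continues the monthly_failure_cost() sketch above); $/GB values are guesses
for gb, usd_per_gb in [(5000, 0.100), (4000, 0.090), (3000, 0.080)]:
    drive_price = gb * usd_per_gb
    pool = round(2 * 1_000_000 / gb)     # 2 PB raw for a replicated 1 PB cluster
    days, fails, cost = monthly_failure_cost(drive_price, pool)
    print(f"{gb/1000:.1f}TB  {pool} drives  {fails:.1f} fails/mo  ${cost:,.0f}/mo")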
This isn’t the end of the tradeoff line. At some point, 3.5 inch HDD will yield to the 2.5 inch form factor. The larger drives just won’t be available. That dynamic will change the equation. GPU cards will become increasingly more capable. CPU core counts will increase. RAM costs will decline. Fewer servers will be needed. Fewer racks. Less power. I’m sure in my naïveté I’ve underestimated much here. However, I do believe one day the entire data center used to build Avatar will fit inside a 40 foot shipping container. And then, some time later–but not much later–that compute power will shrink to fit in a 20 foot container. And so on. The important point is that the capital costs (excluding facilities) for this fantasy cluster are under a million dollars. And it gets cheaper by the day.