One of the scariest things about moving from a known solution like our NetApp filers to a less familiar Linux solution was the question of whether the new system could push data as fast as the old one. Here I provide an overview of our system's specs and some performance numbers.
The NFS cluster nodes have the following specs:
- 1066/1333 MHz Front Side Bus
- Dual 2.33GHz Intel XEON L5410, Quad Core 12MB Cache 1333MHz
- 16GB DDR2-667 Fully Buffered ECC RAM (8 x 2GB)
- Hardware RAID 0/1/5/6/10/50/60: Adaptec 3405 w/ 128MB cache, PCI-E
- RAID 1 Volume: 238 GB (2+0 x 250GB SATA2)
- QLogic SANblade QLE2462 PCI-E Fibre Channel Adapter, 4Gbps, 2-port, optic
- Intel 9402PT Dual Port Copper Gigabit PCI-E Ethernet Card
We're using Xyratex F5412E storage enclosures:
- Dual RAID Controller, 2GB Cache
- Four 4Gb/s FC host connections
- Expansion Enclosure, 3Yr Support
- RAID 10 Volume: 858 GB (6+0 x 300GB SAS)
- RAID 10 Volume: 858 GB (6+0 x 300GB SAS)
- RAID 10 Volume: 858 GB (6+0 x 300GB SAS)
- RAID 10 Volume: 572 GB (4+2 x 300GB SAS)
We were advised by Penguin's engineers to create four smaller volumes rather than give in to the temptation to throw 22 disks into one giant RAID array. They said that they've had customers run into excessive rebuild times with arrays that large (may we never have to rebuild the array).
Each Linux node is connected via Fibre Channel to each of the controllers in the F5412E. One node is active, while the other monitors the status of the cluster. If the NFS service is interrupted (maybe the QLogic adapter fails on the primary), the secondary will kill the primary node via IPMI and take over serving NFS.
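The failover sequence described above (detect the outage, fence the primary, then take over) can be sketched roughly as follows. This is only an illustration of the ordering, not the cluster software actually in use: the three callables, the class name, and the `max_failures` threshold are all hypothetical stand-ins. The one point it tries to capture is that the standby only starts serving after the fence has succeeded, so both nodes never write to the shared volumes at once.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Hypothetical standby-node logic: probe, fence, then take over."""

    nfs_is_healthy: Callable[[], bool]     # hypothetical probe of the primary's NFS service
    fence_primary: Callable[[], bool]      # hypothetical IPMI power-off; True once confirmed
    start_nfs_locally: Callable[[], None]  # hypothetical: claim volumes and service IP, export
    max_failures: int = 3                  # assumed threshold, not taken from the real setup
    failures: int = 0
    active: bool = False

    def tick(self) -> str:
        """Run one monitoring cycle and return the standby's resulting state."""
        if self.active:
            return "active"
        if self.nfs_is_healthy():
            self.failures = 0
            return "standby"
        self.failures += 1
        if self.failures < self.max_failures:
            return "suspect"
        # Fence first: never serve NFS while the primary might still be writing.
        if not self.fence_primary():
            return "fence-failed"
        self.start_nfs_locally()
        self.active = True
        return "active"
```

In a real deployment `tick()` would be driven by a timer, and the probe would have to distinguish a dead NFS service from a dead network path between the two nodes.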
OK -- enough background. Let's talk about what this server is responsible for. It is providing NFS services for:
- 19 web servers (approx 3M pageviews/day, peak of 1M/hour, running a complex PHP application, where each pageview requires about 50-100 PHP files)
- 3 database servers
  - our publishing system's database (14M records, 6.5GB, approx 600 queries/s)
  - a massive sports database (17M records, 5GB, approx 25 queries/s)
  - classifieds system database (1.5M records, 250MB, approx 50 queries/s)
- Flash Media Server
- Mailman
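To get a feel for what the web tier alone can ask of the file server, here is a back-of-the-envelope calculation using the peak pageview rate and the 50-100 PHP files per pageview quoted in the list. The function name is made up for this sketch, and the result is an upper bound on file accesses generated by the web servers, not a measured NFS op count, since client-side caching absorbs some of them.

```python
def file_accesses_per_second(pageviews_per_hour, files_per_pageview):
    """Convert an hourly pageview rate into PHP file accesses per second."""
    return pageviews_per_hour / 3600 * files_per_pageview


# Peak of 1M pageviews/hour, each touching 50 to 100 PHP files.
peak_low = file_accesses_per_second(1_000_000, 50)
peak_high = file_accesses_per_second(1_000_000, 100)

print(f"{peak_low:,.0f} to {peak_high:,.0f} PHP file accesses/s at peak")
```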
Needless to say, this system is very busy. On a typical day, it pushes about 160Mbps on average for 16 hours (in reality the traffic is spiky, running as low as 80Mbps and as high as 480Mbps). During such a day, it handles about 13K NFS ops/s, mostly getattrs (11K) and reads (2K), plus about 300 NFS writes/s.
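For a sense of scale, the figures above can be turned into a daily data volume and an operation mix. This is a rough sketch that assumes decimal units (1 Mbps = 1,000,000 bits/s, 1 TB = 10^12 bytes) and treats the rounded per-operation rates as exact; the helper name is invented for the example.

```python
def bytes_transferred(avg_mbps, hours):
    """Total bytes moved at a given average rate (decimal megabits per second)."""
    bits = avg_mbps * 1_000_000 * hours * 3600
    return bits / 8


# 160 Mbps sustained over the 16 busy hours of a typical day.
daily_bytes = bytes_transferred(160, 16)
daily_tb = daily_bytes / 1e12

# Rounded per-second rates quoted in the text.
ops = {"getattr": 11_000, "read": 2_000, "write": 300}
total = sum(ops.values())
shares = {name: count / total for name, count in ops.items()}

print(f"about {daily_tb:.2f} TB moved per day")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total:,} ops/s")
```

The getattr share dominates the mix, which fits a workload made up mostly of web servers checking whether PHP files have changed.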
With all this going on, its load average hovers between 2 and 5 (with 8 cores, that's not too bad). CPU idle time is around 90%, with about 6% spent in I/O wait and about 4% spent in system.
The highest load I've observed so far is about 20K NFS ops/s, during which its CPU idle time went down to about 85% from the more normal 90% level. The system gets its biggest workout during our database backups, when idle time dips to about 75%.
Based on the fact that we have so much headroom on the CPU idle time, and the fact that we don't appear to be I/O bound, I'm pretty sure that we would saturate our network before we would run out of CPU and disk throughput. In fact, we may have to bond two NICs together to create a 2Gbps link to the switch to really give this thing a workout.
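The network-headroom argument can be checked with simple arithmetic. The sketch below assumes the server currently sits on a single 1Gbps link and that a bonded pair offers a nominal 2Gbps; in practice a bond often balances per client or per flow, so a single client may still be limited to 1Gbps. The function and constant names are illustrative only.

```python
def utilization(traffic_mbps, link_mbps):
    """Fraction of nominal link capacity consumed by the given traffic."""
    return traffic_mbps / link_mbps


GIGABIT_MBPS = 1000  # nominal capacity of one gigabit link (assumed current setup)

# Traffic levels quoted for a typical day.
for label, mbps in [("low", 80), ("average", 160), ("peak", 480)]:
    single = utilization(mbps, GIGABIT_MBPS)
    bonded = utilization(mbps, 2 * GIGABIT_MBPS)
    print(f"{label:>7}: {single:.0%} of one link, {bonded:.0%} of a bonded pair")
```

At the observed peak the single link is roughly half full, which is consistent with the network being the first resource to run out if traffic doubled.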
Part 1: transitioning to an open-source nfs server
Part 3: the gotchas we encountered when deploying this system




