Experiences in the Home Lab With Micron P420m PCIe SSD, Part 2

By Rob Peglar - 2015-06-02

Hello again, and greetings from my home lab, where I’ve had the pleasure of running Micron’s P420m PCIe solid state drive (SSD). In my last post, I described my initial experiences (installation, setup, and monitoring) with the SSD in a VMware ESXi 6.0 environment. I’ve since completed quite a bit of testing, mostly measuring performance by various methods, both synthetic and application-oriented. I’d like to summarize what I’ve found, and do a bit of comparison as well.

I’ll spare you the details of my setup, which were covered in my last post. One thing to emphasize before I dive into the actual results is the importance of knowing both the CPU capability within a given ESXi host and the I/O queueing mechanism (and how to control it) within ESXi. The latter turns out to be essential to achieving maximum performance from the P420m, since its capability is far beyond that of an “ordinary” SSD.

Before I proceed, here’s a small editorial for you. I’m not a fan of 4KB block size testing in general. Personally, I think 4KB testing is rather pointless today, and I wish all SSD providers would agree that showing “hero numbers” using 4KB I/Os doesn’t represent how the device will perform in the real world. So, I’m not trying to wow you with “hero numbers” here, but rather to show how important controlling queue depth is in an ESXi environment. Running 4KB I/Os makes that importance especially visible.

Synthetic Workload Test Setup

The first results I’ll show compare running a synthetic workload at device queue depth 31 and device queue depth 255. Note that during my tests, the adapter queue depth was 255 and the device was fully preconditioned (24 hours’ worth) to ensure steady state behavior. For this purpose, I used two well-known synthetic benchmarking utilities: SQLIO and IOMETER. Both run on Windows, so my test virtual machine (VM) ran Microsoft Windows 7 Ultimate.

Observations at Queue Depth 31

In my first test, I used queue depth 31, set with the following command:

esxcli storage core device set -m 31 -O 31 -d <P420M device name>

Note, it’s important to set both the actual device queue depth (-m) as well as the number of outstanding I/O requests with competing worlds (-O).  Check out part 1 of this blog to learn how to get the P420M device name parameter in ESXi 6.0.
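If you need to apply the same setting at several depths or on several hosts, it can help to generate the command rather than retype it. Here is a minimal Python sketch that only builds the argument list for the command shown above. The helper name `build_queue_depth_command` and the device name in the usage line are hypothetical placeholders, and the 1 to 255 range check is an assumption based on the depths used in this post. It does not execute anything on the host.

```python
def build_queue_depth_command(device_name, depth):
    """Return the esxcli argument list that sets both queue-depth values.

    -m is the device queue depth and -O is the number of outstanding
    I/O requests with competing worlds, as described in the post.
    """
    # Assumption: valid depths are 1..255 (255 is the largest used in this post).
    if not isinstance(depth, int) or not 1 <= depth <= 255:
        raise ValueError("depth must be an integer between 1 and 255")
    if not device_name:
        raise ValueError("device_name must not be empty")
    return [
        "esxcli", "storage", "core", "device", "set",
        "-m", str(depth),
        "-O", str(depth),
        "-d", device_name,
    ]


if __name__ == "__main__":
    # "example-device" is a placeholder, not a real P420m device name.
    print(" ".join(build_queue_depth_command("example-device", 31)))
```

Setting `-m` and `-O` from a single `depth` argument keeps the two values in step, which is the point made above about setting both.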

At queue depth 31, I ran a 2-vCPU, 16GB vRAM VM that sent a 100% random write, 4KB, 2-burst workload through one worker thread for 15 minutes at various outstanding I/O counts. The target was one Windows drive letter (GPT format/NTFS) backed by the P420m, which was formatted as a VMFS datastore and also held the VM image itself. Here’s what I observed:

[Table at device queue depth 31. Columns: Outstanding I/Os, Average Latency (µs)]

Notice the pattern: at device/world depth 31, the P420m SSD reaches its maximum efficiency between 8 and 16 outstanding I/Os. (Note, I only tested powers of 2.) At 32 outstanding I/Os, you can see the efficiency drop, with IOPS decreasing and latency increasing. Having said that, even at 64 outstanding I/Os, where roughly half the I/Os are blocked from even entering the device queue, 54,000+ IOPS at half a millisecond isn’t bad at all, but 61,000 at one-sixteenth of a millisecond is even better! In addition, these writes are 100% random across nearly the entire surface (600GB out of 700GB) at 4KB block size, which is nearly worst-case for NAND SSD devices.
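To make the "half the I/Os are blocked" point concrete, here is a small Python sketch of a deliberately simplified model. It assumes a single VM issuing I/Os and treats the device queue depth as a hard cap on how many can be in the device queue at once. The function name `split_outstanding` is my own, and the model ignores the adapter queue and everything else in the I/O path.

```python
def split_outstanding(outstanding_ios, device_queue_depth):
    """Split outstanding I/Os into those in the device queue and those waiting.

    Simplified model: at most device_queue_depth I/Os can be in the
    device queue; anything beyond that has to wait outside it.
    """
    in_device_queue = min(outstanding_ios, device_queue_depth)
    waiting = outstanding_ios - in_device_queue
    return in_device_queue, waiting


if __name__ == "__main__":
    # Powers of 2, matching the outstanding I/O counts tested in the post.
    for outstanding in (1, 2, 4, 8, 16, 32, 64):
        admitted, waiting = split_outstanding(outstanding, 31)
        share = 100.0 * waiting / outstanding
        print(f"{outstanding:>3} outstanding: {admitted:>2} in device queue, "
              f"{waiting:>2} waiting ({share:.0f}% held back)")
```

Under this model, 64 outstanding I/Os against a depth of 31 leaves 31 in the device queue and 33 waiting, which is where "half" comes from. At 32 outstanding I/Os, a single I/O is already held back.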

Observations at Queue Depth 255

Let’s compare the results after I issued the following command for queue depth 255:

esxcli storage core device set -m 255 -O 255 -d <P420M device name>

[Table at device queue depth 255. Columns: Outstanding I/Os, Average Latency (µs)]

Notice the difference, all of it enabled by a single ESXi command. One of the distinct advantages of using the P420m in an ESXi environment is that it’s designed for large queue depths (up to 255). The “sweet spot” is much more efficient: there are more IOPS at reduced latency relative to running at a small queue depth. And the effect becomes more noticeable as more parallelism is applied. At low outstanding I/O counts, there’s not much difference, but once the count gets to 8 and above, the difference is significant. Moral of the story: use the highest maximum queue depth setting possible on your P420m.

Understanding Latency Distribution

It’s important to understand the latency distribution in order to fully comprehend the variability of response times. As shown in my Latency Distribution graph below, at device queue depth 31 and 8 outstanding I/Os, 52,727,650 I/Os were issued during the 15-minute test. Of those I/Os, 52,651,003 (99.855%) completed in 1000µs (1ms) or less. The figure rises to 99.994% for those completed in 2ms or less.

My next Latency Distribution graph below shows the comparison case: device queue depth 255 and 16 outstanding I/Os. Here, 72,105,855 I/Os were issued, with 72,070,578 (99.9511%) completed in 1ms or less. The figure rises to 99.9952% for I/Os completed in 2ms or less. Clearly, the P420m is at its best when parallelism is in play.
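The 1ms percentages can be rechecked directly from the I/O counts quoted above. This short Python sketch does that arithmetic. The function name `percent_within` and the run labels are mine, and the counts are the only inputs taken from the tests.

```python
def percent_within(completed, issued):
    """Percentage of issued I/Os that completed within the latency bound."""
    return 100.0 * completed / issued


# (completed within 1ms, total issued), as quoted in the text above.
runs = {
    "queue depth 31, 8 outstanding": (52_651_003, 52_727_650),
    "queue depth 255, 16 outstanding": (72_070_578, 72_105_855),
}

if __name__ == "__main__":
    for label, (completed, issued) in runs.items():
        slower = issued - completed
        print(f"{label}: {percent_within(completed, issued):.4f}% within 1ms, "
              f"{slower:,} I/Os slower than 1ms")
```

The counts work out to 76,647 I/Os slower than 1ms in the first run and 35,277 in the second, so the deeper queue produced fewer slow I/Os even though it issued about 19 million more in the same 15 minutes.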

In my next post, I’ll cover 32KB block size testing (a more realistic scenario) as well as a mainstream application scenario. In the meantime, let us know what you think of these numbers. Tweet us @MicronStorage or me directly @peglarr.

Rob Peglar
