/*
 * ZFS Fragmentation problem
 *
 * Gouverneur Thomas
 *
 * Last update: 09 June 2011
 */

During the past days, I've been troubleshooting some zpools under heavy database usage. The issue reported by the customer was a huge performance decrease after the last reboot of the cluster. It was eventually identified as a ZFS fragmentation problem. I'll now share the lessons learned from this issue.

First of all, the performance problem was investigated with a drill-down method: we used TeamQuest to visualize the differences between the current week and the week before. What we discovered is that the I/O on the pool holding the DBF files of the database had literally exploded. We sometimes had more than 50K write IOPS balanced across the different vdevs of the affected pool. The pool was configured like this:

------[ SNIP ]------
# zpool status i-ora-pro06-dat1-pl
  pool: i-ora-pro06-dat1-pl
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        i-ora-pro06-dat1-pl                        ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C3d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C3d0  ONLINE       0     0     0
          mirror-1                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C4d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C4d0  ONLINE       0     0     0
          mirror-2                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C5d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C5d0  ONLINE       0     0     0
          mirror-3                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C6d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C6d0  ONLINE       0     0     0

errors: No known data errors
------[ SNIP ]------

The SAN disks behind the pool were able to handle a lot of I/O, and the SAN itself was checked for any problem; clearly, the issue was the heavy IOPS load on the LUNs. Following this, we ran "zpool iostat -v i-ora-pro06-dat1-pl 2" for a while to confirm what we were thinking, and it did confirm the heavy write load on the vdevs. Still with TeamQuest, we were able to see that the write operations hitting the disks were actually very small blocks.

We then opened a support case with Oracle/Sun and uploaded some GUDS traces, which exposed the problem we faced. Here is the complete explanation of this problem, the way to detect it, and the fix.

Basically, the problem is called "ZFS fragmentation". How could this happen, when there is no mention of fragmentation in any of the documents available on ZFS? Consequently, there are no tools to detect and deal with ZFS fragmentation. To understand how this fragmentation can degrade a pool to the point where it becomes unusable, I'll now talk about the ZIL (ZFS Intent Log) and the way it is handled inside the pool. But I'll start with some basic information on how the Oracle database using the pool was behaving.

1. Oracle Database Behaviour

The database running on the problematic zpool was one of the most important ones in my client's company. It was considered a "realtime" database, which simply means that a lot of INSERTs were done every second. This database was also monitoring something very critical for the company's business, and their concern was a delay of one hour between the frontend's order to INSERT and the actual insert being done in the database. At the Oracle DB level, nothing was seen except the fact that the database writer threads were holding the write operations due to huge service times on the disks.
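As a quick reference, the per-vdev write load can be re-checked at any time with the zpool iostat command mentioned above, and the per-LUN service times can be watched from the host with the standard Solaris iostat (the asvc_t column is the average service time per I/O, in milliseconds). This is just a generic sketch, not output from this case:

------[ SNIP ]------
# zpool iostat -v i-ora-pro06-dat1-pl 2
# iostat -xnz 5
------[ SNIP ]------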
The Oracle DB itself uses the disks below it in such a way that after every write operation, the database calls fsync(), which causes the operating system to commit the write to disk immediately. This is the case for every Oracle database.

2. The ZIL behaviour

Some of you may already know it: ZFS has replaced the old journal of the UFS filesystem with something called the ZIL, for ZFS Intent Log. Basically, every operation on a pool triggers a ZIL transaction to assure the requester of the operation that it has completed successfully. So, in the case of a write, a ZIL block is created containing the whole operation that will be executed; once this ZIL transaction is done, the requester is told that the write was successful, even if the data hasn't actually been written to the pool yet. The ZIL transaction ensures it will be done properly. After the ZIL transaction is committed, the corresponding ZIL blocks are freed, releasing the space that was temporarily taken on disk by this transaction. Other mechanisms happen at the ZIL level, but they will be covered later in this article.

At each unmount of a ZFS filesystem, the ZIL is checked and committed to disk, then completely freed. At each mount, ZFS checks for the presence of ZIL entries, which would indicate that the filesystem wasn't unmounted properly; any ZIL entries found are committed before the filesystem is mounted. Also good to know: if the requester issues fsync(), the ZIL transaction is forcibly written to disk instead of being handled in memory.

3. How does fragmentation happen?

As the database is doing a lot of INSERTs, the direct consequence is that a lot of ZIL transactions are created and temporarily written to the pool's disks, then deleted. It is easy to understand that allocating a lot of tiny blocks for ZIL transactions and then freeing them will not help the real data of the pool to be written sequentially on the disks. But so far, that is not a problem in itself. After running this way for some time, the moment comes when the ZIL can no longer allocate contiguous blocks for its transaction entries. Then the gang block mechanism kicks in.

Gang blocks, as the ZIL uses them here, are just fragmented ZIL blocks. A gang block consists of a header and up to three gang members. The header is just a container of pointers to its gang members, which actually hold the data of the ZIL transaction. The header only consumes one block and is thus allocatable whatever the fragmentation level of the pool. Another particularity is that gang blocks can be nested, so gang members can themselves be gang blocks. While using gang blocks, the ZIL is forced to load the whole gang tree into memory each time it needs it. When the ZIL transaction is committed to disk, the deletion of the gang block also implies the deletion of every gang member and its children.

This is where fragmentation occurs! Every ZIL transaction that is allocated and then freed on disk causes gaps to appear between ZIL entries and the pool's data. It also explains the huge number of write operations reported by our earlier zpool iostat: these write operations load the disks and eat into the real bandwidth that could be used by the application behind the pool.

4. How to see the problem while it's happening?

To see the problem, we should look at how the ZIL is behaving. If we only see ZIL transactions being created and deleted in a normal way, we can deduce that no extreme fragmentation is happening. Otherwise, we can go deeper.
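Before going deeper, a rough first look at ZIL activity is simply to count ZIL commits and see which processes drive them. The DTrace one-liner below is a minimal sketch of mine, not part of the original GUDS traces, and it assumes zil_commit() is visible through the fbt provider on your Solaris release. A database that calls fsync() after every write, as described above, will show up with a high count:

------[ SNIP ]------
# dtrace -qn 'fbt::zil_commit:entry { @[execname] = count(); } tick-30s { exit(0); }'
------[ SNIP ]------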
To figure out whether ganging is actually causing problems at a system-wide level, just use lockstat:

# /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 2

You will be able to see entries like these if there actually is a gang block problem on the system:

------[ SNIP ]------
# grep gang lockstat-*
lockstat-C.out:   32768 |@@                                26  zio_gang_tree_assemble_done+0x74
lockstat-C.out:  262144 |                                   7  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    4096 |@@@@@                            146  zio_gang_tree_issue+0x78
lockstat-C.out:    8192 |@@@@@@@@@@@@@@@@@@@@@            586  zio_gang_tree_issue+0x78
lockstat-C.out:   16384 |                                  13  zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_issue+0x48
lockstat-C.out:    2048 |                                  14  zio_gang_tree_issue+0x78
lockstat-C.out:    4096 |@@@@@@@                          192  zio_gang_tree_issue+0x78
lockstat-C.out:    8192 |@@@@@@@@@@@@@@@@@@@@             496  zio_gang_tree_issue+0x78
lockstat-C.out:   16384 |                                   1  zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_issue+0x48
lockstat-C.out:    2048 |@@                                11  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   16384 |@@@@                              18  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    4096 |@                                  2  zio_gang_tree_assemble+0x5c
lockstat-C.out:    8192 |@@@@@@@@@@                        15  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   65536 |                                   0  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    2048 |@@                                 3  zio_gang_tree_assemble+0x5c
lockstat-C.out:    4096 |@@@@                               6  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   32768 |                                   2  zio_gang_tree_assemble_done+0x74
lockstat-C.out:  262144 |                                   1  zio_gang_tree_assemble_done+0x74
lockstat-C.out:           zio_gang_tree_assemble_done+0x74
------[ SNIP ]------

From that point, you can conclude that you actually have a gang block problem.

5. Fixing the issue

To completely fix the issue, as with any fragmented UFS filesystem, the hard way is to recreate the pool. To keep the issue from coming back, the solution is to add a mirrored (or not) log device to the pool; this way, the ZIL transactions are written to that device and no longer fragment the data vdevs of the pool. Something that may work, depending on the case, is to add a log device online and wait a bit for the in-progress ZIL transactions to be flushed; new ones will then be written to the separate device (see the command sketch after section 6).

NOTE: This last solution can help performance, but it will not de-fragment the data vdevs, which look like a Gouda cheese after the ZIL has made its mess. Still, it lets you deal with the issue quickly: recreating the pool needs a downtime, adding a device does not.

6. Proactive actions

You can act proactively on this issue by checking from time to time which pools on your systems are actually doing gang blocks:

------[ SNIP ]------
# dtrace -qn 'fbt::zio_gang_tree_issue:entry { @[pid]=count(); }' -c "sleep 300"

    26574               61
        0             7141
    26575            18949
    26570           416399

# ps -eaf|egrep "26574|26575|26570"
    root 26574     0   0   May 26 ?        2778:02 zpool-i-ora-pro06-arc1-pl
    root 26570     0   0   May 26 ?        9155:49 zpool-i-ora-pro06-dat1-pl
    root 26575     0   0   May 26 ?         574:51 zpool-i-ora-pro06-rdo1-pl
#
------[ SNIP ]------

NOTE: These pools may not yet be suffering from ganging, as some ganging is sometimes normal. But with a little experience with this issue, you should be able to identify the pools that will actually give you trouble with ganging, and then act on those pools by adding log devices proactively, before ZIL-induced fragmentation becomes a real performance issue.
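Referring back to the fix described in section 5, the general shape of the command to add a mirrored log device online is sketched below. The two device names are placeholders, not the LUNs from this setup; afterwards, zpool status should show them under a separate "logs" section:

------[ SNIP ]------
# zpool add i-ora-pro06-dat1-pl log mirror c6tXXXXXXXXd0 c6tYYYYYYYYd0
------[ SNIP ]------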
7. Thanks

 * OpenIndiana source code
 * #opensolaris-fr @ freenode

8. References

 * http://blogs.oracle.com/perrin/
 * http://wildcat.espix.org/txt/refs/zio.c
 * http://wildcat.espix.org/tmp/guds_2.9.1
 * http://blogs.oracle.com/realneel/entry/the_zfs_intent_log
 * http://wildcat.espix.org/tmp/gang_blocks.txt
 * http://sunsolve.espix.org/bugid/id/6598837
 * http://sunsolve.espix.org/bugid/id/6596237