/*
 * ZFS Fragmentation problem
 *
 * Gouverneur Thomas
 *
 * Last update: 09 June 2011
 */

During the past days, I've been troubleshooting some zpools under heavy database usage. The issue reported by the customer was a huge performance decrease after the last reboot of the cluster. It was eventually identified as a ZFS fragmentation problem. I'll now share the lessons learned from this issue.

First of all, the performance problem was investigated with a drill-down method: we used TeamQuest to visualize the differences between the current week and the week before. What we discovered is that the I/O on the pool holding the DBF files of the database had literally exploded. We sometimes had more than 50K write IOPS balanced across the different vdevs of the affected pool. The pool was configured like this:

------[ SNIP ]------
# zpool status i-ora-pro06-dat1-pl
  pool: i-ora-pro06-dat1-pl
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        i-ora-pro06-dat1-pl                        ONLINE       0     0     0
          mirror-0                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C3d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C3d0  ONLINE       0     0     0
          mirror-1                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C4d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C4d0  ONLINE       0     0     0
          mirror-2                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C5d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C5d0  ONLINE       0     0     0
          mirror-3                                 ONLINE       0     0     0
            c6t60060E800571FC00000071FC000020C6d0  ONLINE       0     0     0
            c6t60060E800570FB00000070FB000020C6d0  ONLINE       0     0     0

errors: No known data errors
------[ SNIP ]------

The SAN disks behind the pool were able to handle a lot of I/O, and the SAN itself was checked for any problem; clearly, the issue was the heavy IOPS load on the LUNs. Following this, we ran "zpool iostat -v i-ora-pro06-dat1-pl 2" for a while to confirm what we were thinking, and it did confirm the heavy write load on the vdevs. Still with TeamQuest, we were able to see that the write operations hitting the disks were actually very small blocks.

We then opened a support case with Oracle/Sun and uploaded some GUDS traces, which exposed the problem we faced. Here is the complete explanation of this problem, the way to detect it, and the fix.

Basically, the problem is called "ZFS fragmentation". How could this happen, when there is no mention of fragmentation in any of the documents available on ZFS? Consequently, there are no tools to detect and deal with ZFS fragmentation. To understand how this fragmentation can degrade a pool to the point where it becomes unusable, I'll now talk about the ZIL (ZFS Intent Log) and the way it is handled inside the pool. But I'll start with some basic information on how the Oracle database using the pool was behaving.

1. Oracle Database Behaviour

The database running on the problematic zpool was one of the most important ones in my client's company. It was considered a "realtime" database, which simply means that a lot of INSERTs were done every second. This database was also monitoring something very critical for the company's business, and their concern was a delay of one hour between the frontend's order to INSERT and the actual insert being done in the database. At the Oracle DB level, nothing was seen except the fact that the database writer threads were holding the write operations due to huge service times on the disks.
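As a quick reference, the per-vdev write load can be re-checked at any time with the zpool iostat command mentioned above, and the per-LUN service times can be watched from the host with the standard Solaris iostat (the asvc_t column is the average service time per I/O, in milliseconds). This is just a generic sketch, not output from this case:

------[ SNIP ]------
# zpool iostat -v i-ora-pro06-dat1-pl 2
# iostat -xnz 5
------[ SNIP ]------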
The Oracle DB itself uses the disks below it in such a way that after every write operation, the database calls fsync(), which causes the operating system to commit the write to disk immediately. This is the case for every Oracle database.

2. The ZIL behaviour

Some of you may already know it: ZFS has replaced the old journal of the UFS filesystem with something called the ZIL, for ZFS Intent Log. Basically, every operation on a pool triggers a ZIL transaction to assure the requester of the operation that it has completed successfully. So, in the case of a write, a ZIL block is created containing the whole operation that will be executed; once this ZIL transaction is done, the requester is told that the write was successful, even if the data hasn't actually been written to the pool yet. The ZIL transaction ensures it will be done properly. After the ZIL transaction is committed, the corresponding ZIL blocks are freed, releasing the space that was temporarily taken on disk by this transaction. Other mechanisms happen at the ZIL level, but they will be covered later in this article.

At each unmount of a ZFS filesystem, the ZIL is checked and committed to disk, then completely freed. At each mount, ZFS checks for the presence of ZIL entries, which would indicate that the filesystem wasn't unmounted properly; any ZIL entries found are committed before the filesystem is mounted. Also good to know: if the requester issues fsync(), the ZIL transaction is forcibly written to disk instead of being handled in memory.

3. How does fragmentation happen?

As the database is doing a lot of INSERTs, the direct consequence is that a lot of ZIL transactions are created and temporarily written to the pool's disks, then deleted. It is easy to understand that allocating a lot of tiny blocks for ZIL transactions and then freeing them will not help the real data of the pool to be written sequentially on the disks. But so far, that is not a problem in itself. After running this way for some time, the moment comes when the ZIL can no longer allocate contiguous blocks for its transaction entries. Then the gang block mechanism kicks in.

Gang blocks, as the ZIL uses them here, are just fragmented ZIL blocks. A gang block consists of a header and up to three gang members. The header is just a container of pointers to its gang members, which actually hold the data of the ZIL transaction. The header only consumes one block and is thus allocatable whatever the fragmentation level of the pool. Another particularity is that gang blocks can be nested, so gang members can themselves be gang blocks. While using gang blocks, the ZIL is forced to load the whole gang tree into memory each time it needs it. When the ZIL transaction is committed to disk, the deletion of the gang block also implies the deletion of every gang member and its children.

This is where fragmentation occurs! Every ZIL transaction that is allocated and then freed on disk causes gaps to appear between ZIL entries and the pool's data. It also explains the huge number of write operations reported by our earlier zpool iostat: these write operations load the disks and eat into the real bandwidth that could be used by the application behind the pool.

4. How to see the problem while it's happening?

To see the problem, we should look at how the ZIL is behaving. If we only see ZIL transactions being created and deleted in a normal way, we can deduce that no extreme fragmentation is happening. Otherwise, we can go deeper.
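Before going deeper, a rough first look at ZIL activity is simply to count ZIL commits and see which processes drive them. The DTrace one-liner below is a minimal sketch of mine, not part of the original GUDS traces, and it assumes zil_commit() is visible through the fbt provider on your Solaris release. A database that calls fsync() after every write, as described above, will show up with a high count:

------[ SNIP ]------
# dtrace -qn 'fbt::zil_commit:entry { @[execname] = count(); } tick-30s { exit(0); }'
------[ SNIP ]------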
To figure out whether ganging is actually causing problems at a system-wide level, just use lockstat:

# /usr/sbin/lockstat -CcwP -n 50000 -D 20 -s 40 sleep 2

You will be able to see entries like these if there actually is a gang block problem on the system:

------[ SNIP ]------
# grep gang lockstat-*
lockstat-C.out:   32768 |@@                                26  zio_gang_tree_assemble_done+0x74
lockstat-C.out:  262144 |                                   7  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    4096 |@@@@@                            146  zio_gang_tree_issue+0x78
lockstat-C.out:    8192 |@@@@@@@@@@@@@@@@@@@@@            586  zio_gang_tree_issue+0x78
lockstat-C.out:   16384 |                                  13  zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_issue+0x48
lockstat-C.out:    2048 |                                  14  zio_gang_tree_issue+0x78
lockstat-C.out:    4096 |@@@@@@@                          192  zio_gang_tree_issue+0x78
lockstat-C.out:    8192 |@@@@@@@@@@@@@@@@@@@@             496  zio_gang_tree_issue+0x78
lockstat-C.out:   16384 |                                   1  zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_tree_issue+0x78
lockstat-C.out:           zio_gang_issue+0x48
lockstat-C.out:    2048 |@@                                11  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   16384 |@@@@                              18  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    4096 |@                                  2  zio_gang_tree_assemble+0x5c
lockstat-C.out:    8192 |@@@@@@@@@@                        15  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   65536 |                                   0  zio_gang_tree_assemble_done+0x74
lockstat-C.out:    2048 |@@                                 3  zio_gang_tree_assemble+0x5c
lockstat-C.out:    4096 |@@@@                               6  zio_gang_tree_assemble_done+0x74
lockstat-C.out:   32768 |                                   2  zio_gang_tree_assemble_done+0x74
lockstat-C.out:  262144 |                                   1  zio_gang_tree_assemble_done+0x74
lockstat-C.out:           zio_gang_tree_assemble_done+0x74
------[ SNIP ]------

From that point, you can conclude that you actually have a gang block problem.

5. Fixing the issue

To completely fix the issue, as with any fragmented UFS filesystem, the hard way is to recreate the pool. To keep the issue from coming back, the solution is to add a mirrored (or not) log device to the pool; this way, the ZIL transactions are written to that device and no longer fragment the data vdevs of the pool. Something that may work, depending on the case, is to add a log device online and wait a bit for the in-progress ZIL transactions to be flushed; new ones will then be written to the separate device (see the command sketch after section 6).

NOTE: This last solution can help performance, but it will not de-fragment the data vdevs, which look like a Gouda cheese after the ZIL has made its mess. Still, it lets you deal with the issue quickly: recreating the pool needs a downtime, adding a device does not.

6. Proactive actions

You can act proactively on this issue by checking from time to time which pools on your systems are actually doing gang blocks:

------[ SNIP ]------
# dtrace -qn 'fbt::zio_gang_tree_issue:entry { @[pid]=count(); }' -c "sleep 300"

    26574               61
        0             7141
    26575            18949
    26570           416399

# ps -eaf|egrep "26574|26575|26570"
    root 26574     0   0   May 26 ?        2778:02 zpool-i-ora-pro06-arc1-pl
    root 26570     0   0   May 26 ?        9155:49 zpool-i-ora-pro06-dat1-pl
    root 26575     0   0   May 26 ?         574:51 zpool-i-ora-pro06-rdo1-pl
#
------[ SNIP ]------

NOTE: These pools may not yet be suffering from ganging, as some ganging is sometimes normal. But with a little experience with this issue, you should be able to identify the pools that will actually give you trouble with ganging, and then act on those pools by adding log devices proactively, before ZIL-induced fragmentation becomes a real performance issue.
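Referring back to the fix described in section 5, the general shape of the command to add a mirrored log device online is sketched below. The two device names are placeholders, not the LUNs from this setup; afterwards, zpool status should show them under a separate "logs" section:

------[ SNIP ]------
# zpool add i-ora-pro06-dat1-pl log mirror c6tXXXXXXXXd0 c6tYYYYYYYYd0
------[ SNIP ]------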
7. Thanks

 * OpenIndiana source code
 * #opensolaris-fr @ freenode

8. References

 * http://blogs.oracle.com/perrin/
 * http://wildcat.espix.org/txt/refs/zio.c
 * http://wildcat.espix.org/tmp/guds_2.9.1
 * http://blogs.oracle.com/realneel/entry/the_zfs_intent_log
 * http://wildcat.espix.org/tmp/gang_blocks.txt
 * http://sunsolve.espix.org/bugid/id/6598837
 * http://sunsolve.espix.org/bugid/id/6596237