Thursday 27 March 2014

OrientDB on ZFS - Performance Analysis

Introduction

The main goal of this article is to analyze the performance of the OrientDB Graph Database when deployed on different ZFS filesystem setups.
The test is general enough to apply to any operating system that supports ZFS: Solaris, illumos-based systems, FreeBSD, Mac OS X, and so on.
Almost all the DTrace scripts and one-liners used for the analysis come from, or are inspired by, the DTraceToolkit (DTTK) by Brendan Gregg.

NOTE 2014/06/30:
For an (almost) real use case, read here.

Environment

  • Server IBM xSeries 346
  • 12286 MB RAM
  • 2 CPU x86 Intel Xeon (GenuineIntel F4A Family 15 model 4 step 10 clock 3000MHz)
  • 2 HDD 279.40 GB (IBM-ESXS-MAW3300NC FN-C206)
  • Solaris 11.1
  • OrientDB 1.6.4 Community Edition
  • Oracle JDK 1.7.51 - 64 bit

Method

I compared the database import time from a JSON backup file across different ZFS configurations.
I used a backup of a prototype modelled by the company I work for. Here are some details:
  • 2 indexes
  • 129 clusters
  • 130 classes
  • 21304 Total links
  • 21305 records
I know, this is a little bit trivial, but at the moment this is all I can do without affecting my daily job.

If someone from Orient Technologies wants to give me other material, let’s say a bigger database and a corresponding set of queries, I’ll be happy to repeat my tests.

In order to perform my tests, I configured a non-global zone named orientdbZone.

Before every test, I executed these commands from the global zone (see the sketch after these lists):
  • shutdown orientdbZone
  • destroy zfs partitions related to orientdbZone
  • create zfs partitions related to orientdbZone, with new parameters
  • boot orientdbZone
  • login into the orientdbZone
  • start OrientDB server
  • start OrientDB console
  • create new database
After every test, I executed these commands from the orientdbZone:
  • drop database
  • shutdown OrientDB Server
  • logout from the orientdbZone
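As mentioned above, here is a hedged sketch of the global-zone part of that cycle as plain shell commands; the recordsize is the parameter that changes from run to run, and the pool/dataset names are the ones used in the tests below.

# reset cycle between two test runs (sketch)
zoneadm -z orientdbZone shutdown
zfs destroy databases/datafiles
zfs create -o mountpoint=legacy -o recordsize=32k databases/datafiles   # new parameters go here
zoneadm -z orientdbZone boot
zlogin orientdbZone
# inside the zone: start server.sh, start console.sh, create the new database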

Preliminary investigations

First of all, I tried to understand what the OrientDB server does during its database import process, so I collected some useful information using this simple DTrace one-liner:

root@globalZone:~# dtrace -n 'fsinfo:::write { @[args[0]->fi_mount] = quantize(arg1); }'
dtrace: description 'fsinfo:::write ' matched 1 probe



 /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases
          value  ------------- Distribution ------------- count
              0 |                                         0
              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@               8811
              2 |                                         0
              4 |                                         0
              8 |@                                        407
             16 |                                         77
             32 |@                                        182
             64 |                                         6
            128 |                                         0
            256 |                                         13
            512 |                                         12
           1024 |                                         26
           2048 |                                         47
           4096 |                                         35
           8192 |                                         0
          16384 |                                         0
          32768 |                                         0
          65536 |@@@@@@@@@@@@                             4125
         131072 |                                         0



Unfortunately, the write byte sizes aren’t well distributed; they mainly fall into two buckets:
  • 1-byte writes (8811 times)
  • 65536-byte writes (4125 times)
This is even more evident when comparing the average size with the standard deviation.
I wrote a simple writeSizeStats.d DTrace program:

#!/usr/sbin/dtrace -Zs

/* remember the file descriptor passed to every write-family syscall */
syscall::*write*:entry
{
    self->fd = arg0;
}

/* on return, aggregate the written byte count for the target mount point ($$1) */
syscall::*write*:return
/fds[self->fd].fi_mount == $$1/
{
    @media[probefunc] = avg(arg1);
    @devStd[probefunc] = stddev(arg1);
}

dtrace:::END
{
    printa("\n avg %s --> %@d", @media);
    printa("\n stddev %s --> %@d", @devStd);
}

root@globalZone:~# ./writeSizeStats.d /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases
avg write --> 11335
avg pwrite --> 48896
stddev write --> 24784
stddev pwrite --> 28396

The standard deviation values are high compared to the averages, so I expect to observe a different behavior only by setting the ZFS recordsize to 32k or less; still, the only way to find the right compromise is to experiment.
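For the record, checking and changing the recordsize is a one-liner each; a sketch against the dataset created in the next section (keep in mind that a new recordsize only applies to files written after the change):

# check the current recordsize of the datafiles dataset
zfs get recordsize databases/datafiles
# lower it on an existing dataset (only new files are affected)
zfs set recordsize=32k databases/datafiles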

NOTE 2014/06/30:
For an (almost) real use case, read here.

Test #1 - OrientDB datafiles in a dedicated disk with ZFS partition

Preparation

I created a new zpool on the second HDD, then repeatedly created a new ZFS dataset in it, each time with a different recordsize parameter, and assigned it to the orientdbZone:

root@globalZone:~# zpool list
NAME        SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool       278G  19.1G   259G   6%  1.17x  ONLINE  -
root@globalZone:~# zpool status
 pool: rpool
state: ONLINE
 scan: none requested
config:


       NAME      STATE     READ WRITE CKSUM
       rpool     ONLINE       0     0     0
         c7t0d0  ONLINE       0     0     0


errors: No known data errors



root@globalZone:~# zpool create databases c7t1d0
root@globalZone:~# zpool list
NAME        SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
databases  3.72G   158K  3.72G   0%  1.00x  ONLINE  -
rpool       278G  19.1G   259G   6%  1.17x  ONLINE  -


root@globalZone:~# zfs create -o mountpoint=legacy databases/datafiles


root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/datafiles
zonecfg:orientdbZone:fs> set dir=/opt/orientdb-community-1.6.4/databases
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
root@globalZone:~# zoneadm -z orientdbZone boot
root@globalZone:~# zlogin orientdbZone


root@orientdbZone:~# /opt/orientdb-community-1.6.4/bin/server.sh &
root@orientdbZone:~# /opt/orientdb-community-1.6.4/bin/console.sh

Test

orientdb> create database remote:localhost/kubique root root plocal graph
orientdb> import database /opt/kubique.json -preserveClusterIDs=false

The following table shows the database import time, expressed in milliseconds, for each ZFS recordsize:

recordsize    128k     64k      32k      16k      8k       4k       2k       1k       512b
import time   183909   184766   163104   162458   164102   165402   168672   170987   180703



As expected, only with a ZFS recordsize below 64k do we get a valuable reduction in process time.

It is interesting to note that with a ZFS recordsize that is too small, the overall performance gets worse.

I thought this was due to the ZFS copy-on-write (COW) integrity strategy, which implies a checksum for every block, verified when the block is read. So I executed another run with the minimum allowed recordsize and the ZFS checksum feature disabled:

root@globalZone:~# zfs destroy databases/datafiles
root@globalZone:~# zfs create -o mountpoint=legacy -o checksum=off -o recordsize=512 databases/datafiles

But the database import took practically the same time as before:

orientdb> import database /opt/kubique.json -preserveClusterIDs=false
[...]
Database import completed in 180257 ms
orientdb>

We can therefore deduce that the overhead is mainly due to filesystem bookkeeping.
Using another DTrace one-liner, I verified my suspicion by counting the number of interrupts during the database import:

(recordsize=128k - default)
root@globalZone:~# dtrace -n 'fbt::do_interrupt:entry { @[execname] = count(); }'
dtrace: description 'fbt::do_interrupt:entry ' matched 1 probe



 [...]
 zpool-databases                                             820
 java                                                      59766


(recordsize=512 bytes)
root@globalZone:~# dtrace -n 'fbt::do_interrupt:entry { @[execname] = count(); }'
dtrace: description 'fbt::do_interrupt:entry ' matched 1 probe



 [...]
 zpool-databases                                            5873
 java                                                      67579


Helped by procsystime from the DTTK, I was able to measure how the syscall times grow when the recordsize is small and the number of blocks is larger.
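I don’t reproduce the full output here, but an invocation along these lines reports elapsed and on-CPU syscall times for the java process, with totals (procsystime ships with the DTraceToolkit; double-check the flags against your copy of the script):

# elapsed/CPU time per syscall for every process named "java"; Ctrl-C when the import ends
./procsystime -aT -n java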

Test #2 - OrientDB data files in a dedicated disk with ZFS compression feature enabled

Preparation

I destroyed and re-created the databases/datafiles ZFS dataset, this time with the compression feature enabled:

root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zfs destroy databases/datafiles
root@globalZone:~# zfs create -o mountpoint=legacy -o compression=on databases/datafiles
root@globalZone:~# zoneadm -z orientdbZone boot
root@globalZone:~# zlogin orientdbZone


root@orientdbZone:~# /opt/orientdb-community-1.6.4/bin/server.sh &
root@orientdbZone:~# /opt/orientdb-community-1.6.4/bin/console.sh

Test

orientdb> create database remote:localhost/kubique root root plocal graph
orientdb> import database /opt/kubique.json -preserveClusterIDs=false


The following table shows the database import time, expressed in milliseconds, for each ZFS recordsize:

recordsize    128k     64k      32k      16k      8k       4k       2k       1k       512b
import time   180865   184543   161304   162872   164029   167760   175149   188992   215314

Compared to the previous test, the import time trend is almost the same, except for recordsizes below 4k.
I think this is because the database import involves several read (and decompression) syscalls and several COWs, hence many decompress-copy-compress-checksum tasks.

However, we must look at the major effect of an integrated compression function: the space saved on disk.
So we get a great advantage in terms of disk usage, at a negligible CPU time cost.
It is interesting to note that with a ZFS recordsize that is too small we again get the worst performance, maybe because ZFS has to store metadata for each block, and with a much greater number of blocks we waste storage resources.
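ZFS itself can quantify the saving; a minimal check on the datafiles dataset might look like this (names as used in this article):

# compression ratio and space actually allocated for the datafiles dataset
zfs get compressratio,used,referenced databases/datafiles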

But OrientDB has its own compression strategy, based on the Google Snappy library, so what happens if we disable this function while keeping the ZFS compression feature enabled? Let’s check.

Change OrientDB main configuration file

root@orientdbZone:~# vim /opt/orientdb-community-1.6.4/config/orientdb-server-config.xml


<properties>
[...]
<entry value="nothing" name="storage.compressionMethod"/>
[...]
</properties>


import time table (ms)
recordsize   128k     64k      32k      16k      8k       4k       2k       1k       512b
snappy on    180865   184543   161304   162872   164029   167760   175149   188992   215314
snappy off   181089   180243   162731   161978   163660   167663   178134   188672   216892


disk usage table (KB)
recordsize   128k    64k     32k     16k     8k      4k      2k      1k      512b
snappy on    5312    5625    5490    5231    5264    5365    6392    9320    11202
snappy off   5320    5643    5530    5300    5399    5688    6502    9723    12775

There are no meaningful differences, neither in process time nor in disk usage.
The database import is an I/O-bound process, whereas compression is CPU bound, and perhaps my database has too few records to bring out any difference.
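One quick way to support the I/O-bound claim is per-thread microstate accounting on the OrientDB JVM while the import runs; a sketch, where <pid> stands for the java server process id:

# USR/SYS show CPU time, SLP/LAT show time spent waiting: an import that is
# mostly SLP/LAT with low USR is I/O bound rather than CPU bound
prstat -mL -p <pid> 5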

Test #3 - OrientDB data files in a dedicated disk and WAL in another disk

Using the plocal storage engine, OrientDB ensures data integrity by leveraging a Write Ahead Log (WAL) system.
We can specify a different filesystem for the WAL files, avoiding read/write contention with the datafiles.

Furthermore, we can observe the behavior of each filesystem separately and apply a different tuning to each.

Preparation

I repeated some tests in order to verify any differences in write byte sizes between the two filesystems.

Create a path for WAL

root@orientdbZone:~# mkdir /opt/orientdb-community-1.6.4/wal

Change OrientDB main configuration file

root@orientdbZone:~# vim /opt/orientdb-community-1.6.4/config/orientdb-server-config.xml


<properties>
[...]
<entry value="/opt/orientdb-community-1.6.4/wal" name="storage.wal.path"/>
[...]
</properties>

Create a new ZFS filesystem in another storage pool, dedicated to the Write Ahead Log

root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zfs destroy databases/datafiles
root@globalZone:~# zfs create -o mountpoint=legacy databases/datafiles
root@globalZone:~# zfs create -o mountpoint=legacy rpool/wal
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=rpool/wal
zonecfg:orientdbZone:fs> set dir=/opt/orientdb-community-1.6.4/wal
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
root@globalZone:~# zoneadm -z orientdbZone boot

I ran the database import a couple of times and collected byte size information using this rw_bytes.d DTrace program:

#!/usr/sbin/dtrace -Zs

/* remember the file descriptor passed to every write-family syscall */
syscall::*write*:entry
{
    self->fd = arg0;
}

/* aggregate counts, total bytes and size distribution for the mount point ($$1) */
syscall::*write*:return
/fds[self->fd].fi_mount == $$1/
{
    @syscalls[fds[self->fd].fi_fs, probefunc] = count();
    @bytes[fds[self->fd].fi_fs, probefunc, fds[self->fd].fi_mount] = sum(arg1);
    @distrib[fds[self->fd].fi_fs, probefunc, fds[self->fd].fi_mount] = quantize(arg1);
}

/* clear the thread-local fd once the matching syscall returns */
syscall::*write*:return
/fds[self->fd].fi_mount == $$1/
{
    self->fd = 0;
}

dtrace:::END
{
    printa("\n %s %s %s %@d", @distrib);
    printa("\n  %s %s %s --> %@d bytes", @bytes);
    printa("\n numberOf %s %s --> %@d", @syscalls);
}



root@globalZone:~# ./rw_bytes.d /zones/orientdbZone/root/opt/orientdb-community-1.6.4/wal
dtrace: script './rw_bytes.d' matched 13 probes


CPU     ID                    FUNCTION:NAME
 0      2                             :END
zfs write /zones/orientdbZone/root/opt/orientdb-community-1.6.4/wal
         value  ------------- Distribution ------------- count
             0 |                                         0
             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           5260
             2 |                                         0
             4 |                                         0
             8 |                                         0
            16 |                                         0
            32 |                                         0
            64 |                                         0
           128 |                                         0
           256 |                                         0
           512 |                                         0
          1024 |                                         0
          2048 |                                         0
          4096 |                                         0
          8192 |                                         0
         16384 |                                         0
         32768 |                                         0
         65536 |@@@@@@@@@@                               1819
        131072 |                                         0


zfs write /zones/orientdbZone/root/opt/orientdb-community-1.6.4/wal --> 119215244 bytes
numberOf zfs write --> 7079



root@globalZone:~# ./rw_bytes.d /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases
dtrace: script './rw_bytes.d' matched 13 probes


CPU     ID                    FUNCTION:NAME
 1      2                             :END
zfs write /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases
         value  ------------- Distribution ------------- count
             0 |                                         0
             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    3276
             2 |                                         0
             4 |                                         0
             8 |                                         8
            16 |@                                        77
            32 |@@                                       182
            64 |                                         6
           128 |                                         0


zfs pwrite /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases
         value  ------------- Distribution ------------- count
             0 |                                         0
             1 |@@@                                      275
             2 |                                         0
             4 |                                         0
             8 |@@@@@                                    399
            16 |                                         0
            32 |                                         0
            64 |                                         0
           128 |                                         0
           256 |                                         13
           512 |                                         12
          1024 |                                         26
          2048 |@                                        47
          4096 |                                         35
          8192 |                                         0
         16384 |                                         0
         32768 |                                         0
         65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           2423
        131072 |                                         0


zfs write /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases --> 13572 bytes
zfs pwrite /zones/orientdbZone/root/opt/orientdb-community-1.6.4/databases --> 159165824 bytes
numberOf zfs pwrite --> 3230
numberOf zfs write --> 3549

Unfortunately, the write size distribution is almost the same as before for both filesystems.

Test

I conducted many tests, methodically combining different WAL and datafile ZFS recordsizes.
I hope the following is a meaningful summary.

The following table compares the process time (expressed in ms) when there is no dedicated WAL disk and when there is one:




                                no wal disk   wal zfs recordsize=128k
datafile zfs recordsize=128k    183909        184860
datafile zfs recordsize=64k     184766        183197
datafile zfs recordsize=32k     163104        183844
datafile zfs recordsize=16k     162458        184245
datafile zfs recordsize=8k      164102        183099
datafile zfs recordsize=4k      165402        183132
datafile zfs recordsize=2k      168672        185219
datafile zfs recordsize=1k      170987        188295
datafile zfs recordsize=512B    180703        193063

The results suggest that the datafile ZFS recordsize doesn’t affect the process time when a WAL partition is configured. The only way to improve the overall performance is to tune the WAL ZFS partition.
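In practice this means tuning only the WAL dataset and leaving the datafiles alone; a sketch using the layout above:

# tune the WAL dataset only, then verify both recordsizes
zfs set recordsize=16k rpool/wal
zfs get recordsize rpool/wal databases/datafiles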



The following table confirms the hypothesis:


wal zfs recordsize    import time (ms), datafile zfs recordsize=128k
no wal disk           183909
128k                  184860
64k                   182213
32k                   161547
16k                   161700
8k                    162424
4k                    161680
2k                    161815
1k                    167806
512B                  170075




If we tune the WAL ZFS recordsize we gain the best performance, no matter what the datafile ZFS tuning is.

This is a consequence of OrientDB’s ACID transaction support: OrientDB uses the WAL for synchronous writes, while the datafiles are updated asynchronously.
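This can also be observed directly: while the import runs, ZIL commits (the kernel mechanism behind synchronous writes) should be attributed almost entirely to the java process. A rough one-liner, assuming the fbt provider exposes zil_commit on your kernel build:

dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'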

When the ZFS compression feature is enabled on one of the two filesystems, or on both, it affects the process time only if the recordsize is less than 8k, making it worse.



Test #4 - OrientDB data files in a dedicated disk, WAL in another zpool with a dedicated ZIL disk

Since all the writes on the WAL partition are synchronous, we can go further if the WAL-related zpool has a dedicated disk for the ZIL (ZFS Intent Log).
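Once the log device is in place, per-vdev statistics are an easy way to verify that it is really absorbing the synchronous writes during the import; for example:

# per-vdev I/O statistics for the databases pool, refreshed every 5 seconds
zpool iostat -v databases 5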

Preparation

My test server has only two disks, so I used a 4GB USB flash disk as an additional resource.

root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zpool destroy databases
root@globalZone:~# zfs destroy rpool/wal
root@globalZone:~# zpool create databases c7t1d0 log c10t0d0p0 <-- (USB device)
root@globalZone:~# zfs create -o mountpoint=legacy databases/wal
root@globalZone:~# zfs create -o mountpoint=legacy rpool/datafiles
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> remove fs
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/wal
zonecfg:orientdbZone:fs> set dir=/opt/orientdb-community-1.6.4/wal
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=rpool/datafiles
zonecfg:orientdbZone:fs> set dir=/opt/orientdb-community-1.6.4/databases
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit


root@globalZone:~# zpool status databases
 pool: databases
state: ONLINE
 scan: none requested
config:


       NAME         STATE     READ WRITE CKSUM
       databases    ONLINE       0     0     0
         c7t1d0     ONLINE       0     0     0
       logs
         c10t0d0p0  ONLINE       0     0     0


errors: No known data errors

Test

The following table shows the results:

wal zfs recordsize   128k     64k      32k      16k      8k       4k       2k       1k       512b
import time (ms)     162068   161338   163077   161251   161384   164044   164626   165946   170488



A ZIL disk is to the zpool what the WAL partition is to the datafile partition: the WAL ZFS recordsize no longer matters, just as the datafile ZFS recordsize doesn’t matter when you use a separate WAL partition.

It is interesting to note that the best performance in this context is practically equal to the best performance in the other contexts, because I always used the same disks.

If you intend to use a separate ZIL disk, it must be considerably faster than the WAL disk, and well optimized for write workloads.

As a proof of concept, I repeated this test using my very slow USB disk as ZIL disk:

root@globalZone:~# zpool status databases
 pool: databases
state: ONLINE
 scan: none requested
config:


       NAME            STATE     READ WRITE CKSUM
       databases       ONLINE       0     0     0
         c10t0d0p0     ONLINE       0     0     0
       logs
         c7t1d0        ONLINE       0     0     0


errors: No known data errors


Database import completed in 322240 ms

It doesn’t matter how fast your WAL disk is: if the pool has a dedicated ZIL disk, the ZIL disk’s speed dominates.

Be careful about this: fast disks can be very expensive, and you need more than one disk for the ZIL, because you need fault tolerance too.
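If you do go down this road, the log device itself should be redundant; the device names below are placeholders, shown only to illustrate the syntax:

# attach a mirrored pair of log devices to the pool (c8t0d0/c9t0d0 are hypothetical)
zpool add databases log mirror c8t0d0 c9t0d0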

As a final test I verified what happens if I change the ZFS logbias default setting.
From the ZFS man page:
“(logbias) Controls how ZFS optimizes synchronous requests for this dataset. If logbias is set to latency, ZFS uses the pool's separate log devices, if any, to handle the requests at low latency. If logbias is set to throughput, ZFS does not use the pool's separate log devices. Instead, ZFS optimizes synchronous operations for global pool throughput and efficient use of resources. The default value is latency.”
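As an aside, logbias is an ordinary dataset property, so it can also be changed on an existing dataset instead of re-creating it; a sketch:

zfs set logbias=throughput databases/wal
zfs get logbias databases/wal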

So I re-created the databases zpool without the ZIL disk, then re-created the ZFS datasets:


root@globalZone:~# zfs create -o mountpoint=legacy -o recordsize=32k -o logbias=throughput databases/wal
root@globalZone:~# zfs create -o mountpoint=legacy -o recordsize=32k -o logbias=throughput rpool/datafiles


Database import completed in 203302 ms

I got the worst performance even with the best recordsize tuning.

Conclusion

  • Use the ZFS recordsize=32k (recordsize=16k is good as well);
  • Use a separate zpool for the WAL;
  • Use the ZFS compression=on for the datafiles partition: the process time remains the same but you can save precious disk space;
  • Consider adding a ZIL disk to the WAL zpool, but only if you are experiencing a serious performance drop. Maybe you can save money by adding another instance in a distributed topology instead.
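Putting these recommendations together, a minimal setup might look like the sketch below; pool, dataset and device names follow the ones used throughout this article, so adapt them to your own hardware.

# dedicated pool and dataset for the WAL (second disk)
zpool create databases c7t1d0
zfs create -o mountpoint=legacy -o recordsize=32k databases/wal
# datafiles dataset on the system pool, with compression to save space
zfs create -o mountpoint=legacy -o recordsize=32k -o compression=on rpool/datafiles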
NOTE 2014/06/30:
For an (almost) real use case, read here, especially about ZFS recordsize considerations!
