The first time I analyzed the OrientDB performances related to ZFS tuning, I used a truly simple "
import database" experiment.
This time, thanks to
+Luigi Dell'Aquila from Orient Technologies LTD, we can analyze an (almost) real workload, filling an empty database with more than a million of vertexes and edges from scratch.
I used the
same environment (host, disks, OS, jdk, ...) of the first analysis, but with OrientDB 1.7.4 community edition.
For general considerations, it was useful
this reading.
Workload description
The assumption is that
a graph can be expressed as a series of triplets, each of which representing a "Vertex → Edge → Vertex" relation.
From
DBPedia web site, you can download many RDF files representing the Wikipedia's Italian version. My choice was the first file in the list: "Article Categories", whose first lines are shown below as an example:
<http://it.dbpedia.org/resource/Anni_1940> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1900> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1930> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
An "Importer" Java program reads each line, checks if the first and third elements already exist, creates the vertexes and the edge between them.
Here the main source code:
public class Importer {
public static void main(String[] args) throws Exception {
if (args.length != 4) {
System.out
.println("sintassi corretta: Importer
");
return;
}
creaSchema(args[0], args[1], args[2], args[3]);
importa(args[0], args[1], args[2], args[3]);
}
private static void creaSchema(String fileName, String dbUrl, String user, String pw) throws Exception {
OrientGraphNoTx db = new OrientGraphNoTx(dbUrl, user, pw);
OSchema schema = db.getRawGraph().getMetadata().getSchema();
OIndexManagerProxy indexMgr = db.getRawGraph().getMetadata().getIndexManager();
OClass nodo = null;
if (schema.existsClass("Nodo")) {
nodo = schema.getClass("Nodo");
} else {
nodo = schema.createClass("Nodo", schema.getClass("V"));
}
if (!nodo.existsProperty("name")) {
nodo.createProperty("name", OType.STRING);
}
if (!indexMgr.existsIndex("Nodo_name")) {
nodo.createIndex("Nodo_name", INDEX_TYPE.UNIQUE, "name");
}
OClass arco = null;
if (schema.existsClass("Arco")) {
arco = schema.getClass("Arco");
} else {
arco = schema.createClass("Arco", schema.getClass("E"));
}
if (!arco.existsProperty("name")) {
arco.createProperty("name", OType.STRING);
}
db.shutdown();
}
private static void importa(String fileName, String dbUrl, String user, String pw) throws Exception {
File file = new java.io.File(fileName);
FileInputStream input = new java.io.FileInputStream(file);
BufferedReader reader = new java.io.BufferedReader(new java.io.InputStreamReader(input));
OrientGraph gdb = new OrientGraph(dbUrl, user, pw);
reader.readLine();
String result = reader.readLine();
while (result != null && result.trim().length() > 0) {
String[] parts = result.split(" ");
if (parts.length > 3 && parts[3].equals(".")) {
Vertex vertex1 = createVertex(parts[0], gdb);
Vertex vertex2 = createVertex(parts[2], gdb);
Edge edge = vertex1.addEdge("Arco", vertex2);
edge.setProperty("name", parts[1]);
gdb.commit();
}
result = reader.readLine();
}
gdb.commit();
gdb.shutdown();
}
private static Vertex createVertex(String name, OrientGraph gdb) {
Iterable queryResult = gdb.command(new OSQLSynchQuery("select from Nodo where name = ?")).execute(name);
for (Vertex v : queryResult) {
return v;
}
Vertex result = gdb.addVertex("class:Nodo");
result.setProperty("name", name);
return result;
}
}
Before starting the loading process, the Importer program checks if the basic schema already exists, if not it will be created.
Here the instructions if you prefer to create the schema using the OrientDB console:
CREATE DATABASE remote:localhost/dbpedia root root plocal
CONNECT remote:localhost/dbpedia admin admin
CREATE CLASS Nodo EXTENDS V
CREATE PROPERTY Nodo.name STRING
CREATE INDEX Nodo_name on Nodo(name) UNIQUE_HASH_INDEX
CREATE CLASS Arco EXTENDS E
CREATE PROPERTY Arco.name STRING
At last, you can execute the Java program:
java -cp "$ORIENTDB_HOME/lib/orientdb-server-1.7.4.jar:$ORIENTDB_HOME/lib/*:." Importer /opt/article-categories.txt remote:localhost/dbpedia admin admin
When the Import Java program finishes, you have quite a lot of records:
orientdb {dbpedia}> select count(*) from V
----+------+------
# |@RID |count
----+------+------
0 |#-1:-1|942949
----+------+------
orientdb {dbpedia}> select count(*) from E
----+------+-------
# |@RID |count
----+------+-------
0 |#-1:-1|1390193
----+------+-------
Workload characterization
As a single threaded process, we expect that the "Importer" program is going to load the host at its limits, without generating queues on its main resources.
For the purpose of this article is not very interesting to make a deep workload analysis but, since the whole process last around 1 hour, I ran the "Importer" program capturing system activities with sar, recording metrics every 60 seconds, for 60 minutes:
sar -A -o temp 60 60
First, let's verify the disk load:
root@globalZone:~# sar -d -f temp
[...]
Average device %busy avque r+w/s blks/s avwait avserv
adpu3200 0 0.0 0 0 0.0 0.0
adpu3201 18 1.2 67 18397 0.0 17.8
sd0 14 1.1 35 14287 0.0 30.7
sd0,a 0 0.0 0 0 0.0 0.0
sd0,b 14 1.1 35 14287 0.0 31.1
sd1 5 0.1 29 4110 0.0 3.8
sd1,a 5 0.1 29 4110 0.0 3.8
In particular, the column "avwait" (average wait time) confirms our assumption.
Further confirmations come from "queue lenght" e "CPU utilization" reports:
root@globalZone:~# sar -q -f temp
[...]
runq-sz %runocc swpq-sz %swpocc
Average 1.3 35 0.0 0
root@globalZone:~# sar -u -f temp
[...]
%usr %sys %wio %idle
Average 48 14 0 38
Getting started
OrientdbZone configuration
I have 2 disks, each in one zpool, each zpool dedicated to a specific function: data files and Write Ahead Log:
root@globalZone:~/# zpool status
pool: databases
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
databases ONLINE 0 0 0
c7t1d0 ONLINE 0 0 0
errors: No known data errors
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c7t0d0 ONLINE 0 0 0
errors: No known data errors
I created a zfs file system for each zpool, without special option:
root@globalZone:~/# zfs create -o mountpoint=legacy databases/wal
root@globalZone:~/# zfs create -o mountpoint=legacy rpool/datafiles
I configured my orientdbZone in order to properly use these zfs file systems:
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=rpool/datafiles
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/databases
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/wal
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/wal
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
Before every test, I shutdown the orientdbZone, destroyed-recreated the zfs file systems, rebooted the orientdbZone.
Preliminary analysis
Using this simple DTrace program, I ran the "Import" Java program the first time only for gathering information about the page sizes written on the file system:
#! /usr/sbin/dtrace -Zs
syscall::*write*:entry
{
self->fd = arg0
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
@distrib[fds[self->fd].fi_fs, probefunc, fds[self->fd].fi_mount] = quantize(arg1);
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
self->fd = 0;
}
dtrace:::END
{
printa("\n %s %s %s %@d", @distrib);
}
The distribution is quite regular:
CPU ID FUNCTION:NAME
1 2 :END
zfs write /path/to/orientdb/databases
value ------------- Distribution ------------- count
0 | 0
1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 60
2 | 0
4 | 0
8 | 0
16 | 0
32 |@@@ 5
64 | 0
zfs write /path/to/orientdb/wal
value ------------- Distribution ------------- count
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 102876
131072 | 0
zfs pwrite /path/to/orientdb/databases
value ------------- Distribution ------------- count
0 | 0
1 | 5
2 | 0
4 | 0
8 | 5
16 | 0
32 | 0
64 | 0
128 | 0
256 | 0
512 | 0
1024 | 0
2048 | 0
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 289755
131072 | 0
So, this time we cannot have doubts: we have to setup the zfs recordsize=64k.
As verified with the
previous analysis,
if you are using a dedicated WAL file system, the most important factor is the write() time to this file system.
Test #1: Compression
Having fixed the workload and related ZFS recordsize, I tried to compare different behaviors with different compression strategies.
OrientDB has its own compression strategies, based on snappy or gzip libraries.
ZFS has different compression algorithms:
- lzjb - optimized for performances
- gzip-N (1 ≤ N ≤ 9) - best compression if N > 5
- zle - fast, but only compresses sequences of zeros
The default compression factor for gzip is 6.
Of course we have to evaluate the compression ratio, related to the whole execution time:
| Seconds | Megabytes |
| Total Import Time | disk usage wal | disk usage data files |
zfs compression=on | 3648 | 1911 | 242 |
zfs compression=gzip | 4831 | 1147 | 146 |
zfs default, OrientDB gzip=on | 4568 | 4090 | 838 |
zfs default, snappy=on | 3692 | 4090 | 813 |
zfs compression=zle | 3536 | 2803 | 594 |
The OrientDB compression strategies seem not so useful in execution time nor in disk space saved.
In general we see that the I/O time isn't so important for this kind of workload, and the time saved in writes are balanced by the time spent in compression, so we cannot save much execution time, but we can save a lot of disk space:
The best compromise is to use ZFS default compression, combined with OrientDB default settings.
Test #2: Backup
Even dealing with backups, you have to compare OrientDB and ZFS compression strategies.
Here is the ODB config file part related to the backup options:
<handler class="com.orientechnologies.orient.server.handler.OAutomaticBackup">
<parameters>
<parameter name="enabled" value="false" />
<parameter name="delay" value="4h" />
<parameter name="target.directory" value="backup" />
<parameter name="target.fileName" value="${DBNAME}-${DATE:yyyyMMddHHmmss}.zip" />
<parameter name="db.include" value="" />
<parameter name="db.exclude" value="" />
<parameter name="compressionLevel" value="9"/>
<parameter name="bufferSize" value="1048576"/>
</parameters>
</handler>
The default compression level is set to the maximum value (value=0 means no compression).
First of all, we have to verify the page size distribution when OrientDB save the backup file:
root@globalZone:~/# ./rw_bytesBackup.d
dtrace: script './rw_bytesBackup.d' matched 13 probes
CPU ID FUNCTION:NAME
0 2 :END
zfs write /zones/orientdbZone/root/opt/orientdb-community-1.7.4/backup
value ------------- Distribution ------------- count
65536 | 0
131072 | 1
262144 | 0
524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 811
1048576 | 0
OrientDB saves the backup file writing only pages of 512 KB
How much time we can save using the right recordsize for the ZFS backup partition?
Not so much...
Since the
OAutomaticBackup class puts the data files in read only mode, my opinion is to try to complete the backup process as soon as possible, therefore we shouldn't use compression at all:
| exec time (ms) |
benchmark | 287872 |
ZFS compression=on | 31803 |
ZFS compression=gzip-9 | 41654 |
NO ZFS compression | 25704 |
NO ZFS compression, recordsize=512k | 20333 |
if data files are compressed | 15024 |
But, what about disk space?
We cannot waste so much disk space, so my suggestion is to create a new ZFS partion having the most powerful compression algorithm (compression=gzip-9) aside the backup partition, then, using a scheduled script, move the backup files into the archive partition:
root@globalZone:~/# zfs create -o mountpoint=legacy -o compression=gzip-9 databases/archive
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/archive
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/archive
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zoneadm -z orientdbZone boot
root@globalZone:~# zlogin orientdbZone
root@orientdbZone:~# du -k /opt/orientdb/backup/dbpedia-20140625182046.json.gz
831031 /opt/orientdb/backup/dbpedia-20140625182046.json.gz
root@orientdbZone:~# time mv /opt/orientdb/backup/dbpedia-20140625182046.json.gz /opt/orientdb/archive/
real 2m19.710s
user 0m0.004s
sys 0m5.678s
root@orientdbZone:~# du -k /opt/orientdb/archive/dbpedia-20140625182046.json.gz
98764 /opt/orientdb/archive/dbpedia-20140625182046.json.gz
This way
the compression process is decoupled and become asynchronous, and you can achieve the best performances in execution time and space saved.
Of course, you can use an external ZFS storage mounted via NFS.
Here we can underline one important difference between Solaris ZFS and open-zfs, used in Illumos-based systems, FreeBSD and MAC OSX.
Whereas open-zfs has a some useful
extra features, the Solaris version gives you the opportunity to use a greater recordsize, very useful for such a stuff:
root@globalZone:~/# zfs set recordsize=1M databases/archive
root@orientdbZone:~/# time mv dbpedia-20140627101455.json.gz /opt/orientdb/archive/
real 0m17.464s
user 0m0.004s
sys 0m6.186s
Another, more expensive, strategy is to have a backup dedicated instance in a distributed topology, so that the main instances never slow down because of the backup process.
Test #3: Lucene indexes
Few weeks ago
+Enrico Risa developed an
OrientDB plugin which gives you the opportunity to use Lucene engine for full-text and spatial indexes:
orientdb> connect remote:localhost/dbpedia admin admin
Connecting to database [remote:localhost/dbpedia] with user 'admin'...OK
orientdb {dbpedia}> CREATE INDEX Nodo.content on Nodo (name) FULLTEXT ENGINE LUCENE;
Creating index...
2014-06-27 14:43:48:380 INFO - Rebuilding index dbpedia.Nodo.content (estimated 942949 items)... [OIndexRebuildOutputListener]
2014-06-27 14:43:58:383 INFO --> 1.13% progress, 10,653 indexed so far (1,065 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:10:663 INFO --> 12.20% progress, 115,065 indexed so far (10,441 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:20:716 INFO --> 26.91% progress, 253,784 indexed so far (13,871 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:30:717 INFO --> 44.41% progress, 418,770 indexed so far (16,498 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:40:718 INFO --> 60.86% progress, 573,886 indexed so far (15,511 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:50:719 INFO --> 80.00% progress, 754,379 indexed so far (18,049 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:00:849 INFO --> 98.30% progress, 926,889 indexed so far (17,251 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:01:896 INFO --> OK, indexed 942,949 items in 73,516 ms [OIndexRebuildOutputListener]
Created index successfully with 942949 entries in 74.220001 sec(s).
Index created successfully
Despite
it isn't ready for production environments yet, I've found his work very interesting, surely it is for my projects!
Let's start investigating its behavior.
As always, the first run is for the the page size distribution analysis during the index creation, having previously created a dedicated ZFS partition:
root@globalZone:~/# ./rw_bytesIndex.d
dtrace: script './rw_bytesIndex.d' matched 13 probes
CPU ID FUNCTION:NAME
0 2 :END
zfs write /your/path/to/dbpedia/luceneIndexes
value ------------- Distribution ------------- count
4 | 0
8 | 2
16 | 5
32 | 30
64 | 1
128 | 23
256 | 13
512 | 14
1024 | 30
2048 | 11
4096 | 72
8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8983
16384 | 0
So
we have to use recordsize=8k fot this partition.
Here the performance comparison among different ZFS tuning:
| creation time (ms) | disk usage (KB) |
recodsize=8k | 82738 | 36800 |
default | 82963 | 36800 |
compression=on | 82028 | 22764 |
Writing only 36 MB in 82 seconds, doesn't let you to observe the gaining of having the right recordsize, because the process spend the most part of the time in reading the source and applying the archive algorithm.
In particular, our source isn't very suitable for this test, because every property indexed is a single word, with underscores instead of spaces.
In fact, the whole test isn't so meaningful, except for the disk space you can save, because, in my experience, a full-text index must be very fast in reading, not in writing.
Imagine you want to index the content of hundreds of thousands of PDF documents: you may can accept an asynchronous new document loading (and indexing), but surely you want a fast enough search system.
In this case, my suggestion is to use a dedicated SSD device.
Extra considerations
- Using ZFS file system can reduce the cost and complexity of managing data and storage devices. Storage product capabilities often overlap ZFS features, and my suggestion is to rely on ZFS ones, because they give you more flexibility. but if you are comfortable with traditional hardware RAID protection, then use it!
- Since ZFS copy-on-write feature relocates all writes to free disk space, keeping a certain amount o free blocks allows ZFS to easily find space in big chunks, and allows ZFS to aggregate writes and reduce the write IOPS demand on hard disks.
- ZFS manages synchronous writes internally with the ZFS intent log (ZIL). By default, synchronous writes are stored in the ZIL at low latency and shortly thereafter (within approximately 5 seconds), ZFS does a bulk update of the pool state by committing a transaction group. Another way exists for the ZIL to handle synchronous write activity, which is of slightly higher latency, but also involves almost half as much traffic on the channel to the storage array. This mode is triggered by the logbias=throughput value (thanks +Manuel Zach). This mode is recommended for file systems holding data files since writes to those files are highly threaded background operations from the database perspective, which does not warrant the utmost latency requirements. Using logbias=throughput for data files frees up the storage resource and as a consequence, allows the WAL to be handled with lower latency, especially if the storage pool is used by many zones.
Conclusions
The following table summarizes my suggestions about ZFS tuning for every partition you should use:
FILE SYSTEM | RECORDSIZE | LOGBIAS | COMPRESSION |
data files | 64k | throughput | on |
WAL | 64k | latency | on |
backup | 512k | latency | off |
archive | max available | throughput | gzip-9 |
Lucene indexes | 8k | throughput | on |