The first time I analyzed OrientDB performance in relation to ZFS tuning, I used a very simple "import database" experiment.
This time, thanks to +Luigi Dell'Aquila from Orient Technologies LTD, we can analyze an (almost) real workload: filling an empty database from scratch with more than a million vertices and edges.
I used the same environment (host, disks, OS, JDK, ...) as in the first analysis, but with OrientDB 1.7.4 Community Edition.
For general background, this reading was useful.
Workload description
The assumption is that a graph can be expressed as a series of triplets, each representing a "Vertex → Edge → Vertex" relation.
From the DBpedia web site you can download many RDF files representing the Italian version of Wikipedia. My choice was the first file in the list, "Article Categories", whose first lines are shown below as an example:
<http://it.dbpedia.org/resource/Anni_1940> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1900> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
<http://it.dbpedia.org/resource/Anni_1930> <http://purl.org/dc/terms/subject> <http://it.dbpedia.org/resource/Categoria:Decenni_del_XX_secolo>
An "Importer" Java program reads each line, checks whether the first and third elements already exist, creates the vertices if needed, and then creates the edge between them.
Here is the main source code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import com.orientechnologies.orient.core.index.OIndexManagerProxy;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OClass.INDEX_TYPE;
import com.orientechnologies.orient.core.metadata.schema.OSchema;
import com.orientechnologies.orient.core.metadata.schema.OType;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class Importer {

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.out.println("usage: Importer <fileName> <dbUrl> <user> <password>");
            return;
        }
        creaSchema(args[0], args[1], args[2], args[3]);
        importa(args[0], args[1], args[2], args[3]);
    }

    private static void creaSchema(String fileName, String dbUrl, String user, String pw) throws Exception {
        OrientGraphNoTx db = new OrientGraphNoTx(dbUrl, user, pw);
        OSchema schema = db.getRawGraph().getMetadata().getSchema();
        OIndexManagerProxy indexMgr = db.getRawGraph().getMetadata().getIndexManager();

        // Vertex class "Nodo" with a unique index on "name"
        OClass nodo;
        if (schema.existsClass("Nodo")) {
            nodo = schema.getClass("Nodo");
        } else {
            nodo = schema.createClass("Nodo", schema.getClass("V"));
        }
        if (!nodo.existsProperty("name")) {
            nodo.createProperty("name", OType.STRING);
        }
        if (!indexMgr.existsIndex("Nodo_name")) {
            nodo.createIndex("Nodo_name", INDEX_TYPE.UNIQUE, "name");
        }

        // Edge class "Arco"
        OClass arco;
        if (schema.existsClass("Arco")) {
            arco = schema.getClass("Arco");
        } else {
            arco = schema.createClass("Arco", schema.getClass("E"));
        }
        if (!arco.existsProperty("name")) {
            arco.createProperty("name", OType.STRING);
        }
        db.shutdown();
    }

    private static void importa(String fileName, String dbUrl, String user, String pw) throws Exception {
        File file = new File(fileName);
        FileInputStream input = new FileInputStream(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        OrientGraph gdb = new OrientGraph(dbUrl, user, pw);
        reader.readLine(); // skip the header line
        String result = reader.readLine();
        while (result != null && result.trim().length() > 0) {
            String[] parts = result.split(" ");
            if (parts.length > 3 && parts[3].equals(".")) {
                Vertex vertex1 = createVertex(parts[0], gdb);
                Vertex vertex2 = createVertex(parts[2], gdb);
                Edge edge = vertex1.addEdge("Arco", vertex2);
                edge.setProperty("name", parts[1]);
                gdb.commit();
            }
            result = reader.readLine();
        }
        gdb.commit();
        gdb.shutdown();
        reader.close();
    }

    private static Vertex createVertex(String name, OrientGraph gdb) {
        // Return the existing vertex, if any; otherwise create a new one
        Iterable<Vertex> queryResult = gdb.command(new OSQLSynchQuery<Vertex>("select from Nodo where name = ?")).execute(name);
        for (Vertex v : queryResult) {
            return v;
        }
        Vertex result = gdb.addVertex("class:Nodo");
        result.setProperty("name", name);
        return result;
    }
}
Before starting the loading process, the Importer program checks whether the basic schema already exists, and creates it if it does not.
Here are the instructions if you prefer to create the schema using the OrientDB console:
CREATE DATABASE remote:localhost/dbpedia root root plocal
CONNECT remote:localhost/dbpedia admin admin
CREATE CLASS Nodo EXTENDS V
CREATE PROPERTY Nodo.name STRING
CREATE INDEX Nodo_name on Nodo(name) UNIQUE_HASH_INDEX
CREATE CLASS Arco EXTENDS E
CREATE PROPERTY Arco.name STRING
Finally, you can execute the Java program:
java -cp "$ORIENTDB_HOME/lib/orientdb-server-1.7.4.jar:$ORIENTDB_HOME/lib/*:." Importer /opt/article-categories.txt remote:localhost/dbpedia admin admin
When the Importer Java program finishes, you have quite a lot of records:
orientdb {dbpedia}> select count(*) from V

----+------+------
#   |@RID  |count
----+------+------
0   |#-1:-1|942949
----+------+------

orientdb {dbpedia}> select count(*) from E

----+------+-------
#   |@RID  |count
----+------+-------
0   |#-1:-1|1390193
----+------+-------
Workload characterization
Being a single-threaded process, we expect the "Importer" program to load the host within its limits, without generating queues on its main resources.
For the purposes of this article a deep workload analysis is not very interesting but, since the whole process lasts around 1 hour, I ran the "Importer" program while capturing system activity with sar, recording metrics every 60 seconds for 60 minutes:
sar -A -o temp 60 60
First, let's verify the disk load:
root@globalZone:~# sar -d -f temp
[...]
Average   device   %busy   avque   r+w/s   blks/s   avwait   avserv
        adpu3200       0     0.0       0        0      0.0      0.0
        adpu3201      18     1.2      67    18397      0.0     17.8
        sd0           14     1.1      35    14287      0.0     30.7
        sd0,a          0     0.0       0        0      0.0      0.0
        sd0,b         14     1.1      35    14287      0.0     31.1
        sd1            5     0.1      29     4110      0.0      3.8
        sd1,a          5     0.1      29     4110      0.0      3.8
In particular, the column "avwait" (average wait time) confirms our assumption.
Further confirmation comes from the "queue length" and "CPU utilization" reports:
root@globalZone:~# sar -q -f temp
[...]
          runq-sz  %runocc  swpq-sz  %swpocc
Average       1.3       35      0.0        0
root@globalZone:~# sar -u -f temp
[...]
          %usr  %sys  %wio  %idle
Average     48    14     0     38
Getting started
OrientdbZone configuration
I have two disks, each in its own zpool, and each zpool is dedicated to a specific function: data files and Write Ahead Log:
root@globalZone:~/# zpool status
  pool: databases
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        databases   ONLINE       0     0     0
          c7t1d0    ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7t0d0    ONLINE       0     0     0

errors: No known data errors
I created a ZFS file system on each zpool, without any special options:
root@globalZone:~/# zfs create -o mountpoint=legacy databases/wal
root@globalZone:~/# zfs create -o mountpoint=legacy rpool/datafiles
I configured my orientdbZone in order to properly use these zfs file systems:
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=rpool/datafiles
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/databases
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/wal
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/wal
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
Before every test, I shut down the orientdbZone, destroyed and recreated the ZFS file systems, and rebooted the orientdbZone.
Preliminary analysis
Using this simple DTrace program, I ran the "Importer" Java program a first time just to gather information about the page sizes written on the file systems:
#! /usr/sbin/dtrace -Zs
syscall::*write*:entry
{
self->fd = arg0;
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
@distrib[fds[self->fd].fi_fs, probefunc, fds[self->fd].fi_mount] = quantize(arg1);
}
syscall::*write*:return
/(fds[self->fd].fi_mount == "/path/to/orientdb/databases") || (fds[self->fd].fi_mount == "/path/to/orientdb/wal")/
{
self->fd = 0;
}
dtrace:::END
{
printa("\n %s %s %s %@d", @distrib);
}
The distribution is quite regular:
CPU     ID                    FUNCTION:NAME
  1      2                             :END

  zfs write /path/to/orientdb/databases
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    60
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |@@@                                      5
              64 |                                         0

  zfs write /path/to/orientdb/wal
           value  ------------- Distribution ------------- count
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 102876
          131072 |                                         0

  zfs pwrite /path/to/orientdb/databases
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         5
               2 |                                         0
               4 |                                         0
               8 |                                         5
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 289755
          131072 |                                         0
So, this time there can be no doubt: we have to set zfs recordsize=64k.
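As a quick sketch, that distribution translates into the following commands, assuming the pool and file-system names created earlier (rpool/datafiles and databases/wal); adjust them to your own layout:

```shell
# 64 KB records match the dominant write size observed in the DTrace output
zfs set recordsize=64k rpool/datafiles
zfs set recordsize=64k databases/wal
```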
As verified with the previous analysis, if you are using a dedicated WAL file system, the most important factor is the write() time to this file system.
Test #1: Compression
Having fixed the workload and related ZFS recordsize, I tried to compare different behaviors with different compression strategies.
OrientDB has its own compression strategies, based on snappy or gzip libraries.
ZFS has different compression algorithms:
- lzjb - optimized for performance
- gzip-N (1 ≤ N ≤ 9) - best compression if N > 5
- zle - fast, but only compresses sequences of zeros
Of course we have to evaluate the compression ratio, related to the whole execution time:
| | Total Import Time (s) | WAL disk usage (MB) | Data files disk usage (MB) |
|---|---|---|---|
| zfs compression=on | 3648 | 1911 | 242 |
| zfs compression=gzip | 4831 | 1147 | 146 |
| zfs default, OrientDB gzip=on | 4568 | 4090 | 838 |
| zfs default, OrientDB snappy=on | 3692 | 4090 | 813 |
| zfs compression=zle | 3536 | 2803 | 594 |
The OrientDB compression strategies don't seem very useful, either in execution time or in disk space saved.
In general we see that I/O time isn't very important for this kind of workload: the time saved on writes is balanced by the time spent on compression, so we cannot save much execution time, but we can save a lot of disk space.
The best compromise is to use ZFS default compression, combined with OrientDB default settings.
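As a sketch, applying that compromise amounts to the commands below (the file-system names are the ones created earlier in this article; on this Solaris release, compression=on selects the default lzjb algorithm):

```shell
# ZFS default compression on data files and WAL;
# OrientDB compression stays at its default (off)
zfs set compression=on rpool/datafiles
zfs set compression=on databases/wal

# Verify the achieved ratio after a load
zfs get compressratio rpool/datafiles databases/wal
```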
Test #2: Backup
When it comes to backups, too, you have to compare OrientDB and ZFS compression strategies.
Here is the ODB config file part related to the backup options:
<handler class="com.orientechnologies.orient.server.handler.OAutomaticBackup">
<parameters>
<parameter name="enabled" value="false" />
<parameter name="delay" value="4h" />
<parameter name="target.directory" value="backup" />
<parameter name="target.fileName" value="${DBNAME}-${DATE:yyyyMMddHHmmss}.zip" />
<parameter name="db.include" value="" />
<parameter name="db.exclude" value="" />
<parameter name="compressionLevel" value="9"/>
<parameter name="bufferSize" value="1048576"/>
</parameters>
</handler>
The default compression level is set to the maximum value (value=0 means no compression).
First of all, we have to verify the page size distribution when OrientDB saves the backup file:
root@globalZone:~/# ./rw_bytesBackup.d
dtrace: script './rw_bytesBackup.d' matched 13 probes
CPU     ID                    FUNCTION:NAME
  0      2                             :END

  zfs write /zones/orientdbZone/root/opt/orientdb-community-1.7.4/backup
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |                                         1
          262144 |                                         0
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 811
         1048576 |                                         0
OrientDB saves the backup file writing only 512 KB pages.
How much time can we save using the right recordsize for the ZFS backup partition?
Not so much...
Since the OAutomaticBackup class puts the data files in read-only mode, my opinion is that we should complete the backup process as soon as possible; therefore we shouldn't use compression at all:
| | exec time (ms) |
|---|---|
| benchmark | 287872 |
| ZFS compression=on | 31803 |
| ZFS compression=gzip-9 | 41654 |
| NO ZFS compression | 25704 |
| NO ZFS compression, recordsize=512k | 20333 |
| if data files are compressed | 15024 |
But, what about disk space?
We cannot waste so much disk space, so my suggestion is to create a new ZFS partition with the most powerful compression algorithm (compression=gzip-9) alongside the backup partition and then, using a scheduled script, move the backup files into the archive partition:
root@globalZone:~/# zfs create -o mountpoint=legacy -o compression=gzip-9 databases/archive
root@globalZone:~# zonecfg -z orientdbZone
zonecfg:orientdbZone> add fs
zonecfg:orientdbZone:fs> set type=zfs
zonecfg:orientdbZone:fs> set special=databases/archive
zonecfg:orientdbZone:fs> set dir=/opt/orientdb/archive
zonecfg:orientdbZone:fs> end
zonecfg:orientdbZone> verify
zonecfg:orientdbZone> commit
zonecfg:orientdbZone> exit
root@globalZone:~# zoneadm -z orientdbZone shutdown
root@globalZone:~# zoneadm -z orientdbZone boot
root@globalZone:~# zlogin orientdbZone
root@orientdbZone:~# du -k /opt/orientdb/backup/dbpedia-20140625182046.json.gz
831031 /opt/orientdb/backup/dbpedia-20140625182046.json.gz
root@orientdbZone:~# time mv /opt/orientdb/backup/dbpedia-20140625182046.json.gz /opt/orientdb/archive/
real 2m19.710s
user 0m0.004s
sys 0m5.678s
root@orientdbZone:~# du -k /opt/orientdb/archive/dbpedia-20140625182046.json.gz
98764 /opt/orientdb/archive/dbpedia-20140625182046.json.gz
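A minimal sketch of such a scheduled script, assuming the backup and archive mount points used above (the function name is mine, not OrientDB's):

```shell
#!/bin/sh
# Move completed backup files from the fast, uncompressed backup partition
# to the gzip-9 compressed archive partition; ZFS recompresses the blocks
# as they are written to the destination file system.
archive_backups() {
    backup_dir="$1"
    archive_dir="$2"
    for f in "$backup_dir"/*; do
        [ -f "$f" ] || continue   # skip when the glob matches nothing
        mv "$f" "$archive_dir"/
    done
}

# Typical invocation (e.g. from cron), using the mount points above:
archive_backups /opt/orientdb/backup /opt/orientdb/archive
```

Scheduling it shortly after each OAutomaticBackup run keeps the backup partition almost empty.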
This way the compression process is decoupled and becomes asynchronous, and you can achieve the best performance in both execution time and space saved.
Of course, you can also use an external ZFS storage mounted via NFS.
Here we can underline one important difference between Solaris ZFS and OpenZFS, used in illumos-based systems, FreeBSD and Mac OS X.
Whereas OpenZFS has some useful extra features, the Solaris version gives you the opportunity to use a larger recordsize, which is very useful for this kind of task:
root@globalZone:~/# zfs set recordsize=1M databases/archive
root@orientdbZone:~/# time mv dbpedia-20140627101455.json.gz /opt/orientdb/archive/
real 0m17.464s
user 0m0.004s
sys 0m6.186s
Another, more expensive, strategy is to have a dedicated backup instance in a distributed topology, so that the main instances never slow down because of the backup process.
Test #3: Lucene indexes
A few weeks ago +Enrico Risa developed an OrientDB plugin which gives you the opportunity to use the Lucene engine for full-text and spatial indexes.
Although it isn't ready for production environments yet, I've found his work very interesting, and it surely is for my projects!
Let's start investigating its behavior.
As always, the first run is for the page size distribution analysis during index creation, having previously created a dedicated ZFS partition:
orientdb> connect remote:localhost/dbpedia admin admin
Connecting to database [remote:localhost/dbpedia] with user 'admin'...OK
orientdb {dbpedia}> CREATE INDEX Nodo.content on Nodo (name) FULLTEXT ENGINE LUCENE;
Creating index...
2014-06-27 14:43:48:380 INFO - Rebuilding index dbpedia.Nodo.content (estimated 942949 items)... [OIndexRebuildOutputListener]
2014-06-27 14:43:58:383 INFO --> 1.13% progress, 10,653 indexed so far (1,065 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:10:663 INFO --> 12.20% progress, 115,065 indexed so far (10,441 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:20:716 INFO --> 26.91% progress, 253,784 indexed so far (13,871 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:30:717 INFO --> 44.41% progress, 418,770 indexed so far (16,498 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:40:718 INFO --> 60.86% progress, 573,886 indexed so far (15,511 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:44:50:719 INFO --> 80.00% progress, 754,379 indexed so far (18,049 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:00:849 INFO --> 98.30% progress, 926,889 indexed so far (17,251 items/sec) [OIndexRebuildOutputListener]
2014-06-27 14:45:01:896 INFO --> OK, indexed 942,949 items in 73,516 ms [OIndexRebuildOutputListener]
Created index successfully with 942949 entries in 74.220001 sec(s).
Index created successfully
root@globalZone:~/# ./rw_bytesIndex.d
dtrace: script './rw_bytesIndex.d' matched 13 probes
CPU     ID                    FUNCTION:NAME
  0      2                             :END

  zfs write /your/path/to/dbpedia/luceneIndexes
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |                                         2
              16 |                                         5
              32 |                                         30
              64 |                                         1
             128 |                                         23
             256 |                                         13
             512 |                                         14
            1024 |                                         30
            2048 |                                         11
            4096 |                                         72
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8983
           16384 |                                         0
So we have to use recordsize=8k for this partition.
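As a sketch, assuming a dedicated file system for the Lucene index directory (the name databases/luceneIndexes is hypothetical; use whatever you created):

```shell
# 8 KB records match Lucene's dominant write size for this workload,
# and compression pays off here too
zfs set recordsize=8k databases/luceneIndexes
zfs set compression=on databases/luceneIndexes
```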
Here is the performance comparison among different ZFS tunings:
| | creation time (ms) | disk usage (KB) |
|---|---|---|
| recordsize=8k | 82738 | 36800 |
| default | 82963 | 36800 |
| compression=on | 82028 | 22764 |
Writing only 36 MB in 82 seconds doesn't let you observe the gain from having the right recordsize, because the process spends most of its time reading the source and applying the indexing algorithm.
In particular, our source isn't very suitable for this test, because every indexed property is a single word, with underscores instead of spaces.
In fact, the whole test isn't very meaningful, except for the disk space you can save, because, in my experience, a full-text index must be very fast at reading, not at writing.
Imagine you want to index the content of hundreds of thousands of PDF documents: you may accept asynchronous loading (and indexing) of new documents, but you surely want a fast search system.
In this case, my suggestion is to use a dedicated SSD device.
Extra considerations
- Using the ZFS file system can reduce the cost and complexity of managing data and storage devices. Storage product capabilities often overlap ZFS features; my suggestion is to rely on the ZFS ones, because they give you more flexibility. But if you are comfortable with traditional hardware RAID protection, then use it!
- Since the ZFS copy-on-write feature relocates all writes to free disk space, keeping a certain amount of free blocks allows ZFS to easily find space in big chunks, to aggregate writes, and to reduce the write IOPS demand on hard disks.
- ZFS manages synchronous writes internally with the ZFS intent log (ZIL). By default, synchronous writes are stored in the ZIL at low latency and shortly thereafter (within approximately 5 seconds), ZFS does a bulk update of the pool state by committing a transaction group. Another way exists for the ZIL to handle synchronous write activity, which is of slightly higher latency, but also involves almost half as much traffic on the channel to the storage array. This mode is triggered by the logbias=throughput value (thanks +Manuel Zach). This mode is recommended for file systems holding data files since writes to those files are highly threaded background operations from the database perspective, which does not warrant the utmost latency requirements. Using logbias=throughput for data files frees up the storage resource and as a consequence, allows the WAL to be handled with lower latency, especially if the storage pool is used by many zones.
Conclusions
The following table summarizes my suggestions about ZFS tuning for every partition you should use:
| FILE SYSTEM | RECORDSIZE | LOGBIAS | COMPRESSION |
|---|---|---|---|
| data files | 64k | throughput | on |
| WAL | 64k | latency | on |
| backup | 512k | latency | off |
| archive | max available | throughput | gzip-9 |
| Lucene indexes | 8k | throughput | on |
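As a closing sketch, the table translates into commands like these. The names assume the pools used throughout this article; databases/backup and databases/luceneIndexes are hypothetical placeholders for whatever file systems you actually dedicate to those roles:

```shell
# data files: background flushes, latency not critical
zfs set recordsize=64k rpool/datafiles
zfs set logbias=throughput rpool/datafiles
zfs set compression=on rpool/datafiles

# WAL: latency-sensitive synchronous writes
zfs set recordsize=64k databases/wal
zfs set logbias=latency databases/wal
zfs set compression=on databases/wal

# backup: finish as fast as possible, no compression
zfs set recordsize=512k databases/backup
zfs set logbias=latency databases/backup
zfs set compression=off databases/backup

# archive: asynchronous, maximum compression
zfs set recordsize=1M databases/archive   # 1M requires Solaris ZFS
zfs set logbias=throughput databases/archive
zfs set compression=gzip-9 databases/archive

# Lucene indexes
zfs set recordsize=8k databases/luceneIndexes
zfs set logbias=throughput databases/luceneIndexes
zfs set compression=on databases/luceneIndexes
```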