May 22, 2015

Which gzip compression level for FASTQ files?

While implementing a pretty simple filter tool for gzipped FASTQ files, I noticed that the tool was much slower than expected. Profiling revealed that writing the gzipped stream with zlib was the bottleneck. The problem was that the default compression level, i.e. 6, is quite slow. Searching the web, I did not find any clues about which combination of compression level/strategy of gzip yields the optimal tradeoff between file size and speed for FASTQ files.

Thus, I did a little benchmarking. Here are the results for reading and compressing a 660MB FASTQ file on my laptop:

compression strategy   compression level   time [s]   size [MB]
Z_FILTERED             1                      23.98      205.73
Z_FILTERED             2                      27.21      197.48
Z_FILTERED             3                      39.46      187.54
Z_FILTERED             4                      46.50      179.35
Z_FILTERED             5                      65.69      174.62
Z_FILTERED             6                     124.77      167.14
Z_FILTERED             7                     182.67      164.17
Z_FILTERED             8                     318.48      161.31
Z_FILTERED             9                     464.42      160.29
Z_HUFFMAN_ONLY         1-9                    19.14      281.14
Z_RLE                  1-9                    21.04      241.57
Z_FIXED                1                      24.34      271.59
Z_FIXED                2                      26.67      254.54
Z_FIXED                3                      39.42      233.71
Z_FIXED                4                      32.91      232.42
Z_FIXED                5                      60.27      219.65
Z_FIXED                6                     121.26      206.17
Z_FIXED                7                     180.66      202.07
Z_FIXED                8                     313.01      200.06
Z_FIXED                9                     456.23      199.35

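The result is pretty clear. You probably want to use the Z_FILTERED strategy with the lowest compression level (1). Note that zlib's own default strategy is Z_DEFAULT_STRATEGY, which is not listed in the table above. If the file size is very important to you, compression levels up to 4 might be considered: level 4 roughly doubles the run time compared to level 1 (46.50 s vs. 23.98 s) and yields about 13% smaller output.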
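To make the tradeoff easier to compare, the table rows can be turned into throughput and relative size. The following is only a small sketch: the struct and function names are made up for illustration, and only a few Z_FILTERED rows are copied from the table.

```cpp
#include <cassert>
#include <cstdio>

// One row of the benchmark table (values copied from the post).
struct BenchmarkRow {
    const char* strategy;
    int level;
    double seconds;  // time to read and compress the file [s]
    double size_mb;  // size of the gzipped output [MB]
};

// Uncompressed input size as stated in the post.
const double kInputMb = 660.0;

// A few Z_FILTERED rows from the table; extend as needed.
const BenchmarkRow kRows[] = {
    {"Z_FILTERED", 1, 23.98, 205.73},
    {"Z_FILTERED", 4, 46.50, 179.35},
    {"Z_FILTERED", 6, 124.77, 167.14},
    {"Z_FILTERED", 9, 464.42, 160.29},
};
const int kRowCount = sizeof(kRows) / sizeof(kRows[0]);

// Input megabytes processed per second of run time.
double throughputMbPerSec(const BenchmarkRow& row) {
    assert(row.seconds > 0.0);
    return kInputMb / row.seconds;
}

// Output size of `row` as a fraction of the output size of `baseline`.
double sizeRelativeTo(const BenchmarkRow& row, const BenchmarkRow& baseline) {
    assert(baseline.size_mb > 0.0);
    return row.size_mb / baseline.size_mb;
}

// Prints one summary line per row, using level 1 as the size baseline.
void printTradeoff() {
    for (int i = 0; i < kRowCount; ++i) {
        std::printf("%s level %d: %.1f MB/s, %.0f%% of the level-1 size\n",
                    kRows[i].strategy, kRows[i].level,
                    throughputMbPerSec(kRows[i]),
                    100.0 * sizeRelativeTo(kRows[i], kRows[0]));
    }
}
```

Calling printTradeoff() from main() lists how quickly throughput drops at the higher levels while the output size shrinks only a little.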

Note: The runtime and file size for the Linux command line tool 'gzip' are similar. So, if you compress FASTQ with it, don't forget to add the '-1' argument for best speed.

Qt: Reading text files with or without QTextStream?

The question: Qt offers two ways to read text files. You always have to open the file using QFile, but then you can either use a QTextStream on the file or not. The most obvious difference is that the QTextStream returns the lines as QString, whereas QFile returns QByteArrays. QByteArrays should be faster - but how much? The documentation does not say anything about the performance. Thus, I compared the approaches and also evaluated the influence of where the line variable is declared - before the loop or inside the loop:

The code: Here is the code for the benchmark. It creates text files with 100000 lines of length 100 to 1000. Then it reads each file 10 times with each method:
   
 QTextStream(stdout) << "#line_length\t0 (file_array_outside)\t1 (file_array_inside)\t2 (file_string_outside)\t3 (file_string_inside)\t4 (stream_string_outside)\t5 (stream_string_inside)" << endl;  
 for (int ll=100; ll<1001; ll+=100)  
 {  
     QString output = QString::number(ll);  
    
     //create file to load  
     QFile data("input.txt");  
     data.open(QFile::WriteOnly);  
     QTextStream out(&data);  
     {  
         for (int i=0; i<100000; ++i)  
         {  
             out << QString(ll, 'c') << endl;  
         }  
     }  
     data.close();  
    
     //load  
     for (int mode=0; mode<6; ++mode)  
     {  
         long size = 0;  
         QTime timer;  
         timer.start();  
         for (int i=0; i<10; ++i)  
         {  
             QFile in("input.txt");  
             in.open(QFile::ReadOnly | QFile::Text);  
    
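             switch (mode)  
             {  
                 case 0: //QFile, QByteArray declared outside the loop  
                 {  
                    QByteArray line;  
                    while(!in.atEnd())  
                    {  
                        line = in.readLine();  
                        size += line.length();  
                    }  
                    break; //without the break, execution falls through into the next case  
                 }  
                 case 1: //QFile, QByteArray declared inside the loop  
                 {  
                    while(!in.atEnd())  
                    {  
                        QByteArray line = in.readLine();  
                        size += line.length();  
                    }  
                    break;  
                 }  
                 case 2: //QFile, QString declared outside the loop  
                 {  
                    QString line;  
                    while(!in.atEnd())  
                    {  
                        line = in.readLine();  
                        size += line.length();  
                    }  
                    break;  
                 }  
                 case 3: //QFile, QString declared inside the loop  
                 {  
                    while(!in.atEnd())  
                    {  
                        QString line = in.readLine();  
                        size += line.length();  
                    }  
                    break;  
                 }  
                 case 4: //QTextStream, QString declared outside the loop  
                 {  
                    QTextStream stream(&in);  
                    QString line;  
                    while(!stream.atEnd())  
                    {  
                        line = stream.readLine();  
                        size += line.length();  
                    }  
                    break;  
                 }  
                 case 5: //QTextStream, QString declared inside the loop  
                 {  
                    QTextStream stream(&in);  
                    while(!stream.atEnd())  
                    {  
                        QString line = stream.readLine();  
                        size += line.length();  
                    }  
                    break;  
                 }  
             }  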
             in.close();  
         }  
         output += "\t" + QString::number(timer.elapsed());  
     }  
     QTextStream(stdout) << output << endl;  
 }  

The result:

The benchmark results in milliseconds for Windows 7, Qt 5.1.1, MinGW 4.8.0 are:
line length   file_array_outside   file_array_inside   file_string_outside   file_string_inside   stream_string_outside   stream_string_inside
100                         1003                 855                  1315                 1337                    2198                   2252
200                         4735                 991                  1720                 1793                    4118                   4107
300                         4866                1144                  2141                 2258                    6064                   6060
400                         5218                1456                  2560                 2584                    8012                   8039
500                         5411                1479                  3077                 3244                   10585                  10426
600                         5770                1697                  3614                 3697                   12127                  12119
700                         6005                1857                  3976                 4026                   14097                  14010
800                         6529                2070                  4504                 4509                   16310                  16143
900                         6310                2261                  4926                 5061                   18308                  18268
1000                        6851                2521                  5560                 5454                   20427                  20226

The benchmark results in milliseconds for Ubuntu 12.04, Qt 5.1.0, GCC 4.6.3 are:
line length   file_array_outside   file_array_inside   file_string_outside   file_string_inside   stream_string_outside   stream_string_inside
100                          481                 384                   857                  880                     815                    814
200                          561                 467                  1436                 1216                    1489                   1487
300                          612                 540                  1873                 1560                    2206                   2200
400                          707                 599                  2309                 1866                    2780                   2771
500                          732                 655                  2696                 2231                    3416                   3408
600                          843                 740                  3171                 2629                    4104                   4071
700                         1371                 824                  3597                 2963                    4750                   4743
800                         1100                 930                  4063                 3361                    5567                   5954
900                         1324                1034                  4514                 3742                    6242                   6649
1000                        1254                1150                  4973                 4136                    7040                   7381

The conclusion:
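Reading text files with QFile alone is much faster than with QTextStream, and the gap grows with the line length. Using QByteArray for the line variable further speeds up reading the file: at line length 1000, QFile with a QByteArray declared inside the loop takes 2521 ms on Windows and 1150 ms on Linux, versus 20226 ms and 7381 ms for QTextStream. When reading with QFile, the line variable should be declared inside the loop. With QByteArray this is faster on both platforms (most clearly on Windows), and with QString it is faster under Linux. With QTextStream, the placement makes little difference.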

Qt: Workaround for the leaky QProcess output channel

It started with a pretty simple task. I wanted to run a QProcess and put the console output into a QTextEdit.

My initial solution was this:
 ### MainWindow.h ###  
 class MainWindow  
  : public QMainWindow  
 {  
  Q_OBJECT  
   
  public:  
   MainWindow(QWidget* parent = 0);  
   void execute(QString command);  
   
  private slots:  
   void printOutput();  
   
  private:  
   QTextEdit* text_edit; //creation of the widget is omitted here  
   QProcess process;   
 };  

 ### MainWindow.cpp ###  
 MainWindow::MainWindow(QWidget* parent)  
  : QMainWindow(parent)  
 {   
  process.setProcessChannelMode(QProcess::MergedChannels);  
  connect(&process, SIGNAL(readyReadStandardOutput()), this, SLOT(printOutput()));  
 }   
 void MainWindow::execute(QString command)  
 {  
  text_edit->clear();  
  process.start(command);  
 }  
 void MainWindow::printOutput()  
 {  
  //process is a member object and text_edit is a pointer  
  QString text = process.readAllStandardOutput();  
  text_edit->moveCursor(QTextCursor::End);  
  text_edit->insertPlainText(text);  
 }  

At first, it seemed to work fine. But then I noticed that sometimes the end of the output was not written to the QTextEdit. Some digging on the web revealed that this is quite a common problem. On Windows there seems to be a buffering problem that can truncate the output when piping it to something other than a file.

I tried many different suggestions from several forums (unbuffered output etc.), but in the end only writing the output to a file instead of reading it from the QProcess directly fixed the problem. Unfortunately, that requires a bit more work. We have to continuously check for new output in the file ourselves, e.g. triggered by a QTimer:

 ### MainWindow.h ###  
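 class MainWindow  
  : public QMainWindow  
 {  
  Q_OBJECT  
   
  public:  
   MainWindow(QWidget* parent = 0);  
   void execute(QString command);  
   
  private slots:  
   void printOutput();  
   void executeFinished();  
   void executeError(QProcess::ProcessError);  
   
  private:  
   QTextEdit* text_edit;  
   QProcess process;  
   QTimer process_timer;  
   QString process_file;  
   qint64 process_file_pos;  
 };   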

 ### MainWindow.cpp ###  
 MainWindow::MainWindow(QWidget* parent)  
  : QMainWindow(parent)  
 {   
  //QDir::temp() returns a QDir, so build the path with filePath()  
  process_file = QDir::temp().filePath("some_file_name.txt");  
  process.setProcessChannelMode(QProcess::MergedChannels);  
  process.setStandardOutputFile(process_file);  
  connect(&process, SIGNAL(finished(int,QProcess::ExitStatus)), this, SLOT(executeFinished()));  
  connect(&process, SIGNAL(error(QProcess::ProcessError)), this, SLOT(executeError(QProcess::ProcessError)));  
   
  process_timer.setInterval(100);  
  process_timer.setSingleShot(false);  
  connect(&process_timer, SIGNAL(timeout()), this, SLOT(printOutput()));  
 }  
 void MainWindow::execute(QString command)  
 {   
   QFile::remove(process_file);  
   process_file_pos = 0;  
   process.start(command);  
   process_timer.start();  
 }  
 void MainWindow::printOutput()  
 {  
  QFile file(process_file);  
  if (!file.open(QIODevice::ReadOnly)) return;  
   
  if (file.size()>process_file_pos)  
  {  
   file.seek(process_file_pos);  
   text_edit->moveCursor(QTextCursor::End);  
   text_edit->insertPlainText(file.readAll());   
   process_file_pos = file.pos();  
  }   
  file.close();  
 }   
 void MainWindow::executeFinished()  
 {  
  process_timer.stop();  
  printOutput();  
 }  
 void MainWindow::executeError(QProcess::ProcessError)  
 {  
   process_timer.stop();  
   printOutput();  
 }  
The code above is of course not complete. A working example project for QtCreator can be found here.
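The polling idea in printOutput() does not depend on Qt. As a rough sketch using only the C++ standard library (not the code from the example project), the function below returns whatever was appended to a file since the last call and advances the stored offset, just like process_file_pos above. The names readNewOutput, startFresh and appendText are made up for illustration; appendText merely stands in for the child process writing its output.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: create or empty the output file before a run
// (plays the role of QFile::remove() in execute()).
void startFresh(const std::string& path) {
    std::ofstream file(path.c_str(), std::ios::binary | std::ios::trunc);
}

// Hypothetical helper: stands in for the child process writing output.
void appendText(const std::string& path, const std::string& text) {
    std::ofstream file(path.c_str(), std::ios::binary | std::ios::app);
    file << text;
}

// Returns the bytes appended to `path` since offset `pos` and advances `pos`.
// Returns an empty string if the file cannot be opened or has not grown.
std::string readNewOutput(const std::string& path, std::streamoff& pos) {
    std::ifstream file(path.c_str(), std::ios::binary);
    if (!file) return std::string();

    // Determine the current file size.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    if (size <= pos) return std::string();

    // Read only the part that was not seen before.
    file.seekg(pos, std::ios::beg);
    std::string chunk(static_cast<std::size_t>(size - pos), '\0');
    file.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
    chunk.resize(static_cast<std::size_t>(file.gcount()));

    pos += static_cast<std::streamoff>(chunk.size());
    return chunk;
}
```

A timer callback would call readNewOutput() with the same offset variable on every tick and append the returned text to the widget. One more call after the process has finished picks up the remaining output, which is what executeFinished() does above.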

Any suggestions on how my approach can be simplified are welcome :)

May 10, 2012

Spanish to English Dictionary for Kindle 4

The good news:
The Evil Genius created a nice Spanish-English dictionary for the Kindle ebook reader:
http://www.evilgeniuschronicles.org/wordpress/2010/01/07/my-free-and-open-kindle-formatted-spanish-to-english-dictionary/


The bad news:
The dictionary is just what I was looking for to improve my Spanish. Unfortunately, it was created for Kindle 2, which uses only one dictionary at a time.
When using it on a Kindle 4 (which manages one dictionary for each language), it can be selected for English books only, which makes it pretty useless.

The solution:
The actual problem is the missing language metadata of the dictionary.
After several unsuccessful attempts to change the language metadata, I discovered the Java Mobi Metadata Editor: https://github.com/gluggy/Java-Mobi-Metadata-Editor
It is easy to use and lets you change the "dictionary input" and "dictionary output" language metadata.
Thus, the dictionary now works on Kindle 4, Kindle Touch and Kindle Paperwhite.
If you're also interested in the updated version of the dictionary, you can download it.

Update 2014-05-09:
It seems the dictionary cannot be used on the Kindle Fire: [link].