1.14.2012

Python vs. C++ in parsing large files

I have a handful of Python scripts that process some very large data files. They parse some bandwidth data and create dictionaries/hash tables of IPv4 addresses and the amount of data coming to them. The process is very slow, mainly because of disk I/O. I have recently decided to stop being such a C zealot and learn some C++, especially since the latest standard (C++11/C++0x) has added a bunch of goodies. Anyway, I wanted to compare the performance of a fast interpreted language, Python (strictly speaking, CPython compiles to bytecode and interprets it; it is not a JIT), to the 2nd fastest compiled language on the planet, C++. I must say, I thought for sure that C++ would really whoop Python, but the gap turned out to be much smaller than I expected in I/O-bound applications such as this.

I created two programs, one in Python and one in C++, and ran them on the same data file (approx. 1.6 GB of text). They were both run on an Intel Core i7 processor running Windows 7 x64. The version of Python used is 2.7.1, 64-bit. The .cpp file was compiled with TDM-GCC's MinGW-w64-based g++ (version 4.6.1) using the following switches:

g++ -std=c++0x -O2 -march=native -Wall -Wextra test.cpp

Instead of placing the code directly in this post, I have placed links so that you can download them since Blogger bastardizes code.

C++ code:
http://dl.dropbox.com/u/57289645/test.cpp

Python code:
http://dl.dropbox.com/u/57289645/test.py

Results:
C++: 24 seconds
Python: 45 seconds

Notes:
  • Python has an unfair advantage here: the file object you get from "with open(file, 'r') as fin:" uses a built-in read-ahead buffer when you iterate over it line by line. So the Python script is reading large chunks of the file into memory in order to parse the file faster, whereas the C++ program is just using normal file I/O.
  • Python on Windows is compiled using MSVC. The C++ program would probably be faster if compiled with MSVC, and I would expect it to be faster still if compiled with Intel's C++ compiler, icc, though I have not tested either.
  • Python is splitting each line into a list, whereas C++ is using a more brute-force method by directly searching for the tab characters and only extracting what is needed. Python could just as easily search for tabs, and you could certainly split the line into a list in C++.

In any case, I'm sure there are improvements that can be made to both pieces of code. The takeaway is that in disk I/O-bound applications, writing a program in a lower-level/faster/compiled language might not be as big a benefit as you'd think.
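
To make the third note concrete, here is a minimal Python sketch of the two ways of pulling fields out of a line: splitting the whole line into a list versus searching for the tab characters directly. It is not the code from test.py. The column layout (address in column 1, byte count in column 2) and all function names are assumptions made up for illustration, since the real file format isn't shown in this post.

```python
from collections import defaultdict

# ASSUMPTION: tab-separated lines with the IPv4 address in column 1 and
# the byte count in column 2. The real layout is not given in the post.


def fields_by_split(line):
    """Split the whole line into a list, then pick out two columns."""
    parts = line.rstrip("\n").split("\t")
    return parts[1], int(parts[2])


def fields_by_find(line):
    """Search for the tab positions and slice out only what is needed."""
    line = line.rstrip("\n")
    first = line.index("\t")               # end of column 0
    second = line.index("\t", first + 1)   # end of column 1 (the address)
    third = line.find("\t", second + 1)    # end of column 2, -1 if it is last
    if third == -1:
        third = len(line)
    return line[first + 1:second], int(line[second + 1:third])


def total_bytes_per_address(lines, extract=fields_by_split):
    """Build a dictionary mapping each address to its total byte count."""
    totals = defaultdict(int)
    for line in lines:
        address, nbytes = extract(line)
        totals[address] += nbytes
    return totals
```

Either extractor can be passed in, e.g. total_bytes_per_address(fin, fields_by_find) inside a "with open(...) as fin:" block, which makes it easy to time both on the same file. I have not measured which one wins; on an I/O-bound job the difference may well be lost in the noise.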

6.23.2011

VirtualBox 4.0.8 on a Windows 7 Host

...is terrible. I tried installing Lubuntu in both a VirtualBox 4.0.8 VM and a VMware Player 3.1.4 VM, and VMware Player completed the installation in nearly half the time VirtualBox did (45 mins vs. 80 mins). This is appalling because VirtualBox used to be much faster (on the same computer under Windows XP and Linux); perhaps it just performs poorly on Windows 7. In any case, VMware Player is a lifesaver since it's made this computer usable again.

I'll also add that I got a chance to use VMware Server at work and must say that it and VMware Player seem much more polished than VirtualBox. My two cents.