I wrote two programs, one in Python and one in C++, and ran them on the same data file (approx. 1.6 GB of text). Both were run on an Intel Core i7 machine running Windows 7 x64. The Python version was 2.7.1 (64-bit). The .cpp file was compiled with TDM-GCC's MinGW-w64-based g++ (version 4.6.1) using the following switches:
g++ -std=c++0x -O2 -march=native -Wall -Wextra test.cpp
Rather than pasting the code directly into this post, I have linked to the files so that you can download them, since Blogger mangles code.
C++ code:
http://dl.dropbox.com/u/57289645/test.cpp
Python code:
http://dl.dropbox.com/u/57289645/test.py
Results:
C++: 24 seconds
Python: 45 seconds
Notes:
- Python has an unfair advantage here: the file object opened by the "with open(file, 'r') as fin:" construct uses a built-in read-ahead buffer. The Python script therefore reads large chunks of the file into memory at a time, which speeds up parsing, whereas the C++ program just uses ordinary file I/O.
- Python on Windows is compiled with MSVC. The C++ program would probably be faster if it were compiled with MSVC as well, and I would expect it to be faster still if compiled with Intel's C++ compiler, icc.
- Python splits each line into a list, whereas the C++ program uses a more brute-force method: it searches directly for the tab characters and extracts only what is needed. Python could just as easily search for tabs, just as the C++ program could split each line into a list.
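
To make the last note concrete, here is a minimal sketch of the two parsing strategies side by side in Python. It is not taken from the linked test.py; the function names and the sample line are made up for illustration, and it only shows how "split into a list" differs from "search for the tabs and slice out one field".

```python
def field_by_split(line, index):
    """Split the whole line on tabs, then pick one field from the list."""
    return line.rstrip("\n").split("\t")[index]


def field_by_find(line, index):
    """Scan for tab positions and slice out only the requested field."""
    line = line.rstrip("\n")
    start = 0
    # Skip over the first `index` tab characters.
    for _ in range(index):
        pos = line.find("\t", start)
        if pos == -1:
            raise IndexError("line has fewer than %d fields" % (index + 1))
        start = pos + 1
    # The field ends at the next tab, or at the end of the line.
    end = line.find("\t", start)
    return line[start:] if end == -1 else line[start:end]


# Hypothetical sample line, not from the real data file.
sample = "alpha\tbeta\tgamma\n"
print(field_by_split(sample, 1))
print(field_by_find(sample, 1))
```

Both functions return the same field. The first builds a full list for every line, while the second touches only as much of the line as it needs, which is closer to what the C++ program does. Which one is faster in Python would have to be measured.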