Wednesday, August 3, 2016

Reading an ASCII file in Python 3.5 is 2-3x faster as bytes than as a string

I wrote an essay comparing read performance in Python 3.5 between bytes and Unicode text, where I tried different settings for the 'newline' and 'encoding' options of open(). I concluded that I couldn't get the Unicode string performance to within a factor of 2 of the bytes performance, so chemfp will be working with bytes, not strings.
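To make the comparison concrete, here is a minimal sketch of the kind of timing test involved. It is not the benchmark from the essay: the file contents, the helper names (make_ascii_file, count_lines_bytes, count_lines_text, timed) and the line count are all made up for illustration, and the timings you see will depend on your machine and Python version.

```python
import os
import tempfile
import time


def make_ascii_file(num_lines=200000):
    # Write a temporary pure-ASCII file of made-up "SMILES id" records.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        for i in range(num_lines):
            f.write(b"c1ccccc1O record%d\n" % i)
    return path


def count_lines_bytes(path):
    # Binary mode: each line is a bytes object, with no decoding step.
    n = 0
    with open(path, "rb") as f:
        for line in f:
            n += 1
    return n


def count_lines_text(path, encoding="ascii", newline="\n"):
    # Text mode: each line is decoded to str; 'encoding' and 'newline'
    # are the two open() options being varied.
    n = 0
    with open(path, "r", encoding=encoding, newline=newline) as f:
        for line in f:
            n += 1
    return n


def timed(func, *args, **kwargs):
    # Return (result, elapsed seconds) for a single call.
    t0 = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - t0


if __name__ == "__main__":
    path = make_ascii_file()
    try:
        _, t_bytes = timed(count_lines_bytes, path)
        print("bytes: %.3f s" % t_bytes)
        for encoding in ("ascii", "latin1", "utf-8"):
            for newline in (None, "\n"):
                _, t_text = timed(count_lines_text, path, encoding, newline)
                print("text encoding=%r newline=%r: %.3f s"
                      % (encoding, newline, t_text))
    finally:
        os.remove(path)
```

A single timed pass like this is noisy, so repeat each measurement a few times and take the best one before drawing any conclusions from the ratio.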

I also checked how the RDKit handles invalid Unicode, to see what another toolkit did for the same problem. I concluded that it uses bytes internally and exposes strings, which causes problems if those bytes cannot be converted to strings.
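The underlying problem can be shown in plain Python, without any RDKit calls. The sketch below is only an illustration of what happens when bytes that are not valid UTF-8 have to become a string; the sample bytes and the try_decode helper are made up, and nothing here is meant to describe RDKit's actual code.

```python
# Latin-1 encoded "café"; the trailing 0xE9 byte is not valid UTF-8 here.
raw = b"caf\xe9"


def try_decode(data, encoding="utf-8"):
    # Return the decoded string, or None if the bytes are not valid
    # in the given encoding.
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Strict decoding fails, so there is no string to hand back to the caller.
strict = try_decode(raw)

# errors="replace" gives a string, but the original byte is lost.
lenient = raw.decode("utf-8", errors="replace")

# errors="surrogateescape" keeps enough information to get the bytes back.
escaped = raw.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

print(strict)
print(repr(lenient))
print(roundtrip == raw)
```

An API that only works with bytes never has to pick one of these three behaviors, which is part of the appeal of staying with bytes.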

