The Water Demo Performance -- Part 2 -- Tweak the Python Code

From PyWiki

Step 1 -- rewrite the calculateNormals routine

Let's make the Python code as fast as possible, starting by breaking down the first loop:

for i in range(self.numFaces):
   p0 = vinds[3*i]  
   p1 = vinds[3*i+1]  
   p2 = vinds[3*i+2]  
   v0= ogre.Vector3 (self.vertexBuffers[buf][3*p0], self.vertexBuffers[buf][3*p0+1], self.vertexBuffers[buf][3*p0+2]) 
   v1 = ogre.Vector3 (self.vertexBuffers[buf][3*p1], self.vertexBuffers[buf][3*p1+1], self.vertexBuffers[buf][3*p1+2]) 
   v2 = ogre.Vector3 (self.vertexBuffers[buf][3*p2], self.vertexBuffers[buf][3*p2+1], self.vertexBuffers[buf][3*p2+2]) 
   diff1  = v2 - v1  
   diff2 = v0 - v1  
   fn = diff1.crossProduct(diff2) 
   self.vNormals[p0] += fn  
   self.vNormals[p1] += fn  
   self.vNormals[p2] += fn

We know that self.numFaces = 2 * complexity * complexity. With complexity set to 64 (the default), that is 2 * 64 * 64 = 8192 passes through the loop per frame, which makes it a good place to optimise!
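The loop count follows directly from the grid size. A small sketch of the arithmetic, where faces_per_frame is a hypothetical helper name used only for illustration and is not part of the demo:

```python
def faces_per_frame(complexity):
    """Triangles in a complexity x complexity grid: two per grid cell."""
    return 2 * complexity * complexity

# Loop iterations per frame for the two settings discussed on this page
for c in (32, 64):
    print("%d -> %d" % (c, faces_per_frame(c)))
```

Halving complexity cuts the number of passes by a factor of four, so the grid size affects the frame rate far more than linearly.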

The first issue is likely to be that each pass creates (and deletes) 6 Vector3 objects (v0, v1, v2, diff1, diff2 and fn), and each one involves a call into the Ogre library (via boost etc). We are also using vNormals as a list of Ogre Vector3 objects.
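To get a rough feel for the cost of per-iteration object creation, here is a minimal sketch that needs no Ogre installation. Vec3 is a hypothetical pure-Python stand-in for ogre.Vector3, so it does not measure the real cost of crossing into the C++ library, which is probably higher. It only shows that allocating temporary objects is slower than working on plain floats.

```python
import timeit

class Vec3(object):
    """Hypothetical stand-in for ogre.Vector3 (illustration only)."""
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __sub__(self, other):
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)

def diff_with_objects(a, b):
    # Allocates three Vec3 objects: two operands plus the result
    d = Vec3(*a) - Vec3(*b)
    return (d.x, d.y, d.z)

def diff_with_floats(a, b):
    # Works on the floats directly; only the result tuple is created
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

a, b = (1.0, 2.0, 3.0), (0.5, 0.5, 0.5)
same_result = diff_with_objects(a, b) == diff_with_floats(a, b)

t_obj = timeit.timeit(lambda: diff_with_objects(a, b), number=20000)
t_flt = timeit.timeit(lambda: diff_with_floats(a, b), number=20000)
print("objects: %.4fs  floats: %.4fs" % (t_obj, t_flt))
```

The absolute numbers depend on the machine and interpreter, so treat them as indicative only.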

So let's replace Vector3 with simple Python arrays (after all, a Vector3 is just a way to hang onto 3 floats). This makes the code a little more complex but should improve performance. First, change the way we use vNormals from this:

## allocate space for normal calculation
self.vNormals=[]
for x in range ( self.numVertices ):
   self.vNormals.append(ogre.Vector3().ZERO )

to this:

## allocate space for normal calculation
self.vNormals=array.array('f')
for x in range ( self.numVertices * 3 ):
   self.vNormals.append(0)
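As an aside, the same zero-filled array can be built without a Python-level loop by repeating a one-element array. This is a sketch of an alternative rather than what the demo does, and num_vertices is a stand-in for self.numVertices:

```python
import array

num_vertices = 4  # stand-in for self.numVertices

# Loop version, equivalent to the allocation shown above
looped = array.array('f')
for _ in range(num_vertices * 3):
    looped.append(0)

# Repetition version: a single sequence-repeat instead of 3*N appends
repeated = array.array('f', [0.0]) * (num_vertices * 3)

print(looped == repeated)
```

Both forms produce an array of numVertices * 3 single-precision floats, laid out as x, y, z for each vertex.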

Now we rewrite the complete calculateNormals function to use plain Python objects. The code relies on the standard array and ctypes modules being imported at the top of the file.

def calculateNormals(self):
    ## zero normals
    for i in range(self.numVertices*3) :
        self.vNormals[i]=  0
 
    ## first, calculate normals for faces, add them to proper vertices
    # use helper function
    vinds = buffer ( self.indexBuffer)
    vinds.lock (0, self.indexBuffer.getSizeInBytes(), ogre.HardwareBuffer.HBL_READ_ONLY)
 
    pNormals = self.normVertexBuffer.lock(
        0, self.normVertexBuffer.getSizeInBytes(), ogre.HardwareBuffer.HBL_DISCARD) 
    pNormalsAddress=(ctypes.c_float * (self.normVertexBuffer.getSizeInBytes()*3)).from_address(ogre.castAsInt(pNormals))
 
    # make life easier (and faster) by using local variables
    buf = self.vertexBuffers[self.currentBufNumber]
    vNormals = self.vNormals
 
    ## AJM so here's a case where accessing a C++ object from python shows a performance hit !!
    for count in range(self.numFaces) :
        p0 = vinds[3*count]  
        p1 = vinds[3*count+1]  
        p2 = vinds[3*count+2]  
        # this is slow
        # v0= ogre.Vector3 (self.vertexBuffers[buf][3*p0], self.vertexBuffers[buf][3*p0+1], self.vertexBuffers[buf][3*p0+2]) 
        # v1 = ogre.Vector3 (self.vertexBuffers[buf][3*p1], self.vertexBuffers[buf][3*p1+1], self.vertexBuffers[buf][3*p1+2]) 
        # v2 = ogre.Vector3 (self.vertexBuffers[buf][3*p2], self.vertexBuffers[buf][3*p2+1], self.vertexBuffers[buf][3*p2+2]) 
 
        # so use python arrays instead of Vector3's
        i0 = 3*p0
        i1 = 3*p1
        i2 = 3*p2
        v0 = [buf[i0], buf[i0+1], buf[i0+2]]
        v1 = [buf[i1], buf[i1+1], buf[i1+2]] 
        v2 = [buf[i2], buf[i2+1], buf[i2+2]] 
 
        # Do the vector subtraction by 'hand' instead of the original
        # diff1 = v2 - v1
        # diff2 = v0 - v1
        diff1 = [v2[0]-v1[0], v2[1]-v1[1], v2[2]-v1[2]]
        diff2 = [v0[0]-v1[0], v0[1]-v1[1], v0[2]-v1[2]]
 
        # and now we need to do a crossProduct by hand..
        # fn = ogre.Vector3(*diff1).crossProduct(ogre.Vector3(*diff2)) 
        fn = [diff1[1] * diff2[2] - diff1[2] * diff2[1],
            diff1[2] * diff2[0] - diff1[0] * diff2[2],
            diff1[0] * diff2[1] - diff1[1] * diff2[0]]
        # And of course now add the values into the normals
        # self.vNormals[p0] += fn  
        # self.vNormals[p1] += fn  
        # self.vNormals[p2] += fn  
        vNormals[i0] += fn[0] 
        vNormals[i0+1] += fn[1] 
        vNormals[i0+2] += fn[2] 
        vNormals[i1] += fn[0] 
        vNormals[i1+1] += fn[1] 
        vNormals[i1+2] += fn[2] 
        vNormals[i2] += fn[0] 
        vNormals[i2+1] += fn[1] 
        vNormals[i2+2] += fn[2] 
 
    ## now normalize vertex normals
    complexity = self.complexity
    for y in range(complexity) :
        for x in range(complexity) :
            numPoint = y*(complexity+1) + x  
            v = 3*numPoint
            n = ogre.Vector3(vNormals[v],vNormals[v+1],vNormals[v+2])
            n.normalise()
            pNormalsAddress [v] = n.x
            pNormalsAddress [v+1] = n.y
            pNormalsAddress [v+2] = n.z
 
    self.indexBuffer.unlock() 
    self.normVertexBuffer.unlock()
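The accumulation step can be checked in isolation, without Ogre or any hardware buffers. The sketch below follows the same definitions as the original loop (diff1 = v2 - v1, diff2 = v0 - v1, fn = diff1 x diff2). The function name accumulate_face_normals and its arguments are hypothetical, with plain Python lists standing in for the locked buffers:

```python
import array

def accumulate_face_normals(positions, indices, num_vertices):
    """Sum unnormalised face normals into a flat per-vertex array.

    positions: flat sequence of floats, x, y, z per vertex
    indices:   flat sequence of ints, three vertex indices per triangle
    """
    normals = array.array('f', [0.0]) * (num_vertices * 3)
    for face in range(len(indices) // 3):
        # Offsets of the three corner vertices in the flat position array
        i0 = 3 * indices[3 * face]
        i1 = 3 * indices[3 * face + 1]
        i2 = 3 * indices[3 * face + 2]
        # diff1 = v2 - v1 and diff2 = v0 - v1, component by component
        ax = positions[i2] - positions[i1]
        ay = positions[i2 + 1] - positions[i1 + 1]
        az = positions[i2 + 2] - positions[i1 + 2]
        bx = positions[i0] - positions[i1]
        by = positions[i0 + 1] - positions[i1 + 1]
        bz = positions[i0 + 2] - positions[i1 + 2]
        # Cross product diff1 x diff2
        nx = ay * bz - az * by
        ny = az * bx - ax * bz
        nz = ax * by - ay * bx
        # Add the face normal to each of its three vertices
        for base in (i0, i1, i2):
            normals[base] += nx
            normals[base + 1] += ny
            normals[base + 2] += nz
    return normals

# One triangle lying in the XZ plane
positions = [0.0, 0.0, 0.0,   1.0, 0.0, 0.0,   0.0, 0.0, 1.0]
indices = [0, 1, 2]
normals = accumulate_face_normals(positions, indices, 3)
print(list(normals))
```

For this single triangle every vertex receives the same face normal, (0, -1, 0), which is perpendicular to the XZ plane as expected. Running a small case like this is a quick way to catch index slips in the hand-written subtraction and cross product.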

This takes the frame rate up to 13 FPS (complexity == 64 and psyco enabled), a big improvement over the previous figure of under 3 FPS, but still well short of the C++ version. With complexity == 32 we get nearly 50 FPS.

However, we need to take this a step further.
