Python itertools with izip and count

Tuesday, 22nd March 2011 - Michael Halls-Moore - 0 Comments

A common usage scenario in Python occurs when retrieving records from a database and adding them into an appropriate data structure. One such structure involves a list of dictionaries, with each record in the DB represented by an element of the list, with a key-value pair for each field in the dictionary. Here is an example for a stock price time series:

[{'timestamp': datetime(2011, 3, 22, 0, 0, 0), 'stock_price': 50.34},
 # ...
 {'timestamp': datetime(2010, 8, 5, 0, 0, 0), 'stock_price': 20.45}]

It is often necessary to compare items in two of these lists. In this instance the stock price could be compared to another asset on each day to see which one is higher. An alternative consideration might be to assess whether a stock is in a portfolio or not on a particular day.

The naive approach to solve this list comparison task is to run the comparison within a nested loop iterating across both lists:

for f in first_list:
    for s in second_list:
        if f['timestamp'] == s['timestamp']:
            if f['stock_price'] > s['stock_price']:
                print "Date: %s, First price is greater!" % \
                    f['timestamp']

This is a lot of computational effort to compare two equal-sized data structures that are already ordered in the same fashion. Python provides a far speedier alternative with the zip() function. In Python 2 the function accepts two (or more) lists and produces a list of tuples pairing up their elements by position. If the size of the original lists is N, then iterating over the zipped data structure reduces the comparison from an O(N^2) operation to O(N). Note that this relies on both lists containing the same timestamps in the same order, since elements are matched by position rather than by date.

for i, (f, s) in enumerate(zip(first_list, second_list)):
    if f['stock_price'] > s['stock_price']:
        print "%s, Date: %s, First price is greater!" % \
            (i, f['timestamp'])
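To make the comparison concrete, here is a small self-contained sketch written in Python 3 syntax (the listings in this post use Python 2). The sample prices and the helper names nested_compare and zipped_compare are invented for illustration, and both lists are assumed to be the same length and sorted by the same timestamps.

```python
from datetime import datetime

# Invented sample data: two aligned price series (same dates, same order).
first_list = [
    {'timestamp': datetime(2011, 3, 22), 'stock_price': 50.34},
    {'timestamp': datetime(2011, 3, 21), 'stock_price': 49.10},
    {'timestamp': datetime(2011, 3, 18), 'stock_price': 48.75},
]
second_list = [
    {'timestamp': datetime(2011, 3, 22), 'stock_price': 47.00},
    {'timestamp': datetime(2011, 3, 21), 'stock_price': 51.25},
    {'timestamp': datetime(2011, 3, 18), 'stock_price': 48.00},
]

def nested_compare(first, second):
    """O(N^2): check every pair and keep the dates where first > second."""
    return [f['timestamp'] for f in first for s in second
            if f['timestamp'] == s['timestamp']
            and f['stock_price'] > s['stock_price']]

def zipped_compare(first, second):
    """O(N): walk both lists in step; relies on the lists being aligned."""
    return [f['timestamp'] for f, s in zip(first, second)
            if f['stock_price'] > s['stock_price']]

if __name__ == '__main__':
    for ts in zipped_compare(first_list, second_list):
        print("Date: %s, First price is greater!" % ts)
```

Both helpers return the same dates on aligned input; only the amount of work differs.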

So far I've just discussed a simple case of list comparison via the zip function. Nothing crazy as of yet, although the operation has reduced the computational time significantly. The fun doesn't stop there though. We can bring in the itertools library to provide a further speed increase. Among the functions that itertools ships with are izip and count. izip behaves like zip but returns an iterator rather than building the full list of tuples in memory, while count yields an endless sequence of integers that takes the place of enumerate. The following code provides the same result as the above, but in less time and with less memory usage:

from itertools import izip, count
for i, f, s in izip(count(), first_list, second_list):
    if f['stock_price'] > s['stock_price']:
        print "%s, Date: %s, First price is greater!" % \
            (i, f['timestamp'])
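For readers on Python 3, where itertools.izip no longer exists because the built-in zip already returns a lazy iterator, a rough equivalent of the listing above might look like the following. The function name first_greater and the sample data are made up for the example.

```python
from itertools import count

def first_greater(first_list, second_list):
    """Yield (index, timestamp) wherever the first price is higher.

    zip() is lazy in Python 3, so no intermediate list of tuples is built.
    """
    for i, f, s in zip(count(), first_list, second_list):
        if f['stock_price'] > s['stock_price']:
            yield i, f['timestamp']

# Invented sample data; timestamps are plain strings to keep the sketch short.
first = [{'timestamp': '2011-03-22', 'stock_price': 50.34},
         {'timestamp': '2011-03-21', 'stock_price': 20.45}]
second = [{'timestamp': '2011-03-22', 'stock_price': 47.00},
          {'timestamp': '2011-03-21', 'stock_price': 21.00}]

for i, ts in first_greater(first, second):
    print("%s, Date: %s, First price is greater!" % (i, ts))
```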

I've run tests on actual financial data and have seen the three approaches above take roughly 30 seconds, 1 second and 0.7 seconds respectively. So, the next time you feel yourself reaching for the zip function, consider using izip instead.

If anybody believes they can speed this operation up further, either by an alternative data structure approach or with further refinement of the iteration, I'm keen to hear suggestions.
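One possible direction, offered as a sketch rather than a benchmarked result: if the two lists are not guaranteed to be aligned, indexing one of them by timestamp in a dictionary keeps the comparison at roughly O(N) without relying on ordering. The helper name and the sample data are hypothetical, and Python 3 syntax is used.

```python
def first_greater_by_key(first_list, second_list):
    """Compare prices by timestamp without assuming the lists are aligned."""
    # Build a timestamp -> price lookup once: O(N) time and memory.
    second_prices = {s['timestamp']: s['stock_price'] for s in second_list}
    result = []
    for f in first_list:
        other = second_prices.get(f['timestamp'])  # None if the date is missing
        if other is not None and f['stock_price'] > other:
            result.append(f['timestamp'])
    return result

# Invented sample data: different order, and one date missing from `second`.
first = [{'timestamp': '2011-03-22', 'stock_price': 50.34},
         {'timestamp': '2011-03-21', 'stock_price': 49.10},
         {'timestamp': '2011-03-18', 'stock_price': 48.75}]
second = [{'timestamp': '2011-03-18', 'stock_price': 48.00},
          {'timestamp': '2011-03-22', 'stock_price': 51.00}]

print(first_greater_by_key(first, second))
```

The trade-off is the extra memory for the lookup table, so on data that is already aligned the izip version is likely to remain the leaner choice.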

Mathematics and Physical Sciences Education

Thursday, 17th March 2011 - Michael Halls-Moore - 0 Comments

It's been quite a while since I've made a blog post. The cliche excuse would be to say that I've been snowed under with real life tasks. This is only partly true. Regardless, I thought it was time to update you all on some of my recent musings involving education.

It is quite clear that I am passionate about education, particularly in the fields of mathematics and physical sciences. My blog post on learning theoretical physics has turned out to be my most viewed post so far. Given that my background is in the physical sciences, I thought it was time I did something to contribute to the world's pool of knowledge beyond writing a few blog posts and handing in a thesis.

Having spent a lot of time on communities such as Hacker News and Slashdot I have come to realise that there is a growing thirst for online learning materials, particularly in the scientific arena. Organisations and movements like Khan Academy, MIT OpenCourseWare and Open Culture are pioneering a new era in education.

At this stage, I still don't feel like it is possible for autodidacts to achieve the same level of knowledge as that achieved by a graduate student finishing a masters-level science degree. The reasons are clear: It is a "niche market", the effort involved in providing a similar experience is substantial and crowdsourced methods at these levels would probably devolve into frivolous discourse and petty argument.

Thus I have taken it upon myself to correct this problem. I have decided to undertake the rather ambitious task of writing an entire set of lecture notes, question sheets, exam-style handouts and video lectures for a top-tier undergraduate mathematics and physics degree. Depending upon my free time availability, this task will not likely be "complete" for a long time. This does not matter. Every day adds a new sentence to the project and hence more knowledge than yesterday.

I won't pollute this post with an extensive map of all subjects with their hierarchical relationships, but I will provide a brief overview of my short term plan and the progress made so far. My rationale is that there are plenty of courses on Calculus/Linear Algebra 101 around the net. Khan Academy provides a great set of Linear Algebra videos as does Gilbert Strang at MIT. I do not want to reinvent the wheel just yet.

On the other hand, 200+ level courses are severely lacking. Thus it seemed appropriate to apply my areas of expertise to write some notes on the more advanced undergraduate subjects. Topics that I can provide solid information for include Vector Calculus, Partial Differential Equations, Fluid Dynamics, Numerical Linear Algebra, Classical Mechanics and Relativity. This is where I have decided to begin.

Fast-forward two months from my initial foray into online education and I have almost finished my first chapter of "Vector Calculus", discussing vector differential operators (grad, div, curl, Laplacian). The experience has been extremely insightful. I've written it from the perspective of a mathematician, rather than an engineer or physicist, and I had (almost!) forgotten the necessary level of precision required. It has forced me to fully understand a concept before committing it to text.

The notes will eventually reside at a domain I purchased a couple of years ago. I intend to output them both in PDF and in HTML/JS form, using the brilliant MathJax library. I also hope to gain a large number of "long tail search" visitors via SEO, as many of the niche mathematical terms are not competitive. If I gain enough in donations, then I may be able to purchase one of these to film video lectures!

I will stress now that the resources will always remain free in the sense of beer and speech. There will be no restrictions on usage or downloads - this is a core component of the idea. It is my way of returning the favour for all of the open source software and educational resources I have utilised over the years. This is not a startup - it is an educational resource that everyone can access and it will always remain so.

If anybody is keen to get involved with this project, particularly if they have extensive subject expertise in Pure Mathematics (Topology, Algebra), Statistics, Computer Science or Engineering, then I would be very keen to hear from you. Let's see what we can do to educate the world!

Agile Development Tools Server Part 2 - Trac

Saturday, 8th January 2011 - Michael Halls-Moore - 2 Comments

In Part 1 of this series I explained the benefits of building a separate agile development tools server to host your version control, project management, continuous integration, monitoring and backup tools. I outlined how to install the open-source version control system Subversion (SVN). Now I wish to show you how to tie SVN into a project management tool created by Edgewall, known as Trac. Trac provides bug tracking, repository integration with timelines, source code browsing, project wikis and a host of additional features with plugins. This tutorial will describe how to install Trac, configure it and connect it to SVN.

Let's begin by connecting to the server you created in Part 1 and installing Trac. I didn't explicitly specify whether I was using a VM or a hosted server instance for our Ubuntu Server 10.04 LTS setup, so you can decide how to perform the connection. We will use Python Setuptools to install Trac, rather than the Ubuntu binaries, as this will give us the most up to date stable version. At the time of writing this post the current stable version was 0.12:

sudo apt-get install python-setuptools
sudo easy_install Trac==0.12

Now we create the directory that Trac will use to store the configuration and database for our project. As with the Subversion directory, I prefer to keep these project roots out of any user home directory. On a multi-developer machine this ensures that access privileges are more straightforward to administer. We will also allow the Apache2 user (www-data) to take ownership of the directory:

sudo mkdir /var/lib/trac
sudo chown -R www-data:www-data /var/lib/trac

The next stage is to tell Apache how we would like to host the site. Firstly though, a word of warning on potential security threats. The Trac Wiki is a common place to store username/password combinations for the myriad services that crop up when running a startup. Serving Trac on the default HTTP port of 80 will send all information to and from the server in plaintext, including passwords stored in Wiki pages. Thus, as with Subversion, we will serve Trac over the Secure Sockets Layer (SSL) on port 443 and use https://***/ when we wish to interact with it. This stops potential snoopers stealing our passwords. Let's configure Apache with a new virtualhost with this information in mind:

sudo emacs /etc/apache2/sites-available/trac

My personal preference is to use subdomains for all of these services. For instance if your project is called My Project, then Subversion would live at https://svn.myproject.com/ while Trac would live at https://trac.myproject.com/ Add the following code to the file, replacing your ServerAdmin email and ServerName as appropriate.

<VirtualHost *:443>
  ServerAdmin youremail@myproject.com
  ServerName trac.myproject.com
  DocumentRoot /var/www/trac
  
  <Location />
    SetHandler mod_python
    PythonInterpreter main_interpreter
    PythonHandler trac.web.modpython_frontend
    PythonOption TracEnvParentDir /var/lib/trac
    PythonOption TracUriRoot /
    AuthType Basic
    AuthName "Trac"
    AuthUserFile /etc/apache2/svn.passwd
    Require valid-user
  </Location>
</VirtualHost>

Let's discuss what is happening in this file. The first line tells Apache that you want to listen on port 443 (HTTPS) for a new virtualhost directive. The name to listen for is provided by the ServerName attribute. This means that when your DNS is correctly configured, any traffic sent to this server on port 443 for "trac.myproject.com" will be handled by this virtual host.

We're telling Apache to use the mod_python module to handle all of the Trac code for serving. The SetHandler and PythonInterpreter directives hand requests to mod_python's main interpreter, which runs on the default Python installation (v2.6 on Ubuntu Server 10.04 LTS). Apache also needs to know where the Trac environment parent directory containing our configuration is located. In addition, we need to inform Apache which URL path to serve Trac from. We're telling Apache to serve it directly from the webroot. As such, https://trac.myproject.com/ will point to the Trac page itself.

The remaining four lines indicate that we wish to use Basic Authentication to stop clients accessing our project who do not provide access credentials. We authorise against the password file that we created in Part 1.
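As a small aside on why the earlier warning about plaintext matters here: Basic Authentication only base64-encodes the credentials, it does not encrypt them. The following Python 3 sketch builds and then decodes such a header. The username and password are placeholders and the helper names are invented for illustration.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value a client sends for Basic auth."""
    token = base64.b64encode(('%s:%s' % (username, password)).encode('utf-8'))
    return 'Basic ' + token.decode('ascii')

def decode_basic_auth(header_value):
    """Recover (username, password) from the header; no secret key is needed."""
    encoded = header_value.split(' ', 1)[1]
    decoded = base64.b64decode(encoded).decode('utf-8')
    username, _, password = decoded.partition(':')
    return username, password

# Placeholder credentials, not real ones.
header = basic_auth_header('myuser', 'secret')
print(header)
print(decode_basic_auth(header))
```

Anyone who can see the header on the wire can reverse it this easily, which is why the site should only ever be reached over HTTPS.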

Let's continue the tutorial by creating the DocumentRoot directory that we specified in the Apache configuration. Then we will enable the Trac site and restart Apache:

sudo mkdir /var/www/trac
sudo a2ensite trac
sudo /etc/init.d/apache2 restart

The next step is to initialise the Trac environment for myproject with some default configuration, using the trac-admin tool. This is where Trac keeps the database for wiki pages, tickets and reports. You will be prompted to enter some information, but I suggest just using the defaults if you are unsure:

sudo trac-admin /var/lib/trac/myproject initenv

We are now ready to integrate Trac with Subversion. This will provide us with the ability to view the codebase across all revisions (very handy!) and keep track of multiple branches/tags for release. Let's begin by backing up our Trac initialisation file (in case we make a mistake), allowing Apache to read it and then editing it:

sudo cp /var/lib/trac/myproject/conf/trac.ini \
/var/lib/trac/myproject/conf/trac.ini.bak
sudo chown -R www-data:www-data /var/lib/trac/myproject
sudo emacs /var/lib/trac/myproject/conf/trac.ini

Find the [header_logo] heading and edit the file so that it points to a logo of your choice, if you so desire. You will need to change the Alt text, the height/width, the link to point the logo to (perhaps your startup homepage) and the location of the image. In addition locate the [trac] heading and modify the repository_dir to point to your Subversion repository:

[header_logo]
alt = MyProjectLogo
height = -1
link =
src = /logo.gif
width = -1
 
..
..
 
[trac]
..
..
repository_dir = /var/lib/svn/repo
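If you would rather script this edit than make it by hand, something along the following lines could work. It is only a sketch in Python 3 using the standard configparser module, run against an in-memory sample rather than the real trac.ini. The function name is hypothetical, and note that configparser drops comments when it rewrites a file, so keep the backup copy of trac.ini.

```python
import configparser
import io

# A cut-down stand-in for trac.ini, used only for this example.
SAMPLE_INI = """\
[header_logo]
alt = MyProjectLogo
src = /logo.gif

[trac]
repository_dir =
"""

def set_repository_dir(ini_text, repo_path):
    """Return ini_text with [trac] repository_dir pointing at repo_path."""
    # interpolation=None keeps any literal % characters in values intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section('trac'):
        parser.add_section('trac')
    parser.set('trac', 'repository_dir', repo_path)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()

updated = set_repository_dir(SAMPLE_INI, '/var/lib/svn/repo')
print(updated)
```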

We need to provide administrative privilege for at least one user. This will allow you to update task categories, milestones and other meta information about your project. When you login to Trac with the following username, you will see an additional Admin tab:

sudo trac-admin /var/lib/trac/myproject \
permission add myuser TRAC_ADMIN

The final task before fully integrating with Subversion is to modify a set of Subversion hooks. These hooks are analogous to callback functions which are performed after certain events have been triggered. We are going to adjust the post-commit and post-revision-property-change hooks so that Trac is aware of the changes and is always up to date with our SVN repo. Subversion only runs a hook if it is an executable file named without the .tmpl extension, so we will copy each template before editing it. Let's begin with post-commit:

cd /var/lib/svn/repo/hooks
sudo cp post-commit.tmpl post-commit
sudo chmod +x post-commit
sudo emacs post-commit

Add the following lines to the post-commit file (note the underscore in TRAC_ENV, as shell variable names cannot contain hyphens):

TRAC_ENV="/var/lib/trac/myproject"
/usr/bin/trac-admin "$TRAC_ENV" changeset added "$REPOS" "$REV"

Save the changes, exit, then copy and open up post-revprop-change in the same way:

sudo cp post-revprop-change.tmpl post-revprop-change
sudo chmod +x post-revprop-change
sudo emacs post-revprop-change

Add the following lines to that file:

TRAC_ENV="/var/lib/trac/myproject"
/usr/bin/trac-admin "$TRAC_ENV" changeset modified "$REPOS" "$REV"

You should now find that interacting with your repository will push updates to your Trac environment.
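Subversion hooks can be any executable, so the same notification could also be written as a small Python script. The sketch below (Python 3 syntax) only mirrors the trac-admin call used in the hooks. The function names are invented, the environment path is the one used in this tutorial, and trac-admin is assumed to be on the PATH.

```python
import subprocess
import sys

TRAC_ENV = '/var/lib/trac/myproject'  # Trac environment used in this tutorial

def changeset_command(repos, rev, action='added'):
    """Build the trac-admin command that tells Trac about a changeset.

    action is 'added' for post-commit and 'modified' for post-revprop-change.
    """
    return ['trac-admin', TRAC_ENV, 'changeset', action, repos, str(rev)]

def main(argv):
    # Subversion passes the repository path and revision as the first two arguments.
    repos, rev = argv[1], argv[2]
    return subprocess.call(changeset_command(repos, rev))

if __name__ == '__main__' and len(sys.argv) >= 3:
    sys.exit(main(sys.argv))
```

Keeping the command construction in its own function makes it possible to check the arguments without actually invoking trac-admin.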

That concludes the tutorial on installing and configuring Trac. In my opinion, this is the minimum you need in order to have a functional tools server. A robust production grade system would not be complete without continuous integration or server monitoring. In addition, I have not discussed backup strategies or network access restriction via a firewall. I will also provide a workflow tutorial which will outline "best practice" for using the tools. All of these topics will be fleshed out in the remaining tutorials.

Let me know how you get on with your Trac environment, in the comments - I'm always keen to hear about other setups as well so do get in touch!

Agile Development Tools Server Part 1 - Subversion

Wednesday, 5th January 2011 - Michael Halls-Moore - 3 Comments

Creating an agile development tools server is possibly the biggest time saver for a developer when beginning a startup. Frantically running a manual deployment process, hacking together code from five developers without version control or breaking code without running unit tests are common scenarios which can be eliminated via the proper use of agile tools, in particular an agile tools server.

This is Part #1 of a series of tutorials which will show you how to build a highly robust server that you can use for version control, project management, continuous integration and full stack monitoring. In the first part we will cover setting up a version control repository using the open source Subversion (SVN) agile tool on the Ubuntu Server 10.04 LTS operating system.

The first step is to obtain yourself a fresh Ubuntu LTS machine (virtual or physical!). If you are just trying out some agile development tools and wish to give SVN a go, I recommend following my other tutorial on setting up a web development environment with VirtualBox and stop when you reach "Configuring the Virtual Machine". Alternatively, you can start a micro server instance in Amazon EC2 with one of the Canonical AMIs. I outline the basics of how to do that in my other tutorial on Desktop Ubuntu in Amazon EC2 - The Right Way. You can follow everything up until "Desktop Installation".

We will set up the tools box by installing the Apache2 webserver and using it to serve SVN across SSL, so that we have a secure encrypted connection to our repository. This stops others sniffing our packets and stealing our prized code! In addition we will enforce user read/write privileges so that we can choose who can modify our codebase.

Let's begin by updating and upgrading our server so that it has the latest security updates. Log onto the server and type:

sudo apt-get update
sudo apt-get upgrade

I generally install some "admin packages". These include my favourite text editor Emacs 23, htop (a much more useful version of top), screen (multiple terminals in one terminal!) and build-essential for any compilation from source that may be required:

sudo apt-get install build-essential emacs23 htop screen

The next step is to install all of the packages that Subversion requires in order to run. We install Apache and SVN, then the Python Module for Apache and Python-Subversion which is used to communicate with your repository via Python, funnily enough!

sudo apt-get install apache2 subversion libapache2-svn \
libapache2-mod-python python-subversion

We are now going to create a permanent location for your repository. I tend to create my repos outside of the home directory as I may need to change usernames at a later date. Let's put our repo in /var/lib/svn. To achieve this we run a subversion command, svnadmin, which populates the repo directory with all of the necessary configuration and storage files:

sudo mkdir /var/lib/svn
sudo svnadmin create /var/lib/svn/repo

We need to make sure that Apache has access to this repo, so let's change the ownership recursively (-R) to the www-data user for the svn directory:

sudo chown -R www-data:www-data /var/lib/svn

We now need to create a skeleton codebase to import into our new repository. You may already have a codebase in use that you wish to add, in which case you should place the code underneath the /tmp/myproject/trunk directory below. Otherwise follow these steps to create a new codebase:

sudo mkdir -p /tmp/myproject/branches
sudo mkdir /tmp/myproject/tags
sudo mkdir /tmp/myproject/trunk
sudo svn import /tmp/myproject \
file:///var/lib/svn/repo/myproject -m "Initial import"

The next step involves configuring Apache to make use of the encryption provided by the Secure Sockets Layer (SSL). SSL makes use of public-private key encryption. The "plaintext" is encrypted via the public key and decrypted via the private key. In order to employ SSL with Apache it is necessary to use a signed certificate, which usually involves authenticating our identity against a third party. In this instance we will use a self-signed certificate as there is no need to prove to third parties (i.e. website users) that we are, indeed, who we say we are. Let's install OpenSSL:

sudo apt-get install openssl

Now we need to generate our keys and certificates. We are going to use the genrsa command to create a 1024-bit RSA private key, protected with Triple DES encryption (the -des3 flag), and then the req command to create our Certificate Signing Request (CSR) from it:

openssl genrsa -des3 -out server.key 1024
openssl req -new -key server.key -out server.csr

You will be prompted for a passphrase to protect the key and, when generating the CSR, for an optional challenge password. It is convenient, but highly insecure, to use a key with no passphrase. I recommend that you set one.

Note: Upon rebooting the server once all of this installation is complete, apache2 will initialise and prompt you for this passphrase. On my VM I did not yet have keyboard access to the TTY and hence I was unable to enter it. The only way I could (immediately) find around this problem was to create an insecure key without a passphrase. I wasn't fussed as this was a test VM. On a live server I would have to spend longer thinking through the issue. Any suggestions would be greatly appreciated in the comments!

Now that we have created our key for our signing request, it is time to actually self-sign our certificate. We input our signing request and our key and output the SSL certificate file. Then we copy the files to Ubuntu's SSL directory. Finally, we enable the SSL Apache module with the shortcut a2enmod:

openssl x509 -req -days 365 -in server.csr \
-signkey server.key -out server.crt
sudo cp server.crt /etc/ssl/certs
sudo cp server.key /etc/ssl/private
sudo a2enmod ssl

It's time to configure Apache to serve our SVN repository over SSL. We need to create a basic authentication htaccess password file so that any random Joe who points their web browser at https://svn.yourproject.com/ is not allowed to view your codebase. Run the htpasswd command to create a new user myuser. You will be prompted for a password:

sudo htpasswd -cm /etc/apache2/svn.passwd myuser

In addition, we need to tell Apache to listen on port 443 (the default HTTPS port), so let's open up the virtual host configuration file and add in a NameVirtualHost for that port:

sudo emacs /etc/apache2/conf.d/virtual.conf

Add the following lines in the appropriate place:

NameVirtualHost *:80
NameVirtualHost *:443

Now we need to create the configuration for the SVN virtual host itself. I've called it svn-myproject but you are free to call it anything that does not clash with another virtual host. We're going to enforce SSL encryption and basic auth. The listing below should be reasonably self-explanatory regarding your own server information:

sudo emacs /etc/apache2/sites-available/svn-myproject

Now add the following in the file, replacing the ServerAdmin and ServerName attributes as necessary:

<VirtualHost *:443>
  ServerAdmin youremail@myproject.com
  ServerName svn.myproject.com
  DocumentRoot /var/www/myproject
  SSLEngine on
  SSLOptions +StrictRequire
  SSLCertificateFile /etc/ssl/certs/server.crt
  SSLCertificateKeyFile /etc/ssl/private/server.key
 
  <Location />
    DAV svn
    SVNPath /var/lib/svn/repo
    AuthType Basic
    AuthName "My Project SVN Server"
    AuthUserFile /etc/apache2/svn.passwd
    Require valid-user
  </Location>
</VirtualHost>

Let's create the web root to stop Apache throwing a warning about non-existent directories:

sudo mkdir /var/www/myproject

Enable the site and restart Apache:

sudo a2ensite svn-myproject
sudo /etc/init.d/apache2 restart

You can test that your repository is working by visiting it in your browser (don't forget the https:// prefix), using the IP address or domain that you configured in the virtual host above. It is quite handy to be able to browse the structure over a web connection sometimes!

Currently this will give any user who passes the basic auth test full read/write access to your repository. To allow more fine-grained access, you can edit the /var/lib/svn/repo/conf/authz file and add usernames with read-only ("r") or read-write ("rw") privileges. For Apache to honour these rules the virtual host must also point at the file, which, to the best of my knowledge, is done with the AuthzSVNAccessFile directive provided by mod_authz_svn. This is particularly useful if you are working in a team or wish to share your repo over the internet and only want to allow the masses read access.

In the next set of agile tools development server tutorials I will outline how to install Trac for project management, continuous integration and monitoring tools as well as off-site backup strategies.

I'd love to hear about other setups being used - perhaps Git or Bazaar with alternative project management tools. If there's anything you think I've left out, please let me know - I want the tutorial to be as accurate as possible!

My First Attempts Using OpenGL

Sunday, 2nd January 2011 - Michael Halls-Moore - 0 Comments

Every Christmas my parents are baffled by my choices when I provide them with a list of books that I would like to read. Last year (2010!) I decided that I would finally learn 3D graphics programming. First I needed to decide between OpenGL and DirectX as my primary platform. I have used DirectX before in a learning capacity, but that was back when I was a full-time Windows user. My bias towards open software led me to consider OpenGL, as any program I wrote could be ported to additional platforms. I had already researched some fantastic game and graphics development textbooks, but the OpenGL SuperBible (5th Edition) stood out among the crowd. I added it to my Christmas wishlist and Santa Claus delivered.

The book is well organised, teaching you all of the basic principles of OpenGL from the ground up, with no prior experience required. I am quite fortunate in that having studied mathematics at University, I don't have to worry about the "dreaded maths chapters" that cause confusion to some people when they first attempt 3D graphics. In addition, I am quite well-versed in C++ so the language choice is no barrier. Hence the book has been highly enjoyable to read through so far.

OpenGL is essentially a relatively low-level API to the graphics hardware. I have found that the level of abstraction sits perfectly between performance and ease of development. As the programmer you have to assemble your primitives (shapes) and order them in optimised buffer data structures. You then need to provide coordinate transformations to alter their positions as well as projective geometry transforms in order to provide the illusion of three-dimensionality on a two-dimensional surface.
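To give a flavour of the two kinds of transform mentioned above, here is a deliberately simplified sketch in plain Python 3. It is not OpenGL code and does not follow OpenGL's matrix conventions: it just translates a triangle away from the viewer and applies a basic pinhole perspective divide. All names and numbers are invented for illustration.

```python
def translate(point, offset):
    """Move a 3D point by a (dx, dy, dz) offset."""
    return tuple(p + o for p, o in zip(point, offset))

def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point onto the plane z = focal_length.

    Points further away (larger z) land closer to the centre of the image,
    which is what gives the impression of depth.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the viewer (z > 0)")
    return (focal_length * x / z, focal_length * y / z)

# A triangle in the z = 0 plane, pushed 2 and then 4 units away from the viewer.
triangle = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)]
near = [project(translate(v, (0.0, 0.0, 2.0))) for v in triangle]
far = [project(translate(v, (0.0, 0.0, 4.0))) for v in triangle]

print(near)
print(far)
```

The same triangle comes out half the size on screen when it is twice as far away, which is the effect the projective transform is there to produce.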

Although I am nearly at the stage of programming my own very basic shaders (custom code applied to vertices and surfaces) through the book, I have only had one night of coding so far. The first tutorial involves drawing a single triangle on a blue background - in essence an OpenGL "Hello World". Nothing spectacular, but the process of achieving this requires a lot of understanding of how the different components fit together. The next major task is to draw more advanced primitives and rotate them. Then textures and finally custom shaders for a more realistic look.

The reason for learning OpenGL is twofold. The first is that I am always interested in learning new mathematically-related software concepts. 3D graphics has a large body of mathematical literature associated with it and I imagine I can scour the arXiv and prior SIGGRAPH papers for interesting techniques to code up. The second reason is that I want to attempt to make a 3D game engine that is roughly equivalent to that of the Quake III engine. I want to learn about Binary Space Partitioning (BSP) as applied to graphics engines as well as spline-based curve rendering, which was one of the engine's main "selling points". There are also a multitude of interesting shader techniques that can be applied - such as bump mapping. If I really wanted to get creative I could try implementing some of the techniques on the latest and greatest CryEngine 3!

I'm off to read more of the book now. I'll keep you updated with my progress, however. Watch this space.