This article is the first in a series of reviews written about books covering fundamental topics in Quantitative Development and Analysis. The goal of this article is to help you identify the areas of Quantitative Trading you’re currently unfamiliar with. To facilitate this objective, each sub-series of reviews begins with an overview of the topics covered.
Without further ado, here are the reviews:
Software Development Fundamentals
Writing code that works is essential, but there’s a lot more to software development than just that. Creating clear, reusable, maintainable code that is highly performant and robust is equally important. Source Control, Test Driven Development (TDD) and Continuous Integration facilitate workflows through which high quality code can be efficiently produced.
Standard Algorithms, Data Structures and Design Patterns provide solutions to abstract problems and consequently help to separate code solving general problems from code specific to the problem at hand. Critically, these generic solutions increase code reusability and are optimised to minimise execution time (time complexity) and memory usage (space complexity). An Object-Oriented Programming approach uses encapsulation, polymorphism and inheritance to define groups of objects, their behaviours and their relationships in a way that promotes clear, reusable, maintainable code.
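To make those three Object-Oriented ideas concrete, here is a minimal sketch in Python (the classes are a hypothetical example of mine, not taken from any of the books reviewed below):

from abc import ABC, abstractmethod

class Instrument(ABC):
    # Encapsulation: state is held privately (by convention) behind methods.
    def __init__(self, name):
        self._name = name

    @abstractmethod
    def price(self):
        # Inheritance: every subclass shares this common interface.
        pass

class Bond(Instrument):
    def __init__(self, name, face_value, discount_factor):
        super().__init__(name)
        self._face_value = face_value
        self._discount_factor = discount_factor

    def price(self):
        return self._face_value * self._discount_factor

class Stock(Instrument):
    def __init__(self, name, spot):
        super().__init__(name)
        self._spot = spot

    def price(self):
        return self._spot

# Polymorphism: the same call works on any Instrument, whatever its concrete type.
portfolio = [Bond("gilt", 100.0, 0.95), Stock("ACME", 42.0)]
print(sum(asset.price() for asset in portfolio))  # 137.0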
Having understood the abstract theory, we need "tools" to concretely implement the ideas. Each tool brings its own set of individual capabilities and limitations that need to be carefully assessed on a problem-by-problem basis when considering the tool's effectiveness in solving a specific task. Note, I use the word "tool" very loosely here to refer to a technology used in the creation of software, be it programming languages, applications, libraries, databases, operating systems, text editors, integrated development environments (IDEs), etc.
As an example, code written in C++ can be made to run extremely fast, but at the expense of requiring the programmer to manually manage memory resources allocated on the heap. Manual memory management complicates code and necessitates the adoption of exception-safe programming techniques. Moreover, C++ code is verbose in comparison to a dynamically-typed language, such as Python or R, leading to longer development times and less concise code. On the other hand, the static typing of C++ can help catch errors at compile time, its explicitness can improve clarity and the extensive collection of C++ libraries, such as the Standard Template Library (STL) and Boost, offer ready-made optimised solutions to a host of generic problems.
More details about specific tools can be found in the book reviews that follow. Briefly, we cover: C++, C#/.NET, the Common Language Runtime (CLR), the Component Object Model (COM), LINQ, Excel interfacing, Excel DNA, Python, Pandas, NumPy, IPython, Matplotlib, SQL, Microsoft SQL Server and relational databases. In the course of discussing particular programming languages, multi-threading, exception-safe programming and vectorised programming will be introduced when pertinent, as they are not applicable to all languages.
To start off the reviews we return to the abstract theory of software development.
- Design Patterns: Elements of Reusable Object-Oriented Software
The seminal book Design Patterns: Elements of Reusable Object-Oriented Software, often referred to as the Gang of Four (GoF) book, introduced the idea of design patterns. The authors abstracted out twenty-three "techniques" they found used repeatedly throughout professional software (which they termed design patterns), established nomenclature for these design patterns and recorded them in written form.
The book is split into two halves; the first is a Case Study that demonstrates how design patterns can be applied in the creation of a Document Editor and the second is a comprehensive Catalogue of the twenty-three design patterns. Being aware of these design patterns will allow you to quickly and effortlessly identify when they can be applied to your own software to make your code more flexible, reusable and maintainable.
Most importantly, the terminology established in the book provides a common language that will enable you to easily communicate your software solutions both verbally and in code form. Each design pattern catalogued in the book is given a meaningful name, in addition to a description of its intent, motivation, capabilities, limitations, structure, implementation details and sample code examples.
After reading this book, you will likely notice that you have unknowingly used a number of design patterns in your pre-existing software, and realise that there were alternative decisions you could have made to improve it.
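To give a flavour of the catalogue, below is a minimal Python sketch of one of the twenty-three patterns, the Strategy pattern, which encapsulates interchangeable algorithms behind a common interface so they can be swapped at runtime (the classes are my own hypothetical example, not the book's sample code):

class MeanPayoff:
    def aggregate(self, payoffs):
        return sum(payoffs) / len(payoffs)

class MaxPayoff:
    def aggregate(self, payoffs):
        return max(payoffs)

class PayoffAnalyser:
    # The analyser is written against the strategy interface rather than a
    # concrete algorithm, so adding new strategies requires no changes here.
    def __init__(self, strategy):
        self._strategy = strategy

    def analyse(self, payoffs):
        return self._strategy.aggregate(payoffs)

payoffs = [0.0, 1.5, 3.0]
print(PayoffAnalyser(MeanPayoff()).analyse(payoffs))  # 1.5
print(PayoffAnalyser(MaxPayoff()).analyse(payoffs))   # 3.0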
- The Art of Unit Testing: with examples in .NET
Verifying that production code works as intended is as important as the creation of the production code itself. Thus, effort comparable to that spent creating production code should be spent creating the code that tests it. Unit Testing is the process of checking that all of the small parts of code which make up a piece of software (units) individually behave in the way they were intended to.
The Art of Unit Testing explains the complete process of creating, using and evaluating unit tests. Most valuably, the author (Roy Osherove) explains how to create unit tests in such a way that they do not break every time a single line of production code is changed, that what is being tested can be immediately comprehended, and that the unit tests can be unvaryingly relied upon to correctly diagnose the functionality of any particular unit of code.
The book is split into four parts. The first covers the theory of unit testing, explaining what it is and why it is important. The second covers the frameworks used to write tests: it demonstrates the NUnit testing framework, introduces the concepts of Stubs and Mocks (which isolation frameworks help you create) and shows how to design code so that it can be easily tested. The third is the most valuable; it provides practical advice on how to write readable, trustworthy and maintainable tests and how to organise your tests within your project. The final part suggests how to go about introducing unit testing practices into your organisation and applying unit testing to legacy code.
Unit tests allow you to immediately and precisely identify the location of bugs. Consequently, they enable the programmer to change production code without fear of breaking the entire program. This gives programmers the courage to take action in improving the software and hence better designed software results. Write unit tests!
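Although the book's examples are in .NET, the idea translates directly to any language; here is a minimal sketch using Python's built-in unittest module (the production function under test is a made-up example):

import unittest

def compound(principal, rate, years):
    # Production code under test: annual compounding.
    return principal * (1 + rate) ** years

class TestCompound(unittest.TestCase):
    # Each test checks one small behaviour of one unit in isolation,
    # so a failure points directly at the broken unit.
    def test_zero_rate_returns_principal(self):
        self.assertEqual(compound(100.0, 0.0, 10), 100.0)

    def test_single_year_of_compounding(self):
        self.assertAlmostEqual(compound(100.0, 0.05, 1), 105.0)

if __name__ == "__main__":
    unittest.main()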
Test Driven Development (TDD): A workflow for creating Unit Tests
Test driven development (TDD) is a workflow through which production code and unit tests are effectively created in unison. In test driven development, the goal of creating unit tests takes top priority and guides the creation of production code.
Robert Martin, also known as "Uncle Bob", provided these three rules that facilitate a TDD workflow:
1) You are not allowed to write any production code unless it is to make a failing unit test pass.
2) You are not allowed to write any more of a unit test than is sufficient to fail the unit test; and compilation failures of a unit test are failures.
3) You are not allowed to write any more production code than is sufficient to pass the one failing unit test.
Following this workflow ensures that your code is testable. And in order for your code to be testable, you are forced to design it in a way that adheres to many core principles of well-designed code. In addition, by making your tests fail before they pass, you are simultaneously checking that the test code itself works correctly and that the diagnoses of the units it tests can be trusted.
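To make the workflow concrete, here is a single red-green cycle sketched in Python (a hypothetical example of mine, not one taken from the books):

# Step 1 (red): write just enough of a test to fail. This test fails
# initially because is_multiple_of_three does not yet exist; under
# rule (2), that failure to compile/import counts as a failing test.
def test_three_is_a_multiple_of_three():
    assert is_multiple_of_three(3)

# Step 2 (green): write just enough production code to make the one
# failing test pass (rule 3). Strictly, even "return True" would do;
# the next failing test (e.g. not is_multiple_of_three(4)) then
# forces the real implementation below.
def is_multiple_of_three(n):
    return n % 3 == 0

# Step 3: write the next failing test, make it pass, and repeat.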
Integration Testing
In contrast to unit testing is integration testing, which is the process of simultaneously testing a collection of two or more units. The problem with using (only) integration testing is that it does not allow you to immediately identify which specific unit of code a bug resides in. This leads to more time spent debugging than would have been needed had unit tests been at hand.
Moreover, even if all your integration tests go as planned, you still cannot claim that your code is entirely correct as the integration tests will not have covered every path of logic. Write integration tests in addition to unit tests to verify collections of units combine (integrate) together correctly.
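Sticking with the Python sketches, an integration test exercises two (hypothetical) units together and verifies only their combination:

# Unit 1: parse a price quote from a raw string.
def parse_quote(raw):
    symbol, price = raw.split(",")
    return symbol, float(price)

# Unit 2: apply a broker commission to a price.
def apply_commission(price, rate=0.001):
    return price * (1 + rate)

# Integration test: checks the units combine (integrate) correctly;
# a failure here does not say which of the two units is at fault.
def test_parse_then_apply_commission():
    symbol, price = parse_quote("ACME,100.0")
    assert symbol == "ACME"
    assert abs(apply_commission(price) - 100.1) < 1e-9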
- C++ Design Patterns and Derivatives Pricing
Over the first seven chapters of C++ Design Patterns and Derivatives Pricing, the author, Mark Joshi, describes the process of creating a program that prices a class of financial products known as Exotic Options. Moving through each of these chapters, he iteratively adds new features and improves the software by applying some of the design patterns catalogued in Design Patterns: Elements of Reusable Object-Oriented Software, reviewed above.
To follow the algorithm used to price options, Martingale pricing theory and stochastic calculus first need to be understood. To be more specific, the program prices options using Monte Carlo simulations to calculate the risk-neutral expectation of the discounted option payoff at maturity, assuming the underlying stock follows geometric Brownian motion. If you have not yet learnt stochastic calculus and Martingale pricing theory, you should first read chapters one through seven of Mark Joshi’s introductory book to financial mathematics: The Concepts and Practice of Mathematical Finance.
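The book builds this pricer in C++; as a rough illustration of the underlying algorithm (sketched in Python for consistency with the other snippets in this article, and for a plain European call rather than an exotic), it might look as follows:

import numpy as np

def monte_carlo_call_price(spot, strike, r, sigma, maturity, n_paths=1000000):
    # Simulate terminal stock prices under geometric Brownian motion:
    # S_T = S_0 * exp((r - sigma^2 / 2) * T + sigma * sqrt(T) * Z)
    np.random.seed(42)
    z = np.random.standard_normal(n_paths)
    s_t = spot * np.exp((r - 0.5 * sigma ** 2) * maturity
                        + sigma * np.sqrt(maturity) * z)
    # Risk-neutral expectation of the discounted payoff at maturity.
    payoffs = np.maximum(s_t - strike, 0.0)
    return np.exp(-r * maturity) * payoffs.mean()

print(monte_carlo_call_price(spot=100.0, strike=100.0, r=0.05,
                             sigma=0.2, maturity=1.0))
# approximately 10.45, the Black-Scholes value for these parameters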
Throughout C++ Design Patterns and Derivatives Pricing, but in particular in the later chapters, a number of important points specific to writing code in C++ are covered, such as: exception-safe programming, smart pointers, passing-by-reference, virtual functions, virtual tables (vtables), Boost, QuantLib, the Standard Template Library (STL) and decoupling. A chapter on Excel interfacing using the xlw package is also provided, but I recommend using Excel DNA and managed C++ instead for creating Excel Add-Ins. Excel DNA is covered in C# for Financial Markets and my supplementary book review below.
In addition to creating a C++ program that prices Options using Monte Carlo simulations, a program to price Options using Trees and a program to compute the implied volatility are also implemented. The overarching purpose of the book is to walk the reader through detailed examples that demonstrate how Financial Mathematics, C++ and Design Patterns all fit together.
- C# for Financial Markets
Duffy’s C# for Financial Markets covers a breadth of tools useful for quantitative finance that can be used in conjunction with C# and the .NET framework. Tools covered include: C#, the Component Object Model (COM), the Common Language Runtime (CLR), Reflection and Metadata, Excel Interfacing, Excel DNA, Real-time Data (RTD) Servers and LINQ.
In addition to these tools, financial topics such as: Bond Pricing from yields; Option Pricing using the Binomial Method; Derivative Pricing using the Trinomial and Finite Difference Methods; Curve Building via the Bootstrapping Process and Interpolating Interest Rate Curves are covered. Beyond the very basics, however, I would not recommend learning financial mathematics from this book. There are clearer, more thorough and less error-prone treatments elsewhere. For example, see Mark Joshi’s introductory book: The Concepts and Practice of Mathematical Finance.
The book begins by describing the basics of the C# programming language. Notable differences between the C# and C++ programming languages that are highlighted include: automatic garbage collection, finalisers, extension methods, delegates, multi-cast delegates, getter/setter methods and sealed classes. Following the introduction to C#, an explanation of the financial mathematics topics listed in the preceding paragraph is then provided.
Next, the author provides an exposition of how the .NET framework enables code written in C# to seamlessly interact with other languages, such as C++, SQL and R, and with Microsoft applications such as Excel. Moreover, how to write C# code that can be called from within an Excel spreadsheet through the C API by using Excel DNA is described.
Finally, the topics of multi-threading and creating parallel applications in C# are introduced. Multi-threading is the process of using multiple threads to simultaneously execute separate portions of a program’s code. The benefit of multi-threading is faster, more-responsive programs. However, since the order in which certain parts of code are executed and the state of resources (data) shared across threads are critical for a program to function correctly, multi-threading introduces a number of challenging programming problems over single-threaded programming.
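The book's examples are in C#, but the core difficulty can be sketched in Python too (a contrived example of mine): the read-modify-write on shared state below is only correct because of the lock; remove it and concurrently running threads can interleave and silently lose updates.

import threading

counter = 0
lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        with lock:        # serialise access to the shared counter
            counter += 1  # read-modify-write on shared state

threads = [threading.Thread(target=increment_many, args=(100000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000 with the lock; possibly less without it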
Excel DNA: Steps to create an Excel Add-In in Visual Studio
First download Excel DNA and extract the zipped folder.
Open Visual Studio and create a project using the "Visual C#" -> "Class Library" template. Then, within Visual Studio, carry out the following steps:
Create the following three files
Press <Ctrl>+<Shift>+<A> to create each of the files.
1) Create a new file named <VisualStudioProjectName>.dna
The <VisualStudioProjectName>.dna file defines the path to the dynamic link library (dll) that will contain the C# functions which we wish to be able to call from Excel. This dll file will be built at step (7). In addition to the path to the dll, other metadata is also defined in the <VisualStudioProjectName>.dna file.
An initial template for <VisualStudioProjectName>.dna that you can use is given below:
<DnaLibrary Name="Add-in Name" RuntimeVersion="v4.0">
    <ExternalLibrary Path="VisualStudioProjectName.dll" />
</DnaLibrary>
2) Create <VisualStudioProjectName>.xll by copying
In the Excel DNA directory that you downloaded earlier there is a "Distribution" subdirectory that contains a file named "ExcelDna.xll".
The <VisualStudioProjectName>.xll file is a copy of this ExcelDna.xll file renamed to have the same filename as the <VisualStudioProjectName>.dna file created in step (1). Simply copy and paste the ExcelDna.xll file into your Visual Studio project and rename it appropriately.
3) Rename and edit the Class1.cs file to <anotherFilename>.cs
A Class1.cs file will have been automatically created for you when you created the project. Rename this C# file to <anotherFilename>.cs as you please. The <anotherFilename>.cs file contains the C# code for the managed Excel Add-In you wish to create.
As an example, to create an Add-In that adds two integers we add the following content to the <anotherFilename>.cs file:
using ExcelDna.Integration;

namespace VisualStudioProjectName
{
    public class Calculator
    {
        [ExcelFunction(Description = "Adds two integers")]
        public static int AddTwoIntegers(int first_int, int second_int)
        {
            return first_int + second_int;
        }
    }
}
A reference to the ExcelDna.Integration dynamic link library needs to be added to fix the errors that will appear after creating the above C# file.
Reference ExcelDna.Integration
4) Reference the ExcelDna.Integration dll
Add a reference to the ExcelDna.Integration.dll in your Visual Studio Project so that the classes and functions from the ExcelDna.Integration library can be used.
Right-click "References" in the Solution Explorer, click "Add Reference…", click "Browse", locate the "ExcelDna.Integration.dll" found in the "ExcelDna" directory you downloaded earlier and click "OK".
Set File Properties
5) Set the "Copy to Output Directory" property of the files <VisualStudioProjectName>.dna and <VisualStudioProjectName>.xll to "Copy if Newer".
6) Set the "Copy Local" property of the ExcelDna.Integration reference to "False".
Build and Load the Add-in
7) Build the Visual Studio Project
8) Load the Add-In in Excel
Open the built <VisualStudioProjectName>.xll file from within Excel (File -> Open -> path/to/<VisualStudioProjectName>.xll). The <VisualStudioProjectName>.xll file will likely be located at one of the following locations:
path/to/visual_studio_project/bin/debug/<VisualStudioProjectName>.xll
path/to/visual_studio_project/bin/release/<VisualStudioProjectName>.xll
9) Call the C# functions using Excel Formulae
For example, to call the AddTwoIntegers C# function defined in step (3), we would write "=AddTwoIntegers(1,2)".
Excel Interfacing using the Component Object Model (COM)
An alternative method to create Excel Add-Ins is to make use of the Component Object Model (COM), which allows you to register a reference to an object (such as a C# class) in the Windows Registry. An object registered in the Windows Registry can then be accessed from an external program by providing its globally unique identifier (GUID).
To access the Windows Registry, admin privileges that normal account users do not have are required, thus preventing normal users from using Excel Add-Ins constructed in this way. On the other hand, Excel DNA communicates with Excel through the Excel C API and XLLs, which does not require admin privileges and offers the quickest way to interface C# to Excel. Therefore it is recommended that you use Excel DNA over the Component Object Model. Note that it is not possible to directly access the C API from C# code; this is exactly the problem Excel DNA solves.
- Python for Data Analysis
Wes McKinney, the author of Python for Data Analysis, is the creator of a Python module called Pandas. Wes created Pandas to help solve the problems he was tackling during his employment as a Quantitative Analyst at a hedge fund. In this book, he explains how to use the Pandas, IPython, NumPy and Matplotlib Python modules to prepare data for statistical and exploratory data analysis. Note that the book does not go beyond this preparatory step; that is, there are no advanced statistical analyses to be found here. For more on statistical analysis, see the Statistics Fundamentals book reviews in the next article.
IPython allows you to execute Python commands interactively and thus is indispensable for exploratory data analysis, where you do not know in advance exactly what commands you will be running. Matplotlib provides extensive plotting capabilities, enabling you to easily visualise your data. To output Matplotlib plots directly inside (inline) of IPython, from the command line execute: ipython qtconsole --pylab=inline. NumPy offers an array of mathematical operations, stores data internally in an efficient manner and optimises performance by allowing you to make full use of the vector processor under the hood of your machine.
Pandas contributes the DataFrame and Series data structures, which make the reshaping, joining and aggregation of tabular and time series data simple. For time series data, functionality that accounts for the intricacies of calendars (leap years, holidays, timezones, etc.) is provided for you. The DataFrame construct provides operations that will be familiar to anyone with SQL and relational database experience. Among many other features, it allows you to easily select, update, insert, delete, slice-and-dice, group, aggregate, pivot, join, concatenate and transform tabular data.
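As a small taste of these operations (the data below is made up), a DataFrame can be filtered, grouped and aggregated in a few lines:

import pandas as pd

trades = pd.DataFrame({
    "symbol": ["ACME", "ACME", "XYZ", "XYZ"],
    "side":   ["buy", "sell", "buy", "buy"],
    "price":  [10.0, 10.5, 20.0, 19.5],
    "size":   [100, 50, 200, 100],
})

# SQL-style select/filter: keep the buys only.
buys = trades[trades["side"] == "buy"]

# Group by symbol and aggregate, much like SQL's GROUP BY.
summary = buys.groupby("symbol").agg(
    total_size=("size", "sum"),
    average_price=("price", "mean"),
)
print(summary)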
Finally, a brief overview of Python is found in the Appendix of this book; however, it should be noted that this book is not intended to be an introduction to the Python programming language.
Vectorised Programming
Having mentioned NumPy in this book review, now is a good place to interject an explanation of Vectorised Programming (also known as Array Programming). Array programming makes use of a computer's vector processor to reduce the time it takes to apply a single operator to every element of an array. NumPy enables us to use array programming techniques in Python.
IPython: arrayProgrammingExample.py
Run IPython and execute the below commands.
# Computes the element-wise square of the array [1, 2, …, 10^6]
# Result: [1, 4, …, 10^12]
import numpy as np
LENGTH_OF_LIST = int(1E6)
# Method (1) – Without array programming:
list_py = range(1, LENGTH_OF_LIST + 1)
%timeit for x in list_py: x * x
# output: 10 loops, best of 3: 137 ms per loop
# Method (2) – With array programming:
list_np = np.arange(1, LENGTH_OF_LIST + 1)
%timeit list_np * list_np
# output: 1000 loops, best of 3: 1.67 ms per loop
(Note: you can copy and paste the commands into IPython by first copying the text and then executing the following four IPython commands: %cpaste, <ENTER>, <Ctrl-v>, <Ctrl-d>.)
Both compute the element-wise square of the same list of numbers but Method (2) is 82 (=137 ms/1.67 ms) times faster (note: the precise run-time measurements will be different on your computer)! To understand why, the instructions carried out by the processor must be examined.
Method (1) of computing the square of the array is converted into the following sequence of instructions:
execute this loop 10^6 times
read the next instruction and decode it
fetch the number
square it (execute the decoded instruction)
put the result here
end loop
Method (2) of computing the square of the array is converted into the following sequence of instructions:
read the instruction and decode it (once)
fetch the 10^6 numbers
square them (execute the decoded instruction)
put the results here
The processor is capable of fetching data while simultaneously decoding an instruction but it is unable to decode the next instruction until it has finished with the current. All of the instructions here are the same (square the data) but Method (1) redundantly decodes the same instruction multiple times. More critically, if your computer’s vector processor has multiple functional units it is able to execute a single instruction with multiple data inputs simultaneously in parallel. From the instructions detailed above we see that Method (2) makes full use of these capabilities while Method (1) does not.
Note that the Python Interpreter is unable to construct the instructions of Method (2) because it interprets each line of Python code one at a time. NumPy calls a compiled C program to carry out the operation. Because the C code is compiled, the surrounding lines of code are considered together and thus the instructions of Method (2) can be constructed. It should also be noted that the standard Python interpreter (CPython) does not support thread-level parallelism: CPython has a global interpreter lock (GIL) that prevents multiple threads from simultaneously executing Python bytecode, as CPython's memory management is not thread-safe.
For more information regarding array programming, refer to the "Description" subsection of the Vector Processor Wikipedia page, and the Single Instruction Multiple Data (SIMD) Wikipedia page.
- Learning SQL
Learning SQL will teach you how to store data in a relational database and how to get the information you want out of it. The book covers everything from the complete basics, including the SELECT/INSERT/UPDATE/DELETE SQL syntax and the first three normal forms that help to reduce data redundancy, to slightly more advanced topics such as correlated subqueries, transactions, rollups, views, triggers and metadata.
The book is agnostic to the relational database server used, be it MySQL, Oracle or Microsoft SQL Server. Where there are minor differences between these servers and their SQL languages, the author explains what the differences are. An important topic that is outside the scope of the book due to the large differences between the servers is execution plan optimisation. For MySQL the author recommends the book High Performance MySQL for learning how to generate indexes, analyse execution plans, influence the optimiser via query hints and tune your server’s start-up parameters.
Once you have read Learning SQL and done all of the end-of-chapter exercises, you will have a good command of the SQL programming language. However, you will need to do further independent research into the more advanced SQL features and practice applying the knowledge you acquire to fully appreciate the potential of SQL. You will be pleasantly surprised by what can be achieved using only the elementary operations in SQL.
The advanced topic of execution plan optimisation not discussed in the book is described below. We focus on the Microsoft SQL Server.
Execution Plan Optimisation in Microsoft SQL Server (MSSQL)
Unlike other programming languages, SQL does not allow the programmer to specify exactly how a SQL query is carried out. Instead an execution plan is generated by MSSQL’s Query Optimiser. The execution plan details the sequence of operations that will be executed to produce the result of the query. The Query Optimiser attempts to find the optimal sequence of operations that achieve the result set given the specific database conditions and query.
To view the Execution Plan in Microsoft SQL Server Management Studio, click the "Query" menu item and then "Display Estimated Execution Plan". The diagram that appears is read by starting in the top-right (not left) corner and reading from top-to-bottom and right-to-left (not left-to-right). The percentages displayed by each node in the diagram represent the estimated cost of the single operation relative to the complete set of operations that make up the entire query. It should be stressed that it is only an estimate and there are cases that you should be aware of where it will likely be significantly inaccurate; for example, when table variables are involved, the optimiser always assumes that a table variable contains only a single row.
Using the execution plan of a query, we can identify ways to tune (improve the performance of) the query. Assuming the estimated operation costs are accurate, we can focus on improving those operations with a high relative cost. Methods to improve an operation include: removing unnecessary tables and columns from a query; creating covering indices on the searched columns and creating an indexed view (materialised view) for the query.
To measure the actual time spent executing each operation and the number of input/output operations, place the following SQL statements around your SQL query:
SET STATISTICS IO ON
GO
SET STATISTICS TIME ON
GO
-- Place your SQL query here
GO
SET STATISTICS TIME OFF
GO
SET STATISTICS IO OFF
GO
In a typical database there are dozens of queries spread across stored procedures, views and functions. Tuning all of these queries would not be a good use of time, so we need a way to narrow down the complete set of queries to a smaller subset from which we will get the most performance enhancement. To identify the top twenty most expensive queries in terms of read input/output cost (that still have their plans cached at the time you run the query), execute the following query:
SELECT TOP 20 SUBSTRING(qt.text, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(qt.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2)+1),
qs.execution_count,
qs.total_logical_reads, qs.last_logical_reads,
qs.min_logical_reads, qs.max_logical_reads,
qs.total_elapsed_time, qs.last_elapsed_time,
qs.min_elapsed_time, qs.max_elapsed_time,
qs.last_execution_time,
qp.query_plan
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE qt.encrypted=0
ORDER BY qs.total_logical_reads DESC
See the Microsoft article titled "Optimising SQL Server Query Performance" for more details.
- Clean Code: A Handbook of Agile Software Craftsmanship
When beginning programming, your sole concern is to get the code to work. As a result, you develop a long list of bad habits that you may not question later; the book Clean Code will help you to fix these bad habits and develop good ones. Robert Martin, also known as "Uncle Bob", establishes best practices for structuring code in a way that is easily maintained, extended and tested in his book Clean Code: A Handbook of Agile Software Craftsmanship.
The first part of the book, spanning the first thirteen chapters, sets out to define, evaluate and justify principles for developing "clean" code. Three Case Studies demonstrating how to apply these principles to existing "dirty" codebases are presented in the second part of the book. In the third and final section, a summary of the common problems and their associated best-practice solutions is given.
Reading the first section of this book will significantly improve the readability and overall quality of your code. The following principles and topics are covered: Naming Conventions, Structuring Functions, Expression through Code not Comments, Arranging Code, Data/Object Anti-Symmetry, Error Handling, Unit Tests.
- Further Resources and Software Development Topics
In addition to the tools covered in this article, there is still more you will need to know about software development. For now, I leave you with a list of topics you can independently research:
source control (Git), scientific computing (floating-point number representation, overflow, underflow, bits), continuous integration, building programs from source, computer architecture, interpreters versus compilers, linking libraries, makefiles, virtual machines (VirtualBox), the Linux operating system (Ubuntu), grep/sed/awk, text editors and IDEs (Vim, PyCharm, Visual Studio, Emacs), refactoring, NoSQL (Elasticsearch, Kibana, Apache Spark), Continuum's Anaconda.
Good luck with your studies!