Using sklearn’s decision trees to form simple trading decisions for trading-bots

Justin Güse
Oct 9, 2022

I have been writing and developing analytics and trading bots for quite some time now, usually involving very complex LSTM (deep learning) models, which get harder and harder to handle.

This is when I asked myself whether I could capture similar performance using simple if/else-based conditions. I like simple things, and the better an algorithm generalizes, the lower the risk of overfitting. This has been an issue recently, as highly specialized trading bots performed quite badly during the “corona crashes”, which they simply could not handle because they were overly specialized.

So what “simple” trading decisions could possibly exist?
Everyone might know the “standard” YouTube algorithms like the “golden cross” and the “death cross”, where you buy when the 50-day SMA crosses above the 200-day SMA and sell when it crosses below. This works nicely on some examples but usually fails to capture the more complex market movements.
Additionally, a lot of people have worked out simple rules combining the RSI with trend lines, but these usually perform equally badly.

The result in the end

This is where I got the idea to use simple ML to figure out possible correlations or decisions. Whilst recent neural network models perform quite well, they represent their logic in hidden layers, which are hard for a human to understand. That is why I decided to go with the most basic of basic models and see how well they perform: decision trees.

Find the whole Jupyter notebook at https://github.com/JustinGuese/tradingbot22-tradingbots/blob/main/jupyternotebooks/decision-tree-tradingbot.ipynb

What are decision trees?

A decision tree is, simply put, a series of nested if/else statements, like: if the relative strength index (RSI) is below 30, do this; if not, do that.

Sklearn can help us with this through its DecisionTreeClassifier, imported with: from sklearn.tree import DecisionTreeClassifier
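To make the if/else idea concrete, here is a minimal sketch on made-up numbers. The RSI values and the "below 30 means buy" labels are invented for illustration only and are not taken from the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up for illustration): one feature, the RSI, and a label
# that is 1 ("buy") whenever the RSI is below 30 and -1 ("sell") otherwise.
rsi = np.array([[12.0], [25.0], [28.0], [35.0], [50.0], [71.0], [88.0]])
label = np.where(rsi.ravel() < 30, 1, -1)

# A tree of depth 1 is a single if/else split.
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(rsi, label)

# export_text prints the learned rules as nested conditions.
rules = export_text(tree, feature_names=["rsi"])
print(rules)
```

The tree places its split halfway between the last "buy" sample and the first "sell" sample, so the printed rule is the same kind of statement a human would write by hand.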

Preparing the data

To use this decision tree classifier, I want to “tag” data into classes. The most simple classes with trading data are “uptrend” and “downtrend”, or “buy” and “sell”.

But first, let us get some data including technical indicators.

Downloading data

I am using the Python package “yfinance” to download stock data.

This function downloads daily data for a given ticker that is listed on https://finance.yahoo.com/, adds all kinds of Technical Indicators using the ta library, adds a simple moving average (SMA), and fills nan values backward.
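The original function is not reproduced here, so the following is only a sketch of what such a getData could look like. The 30-day SMA window, the "SMA" column name and the helper add_sma_and_fill are my assumptions; the download part needs network access plus the yfinance and ta packages.

```python
from datetime import date

import pandas as pd


def add_sma_and_fill(df: pd.DataFrame, sma_window: int = 30) -> pd.DataFrame:
    """Add a simple moving average of the close and back-fill NaN values."""
    out = df.copy()
    out["SMA"] = out["Close"].rolling(sma_window).mean()
    # The first rows of rolling indicators are NaN; fill them backward.
    return out.bfill()


def getData(ticker: str, start: date, sma_window: int = 30) -> pd.DataFrame:
    """Download daily candles and attach indicators (needs network access)."""
    # Imported here so the helper above also works without these packages.
    import yfinance as yf
    from ta import add_all_ta_features

    df = yf.Ticker(ticker).history(start=start.isoformat(), interval="1d")
    df = df[["Open", "High", "Low", "Close", "Volume"]]
    df = add_all_ta_features(
        df, open="Open", high="High", low="Low", close="Close", volume="Volume"
    )
    return add_sma_and_fill(df, sma_window)
```

A call such as getData("MSFT", date(2010, 1, 1)) would then return one row per trading day with the price columns, the ta indicator columns and the SMA.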

Example data of MSFT

How to create the target up- and downtrend values?

But how can we tell from the above chart what an uptrend and downtrend is?

Many people work with “pct_change” (percent change) and tag each timestep as up or down compared to the previous one, but I found that this is too fine-grained, especially if we want to create simple trading decisions. Therefore I would prefer to capture the general “flow”, or trend.

One way to do this is to first smooth the “line” with an SMA, which averages the values, and then find the local maxima and minima of the smoothed line.

For this, we can use the SciPy function “argrelextrema” (from scipy.signal), together with the NumPy comparators np.less and np.greater.

The result of argrelextrema

As we can see, this nicely tags downtrends as “-1” / “sell”, and uptrends as “1” / “buy” in the DataFrame column “signal”.
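The tagging described above can be sketched as follows. The function name label_trends, the window of 20 and the order of 10 are assumptions of mine, and the notebook may handle the rows before the first extremum differently.

```python
import numpy as np
import pandas as pd
from scipy.signal import argrelextrema


def label_trends(close: pd.Series, sma_window: int = 20, order: int = 10) -> pd.Series:
    """Return 1 (uptrend) or -1 (downtrend) for every row of `close`."""
    smooth = close.rolling(sma_window, min_periods=1).mean().to_numpy()

    # Indices where the smoothed line is lower/higher than its `order`
    # neighbours on both sides.
    minima = argrelextrema(smooth, np.less, order=order)[0]
    maxima = argrelextrema(smooth, np.greater, order=order)[0]

    signal = pd.Series(np.nan, index=close.index)
    signal.iloc[minima] = 1.0   # after a local low the line rises
    signal.iloc[maxima] = -1.0  # after a local high the line falls
    signal = signal.ffill()     # carry each label up to the next extremum

    # Rows before the first extremum belong to the opposite trend.
    # If no extremum was found at all, fall back to 1 (arbitrary choice).
    known = signal.dropna()
    lead = -known.iloc[0] if len(known) else 1.0
    return signal.fillna(lead).astype(int)
```

Assigning the result to a column, for example df["signal"] = label_trends(df["Close"]), gives the target that the classifier is later trained on. A larger order or SMA window yields fewer, longer trends.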

Creating a baseline

It is always a good idea to create some kind of comparison or baseline for your algorithm. Therefore let us calculate how much you would have earned if you had bought all the stock you could afford on the first day of data and sold everything on the last. I have grabbed MSFT (Microsoft) data from 2010 until 2022.
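A buy-and-hold baseline of this kind can be sketched in a few lines. The starting capital of 10,000 and the simple (non-compounded) per-year figure are assumptions of mine, since the article does not state how its numbers were computed.

```python
import math

import pandas as pd


def buy_and_hold(close: pd.Series, start_money: float = 10_000.0,
                 commission: float = 0.0) -> dict:
    """Buy whole shares on the first day, sell them all on the last day."""
    first, last = float(close.iloc[0]), float(close.iloc[-1])

    # Buy as many whole shares as the cash allows, fee included.
    shares = math.floor(start_money / (first * (1 + commission)))
    cash = start_money - shares * first * (1 + commission)

    # Sell everything on the last day, fee deducted from the proceeds.
    end_money = cash + shares * last * (1 - commission)
    win = end_money - start_money

    # Requires a DatetimeIndex to measure the holding period.
    years = (close.index[-1] - close.index[0]).days / 365.25
    return {
        "win": win,
        "pct_per_year": win / start_money * 100 / years,
        "win_per_month": win / (years * 12),
    }
```

Called as buy_and_hold(df["Close"]), it returns the total win together with a simple average per year and per month.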

with just holding you would have made 96545.12$ 
or 74.57% per year
or 621.40$ per month

Not too bad! Keep in mind that we could not have known back in 2010 that Microsoft would grow so much: classic survivorship bias.

Now we have a baseline that our algorithm should beat; otherwise, it does not make sense.

Simulating a trading environment

To simulate a trading environment, we need to loop through the data and make a trading decision on each day. For this I have written the “oneRun” function:

Additionally, we set a variable “COMMISSION”, which represents the commission on each trade. For Interactive Brokers I use 0.00025, expressed as a fraction of the trade value (that is, 0.025%). Set this to the fee you are paying, for example COMMISSION = 0.00025.

X is the DataFrame we created earlier.

You might notice the “model” that is passed. This should be a model containing a “predict” function, which returns 1 for “buy” and “-1” for sell.
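The original “oneRun” is not shown here, so the following is a sketch of how such a loop could look rather than the notebook's implementation. The starting capital, the all-in position sizing, fractional shares and the handling of the "signal" column are all assumptions.

```python
import pandas as pd

COMMISSION = 0.00025  # fee per trade as a fraction of the trade value


def oneRun(X: pd.DataFrame, model, start_money: float = 10_000.0,
           commission: float = COMMISSION) -> float:
    """Replay the data day by day and return the win in dollars."""
    # The target column, if present, must not be fed to the model.
    features = X.drop(columns=["signal"], errors="ignore")
    decisions = model.predict(features)  # one decision per row: 1 or -1

    cash, shares = start_money, 0.0
    for price, decision in zip(X["Close"], decisions):
        if decision == 1 and shares == 0:
            # Go all in; fractional shares keep the sketch short.
            shares = cash * (1 - commission) / price
            cash = 0.0
        elif decision == -1 and shares > 0:
            # Sell the whole position and pay the fee on the proceeds.
            cash = shares * price * (1 - commission)
            shares = 0.0

    # Value any position that is still open at the last close.
    end_money = cash + shares * X["Close"].iloc[-1]
    return end_money - start_money
```

Any object with a predict method can be passed as the model, which is what makes it easy to swap a random model for a decision tree later.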

Let us use a “Fake”-Model for now, which randomly returns something.

As you can see, it will return an array of the predictions, as sklearn models would do it.
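A random stand-in of this kind could be sketched like this. The optional seed parameter is my addition so that runs can be repeated.

```python
import numpy as np


class FakeModel:
    """Stand-in for a classifier: predicts buy (1) or sell (-1) at random."""

    def __init__(self, seed=None):
        self.rng = np.random.default_rng(seed)

    def predict(self, X):
        # One random decision per input row, shaped like sklearn's predict().
        return self.rng.choice([-1, 1], size=len(X))
```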

Executing the “oneRun” function with the FakeModel will result in:

with the random model you would have made 1461.83$
or 1.13% per year
or 9.41$ per month

As you might have guessed, randomly trading would end in a bad result (still positive though 👌).

Using a decision tree classifier

Using a decision tree classifier is quite straightforward.
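As a rough sketch, fitting the classifier and printing a report could look like the code below. The helper name fit_and_report and the random demo data are placeholders of mine; in the article, X would be the indicator columns and y the "signal" column. Note that the model is scored on the same rows it was fitted on, and the settings that produced the report shown next are not given here.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


def fit_and_report(X, y, **tree_params):
    """Fit a decision tree and print its metrics on the training rows."""
    clf = DecisionTreeClassifier(random_state=0, **tree_params)
    clf.fit(X, y)
    pred = clf.predict(X)  # same rows as used for fitting, on purpose
    print(classification_report(y, pred))
    print(confusion_matrix(y, pred))
    return clf


# Placeholder data standing in for the indicator columns and the signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] + 0.1 * rng.normal(size=200) > 0, 1, -1)
clf_demo = fit_and_report(X_demo, y_demo)
```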

              precision    recall  f1-score   support

        -1.0       0.85      0.90      0.87      1178
         1.0       0.94      0.91      0.92      2035

    accuracy                           0.91      3213
   macro avg       0.90      0.90      0.90      3213
weighted avg       0.91      0.91      0.91      3213

[[1058  120]
 [ 185 1850]]

Woohoo, we will be rich!! 0.91 accuracy?! As the seasoned Data Scientist in you might have noticed, we completely ignored train/test splits, therefore massively overfitting our model and getting these false results.

Why are we doing it then? We just want to use the model to create trading ideas/observations, and not use the model to .predict() something.

This is why we are going to limit the depth and leaves in the next steps, to reduce the model down to “general” and basic trading ideas.

One reason it performs quite well right now is that the tree contains very specific rules such as “if price == 30.8: buy”, which are not at all realistic for real-time prediction. This is because the model “knows” the future, and can easily say “yeah, on the 20.6.12 we will get an uptrend, so go all in”.

Generalizing and “trimming down” the decision tree

We will generalize and “trim” the decision tree using two techniques:

  1. Grabbing data from different stocks -> This is how we will avoid specific assumptions for a stock, like “if volume > 32000: buy”
  2. Limiting the depth and leaves of the decision tree

This will hurt performance, but the objective of this post is to find generalized models which can be represented by brief “if/else” statements, and can tolerate unknown market situations quite well.

The bigDf will contain data on a lot of stocks. If you want to build a resistant model you should include all kinds of stocks. I was lazy and just included tech stocks I like.
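Stacking several stocks into one frame can be sketched as below. The function build_big_df and its loader argument are hypothetical; the loader would be whatever returns a labeled frame for one ticker, so that trend labels are computed per stock before the frames are concatenated.

```python
from typing import Callable, Iterable

import pandas as pd


def build_big_df(tickers: Iterable[str],
                 loader: Callable[[str], pd.DataFrame]) -> pd.DataFrame:
    """Load every ticker with `loader` and stack the rows into one frame."""
    frames = []
    for ticker in tickers:
        df = loader(ticker).copy()
        # Kept for bookkeeping only; drop this column before fitting.
        df["ticker"] = ticker
        frames.append(df)
    # A fresh integer index avoids duplicate dates across stocks.
    return pd.concat(frames, ignore_index=True)
```

With a getData function at hand, a call could look like build_big_df(["MSFT", "AAPL"], lambda t: getData(t, date(2010, 1, 1))), where the two tickers are placeholders rather than the list used in the article.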

Next let us create a function that “trims down” the decision tree

We basically favor smaller trees: a smaller tree is accepted as long as its simulated trading win still reaches about 95% of the best win seen so far.
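One possible reading of that rule is sketched below; it is my interpretation, not the notebook's code. Candidates are tried from largest to smallest (by depth, then leaves), and score_fn is a hypothetical callable that returns the simulated win for a fitted model.

```python
from itertools import product

from sklearn.tree import DecisionTreeClassifier


def trim_tree(X, y, score_fn, depths, leaves, tolerance=0.95):
    """Return (depth, leaves, win, model) of the smallest acceptable tree."""
    best_win = None
    chosen = None
    # Walk from the largest candidate to the smallest one.
    for depth, n_leaves in sorted(product(depths, leaves), reverse=True):
        model = DecisionTreeClassifier(
            max_depth=depth, max_leaf_nodes=n_leaves, random_state=0
        )
        model.fit(X, y)
        win = score_fn(model)
        # Accept the smaller tree if it keeps ~95% of the best win so far.
        # (The comparison assumes that the best win is positive.)
        if best_win is None or win >= tolerance * best_win:
            chosen = (depth, n_leaves, win, model)
        if best_win is None or win > best_win:
            best_win = win
    return chosen
```

In the article's setting, score_fn would wrap the trading simulation, for example lambda m: oneRun(bigDf, m).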

the best depth is 15, best leave 70. with a win of 31677.42$

The accuracy of the model is still 0.88, which is not too bad!

Next, let us use graphviz to visualize our decision tree:

For visualization purposes, I limited the model to depth=10 and leaves=10
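A sketch of the export step, assuming a fitted classifier and a list of feature names. The toy data and the helper name tree_to_dot are placeholders of mine; rendering the DOT text to an image additionally needs the graphviz Python package and the Graphviz binaries.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def tree_to_dot(clf, feature_names) -> str:
    """Return the fitted tree as Graphviz DOT source."""
    return export_graphviz(
        clf,
        out_file=None,  # return the DOT text instead of writing a file
        feature_names=list(feature_names),
        class_names=[str(c) for c in clf.classes_],
        filled=True,
        rounded=True,
    )


# Toy data standing in for the real indicator columns.
X_toy = np.array([[10.0, 1.0], [20.0, 5.0], [40.0, 2.0], [80.0, 7.0]])
y_toy = np.array([1, 1, -1, -1])
clf_toy = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=10, random_state=0)
clf_toy.fit(X_toy, y_toy)
dot = tree_to_dot(clf_toy, ["rsi", "volume"])

# With the graphviz package installed, the image can be rendered with:
#   import graphviz
#   graphviz.Source(dot).render("tree", format="png")
```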

Decision tree result with depth 10 and leaves 10

Nice! As we can see, many of the splits use generalized rather than stock-specific metrics, such as volatility, the Aroon trend indicator and others.

If you would want to decide according to this logic, you would

  1. Check the Aroon trend indicator. If it is less than or equal to 10.0, go down the left path; if not, go down the right path
  2. If you went down the left path, the likeliest decision is “sell”/”downtrend”, which can be further narrowed down by looking at “volatility_dcp” …

Let the magic happen: Transform a decision tree to if/else statements

Now let us do some magic and convert the decision tree to if/else statements!
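One way to do the conversion is to walk the fitted tree's internal arrays and emit Python source. This is a sketch under my own assumptions: the generated function is named getDecision, takes a row that can be indexed by feature name, and returns the class index (0 for the first class, 1 for the second) rather than the original label. The toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def tree_to_code(clf, feature_names, func_name="getDecision") -> str:
    """Turn a fitted decision tree into the source of an if/else function."""
    tree = clf.tree_
    lines = [f"def {func_name}(row):"]

    def walk(node, depth):
        pad = "    " * depth
        if tree.children_left[node] == -1:  # -1 marks a leaf
            # Index of the majority class in this leaf (0 or 1).
            klass = int(np.argmax(tree.value[node][0]))
            lines.append(f"{pad}return {klass}")
            return
        name = feature_names[tree.feature[node]]
        threshold = float(tree.threshold[node])
        lines.append(f"{pad}if row[{name!r}] <= {threshold!r}:")
        walk(tree.children_left[node], depth + 1)
        lines.append(f"{pad}else:")
        walk(tree.children_right[node], depth + 1)

    walk(0, 1)
    return "\n".join(lines)


# Toy data standing in for the real indicator columns.
X_toy = pd.DataFrame({"rsi": [10, 20, 25, 40, 60, 80, 15, 70],
                      "vol": [1, 5, 2, 7, 3, 8, 9, 4]})
y_toy = np.where(X_toy["rsi"] < 30, 1, -1)
clf_toy = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
source = tree_to_code(clf_toy, list(X_toy.columns))
print(source)
```

The printed source can be pasted into a plain Python file, which is what makes the resulting bot independent of sklearn at decision time.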

The result will look like this:

Cool! Keep in mind that you still need to convert the “0” to “-1”/”sell”/”downtrend”.

Congrats, this will be a tradingbot that will be hyper-efficient without needing any libraries (once you have the getData() )!

Meaning if you want to get a trading decision for the current data, you can simply run

from datetime import date

# Fetch recent data and decide based on the latest row.
df = getData("BTC-USD", date(2022, 5, 1))
decision = getDecision(df.iloc[-1])
if decision == 1:
    print("i say buy!")
elif decision == 0:
    print("i say sell!")

This is just for statistical purposes, do not form your trading decisions based on this data!

If you liked this article leave a like and comment, if I get enough feedback I will show you how to build bots that are actually performing better using “lookback” windows and “sanity checks” on the decisions.

Visit my website https://datafortress.cloud/ for more, or find the whole Jupyter notebook at https://github.com/JustinGuese/tradingbot22-tradingbots/blob/main/jupyternotebooks/decision-tree-tradingbot.ipynb

Justin Güse

Justin Güse is the CEO of DataFortress.cloud. He loves Big-Data, Cloud and Virtual- and Augmented Reality, as well as books on self-reflecting in general.