Handling Monetary Ledgers with Pandas, Python

This article is also available as a real notebook

A ledger is a table that records monetary trasactions, as credit or debit, and is always indexed in a chronological way, by date and time. It is used since ancient times to log account transactions and tell and reconstruct the history of events on that account.

Ledger is a fundamental data management tool implemented in systems of banks and financial institutions. Before computer systems, banks already used ledgers as regular paper books to manage their clients accounts.

An ancient ledger. Source: Wikipedia

Here is an example of a ledger recording transactions that happened on an account across 5 years. Positive transaction values are credits to that account, while negative values are debits or withdrawals.

Ledger example

Let’s use code to create our exemplary ledger:

import pandas as pd
The ledger DataFrame as displayed by Jupyter

Although they might look like ones, ledgers are not what we use to call time series. A ledger holds a lot of semantic but we’ll have to convert it into a time series in order to make it useful in calulations and plots or even predictions. The ledger seen as a time series is a table with the total balance at that point in time. For that we’ll use Panda’s cumsum() function that computes the cumulative sum — also known as balance — at each data point.

ledger_as_timeseries = ledger[['value']].cumsum()

Don’t forget to put in chronological order before making calculations with cumsum(), otherwise we'll get completely meaningless results. Our example is already sorted, but just in case:

balance = (
ledger[['value']]
.sort_index()
.cumsum()
.rename(columns={'value': 'balance'})
)
The historical balance computed from ledger facts

This table is now a genuine time series. It conveys the ups and down of the balance across those 5 years of account history. It says that after its inception in 2015, with a credit of 100, it received more values but had withdrawals too, had a big credit in May 2019 and finished 2020 with a balance of 1605. Can we plot it? Let’s try (even if we are not ready yet).

balance.plot()
First attempt to plot the balance as a ragged time series

This plot is wrong. Balance in 2015 did not grow continuosly as the line shows. Ledger data shows that balance was actually flat for most of the year of 2015, with balance of 100, and then abruptly went to 300 in 2016–02.

This bias happens because our time series (and ledger) has no regular frequency. Data is disposed as events happened, in random dates, which are not in a daily or monthly basis. We have a ragged time series and we’ll have to turn it into a regular one, at least to corretly plot it.

Pandas has outstanding tools to handle time. Some of those are resample() and pandas.Resampler which has the ability to put a discrete time series in an abstract state where time is linear and then automatically manipulate the discrate values and infer what would be their linear incarnation. Then we'll use the asfreq() method to convert them back into something useful for plots.

# daily frequency, but could be `1m` for
# monthly, or `1y` for yearly, or `1w`
# or `3m`...
frequency='1d'
A resampled time series with daily entries

This result is now much bigger because it was resampled into a daily time series (frequency='1d'). The space in between was filled with useless NaNs because we didn't tell resample() how to infer the empty space. For a fixed frequency table with daily balance, sorted in ascending order, we need pandas.Resampler’s pad() or ffill(). Use backfill() if your table is sorted in descending order. These methods also already encapsulate what asfreq() does, so we won't need it anymore.

frequency='1d'
balance.resample(frequency).ffill()
A resampled time series with daily data filled as copies of chronologically last known data

A lot of duplicate data. But at least we are now ready to plot it correctly.

frequency='1d'
balance.resample(frequency).ffill().plot()
Correct plot of the balance that was resampled and adjusted from a cumulative sum of the original ledger

Much better! Because it is now correct.

Handling multiple events at the same date and time

The methods above will fail if you have multiple events at the same date. In Pandas terminology, you can’t do resampling if you have 2 or more rows with same index value. Let’s add an event to our ledger on a day that already has an event, 2017–10–14, and see how to handle that:

ledger=ledger.append(
pd.DataFrame(
index=pd.DatetimeIndex(['2017-10-14']),
data={
'value': -20,
'comment': 'some transaction'
}
)
).sort_index()
Our ledger augmented with a data point on 2017–10–14

To have 2 events on “2017–10–14” technicaly means for Pandas’ Resampler class that they both happened at the same second, which is “2017-10-14 00:00:00”. Resampler doesn't know the actual order they really happened. Turns out that intra-day chronology doesn't matter for a balance sheet with resolution as high as 1 day. So we'll add some random number of seconds to each date in the index, just to make them unique. We'll use NumPy for that.

import numpy as np
The precise_ledger DataFrame

Now our pandas.DatetimeIndex has unique values while still keeping the semantics of the date the transactions happened.

Let me know if you have a better way to achieve same results.

Let’s plot, even if visually it will be very similar:

frequency='1d'

Run intermediate parts of the above one-liner in your notebook to understand what is happening in between. A recap:

  • We start with only the value parts of our ledger, excluding textual comments
  • Sort the index to have meaningful results
  • Calculate cumulative sum — a.k.a. balance — in the sorted DataFrame to convert a ledger into a time series
  • Rename the “value” column to “balance” just to keep up with the correct meaning
  • Use resample() to convert the discrete time series into a more linear one, with resolution of 1 day
  • Fill the dates that have no data with the last known balance, a forward fill operation
  • Plot it

Put all this procedure inside a useful function

Note that we are permanently modifying the ledger index so it will be in sync with the generated balance. This is a bad practice for real production code, but makes this article shorter.

def balance_from_ledger(ledger, column='value', frequency=None):
# Increase index presicion converting it from plain
# date to a random time in the middle of that date.
# This is a technique to avoid duplicate index
# points while keeping the meaning of the transaction
# happened at that day.
# We need this to make resampling work as expected.

Put transactions and balance side by side

This can be achieved with a simple join:

balance = balance_from_ledger(ledger,'value')
ledger.join(balance)[['value','balance','comment']]
Transactions, balance and comment

Compute balance yearly growth rate

Investment applications are interested into growth rate which is a number that must be looked as a percentage and explains how much a monetary value growed or shrinked between 2 dates. It is defined by the simple formula

having 𝑑 as the current period.

So you can calculate balance of this period having the previous period and the growth rate:

First lets get a yearly perspective of balance:

yearly_balance = balance_from_ledger(
ledger,
column='value',
frequency='1y'
)
The balance on the last day of each year

This technique consists of putting current and previous balance in the same row and then do a simple math operation on that row. We’ll use Panda’s shift() and join() for that.

(
yearly_balance
.shift(1, fill_value=0)
.rename(columns={'balance': 'previous_balance'}))
)
The balance of the previous period of what the index says
(
yearly_balance
.join(
yearly_balance
.shift(1, fill_value=0)
.rename(columns={'balance': 'previous_balance'})
)
)
A utility table required to calculate growth per row

All together in one shot with a fluent interface to calculate growth rate:

yearly_balance.join(

# Create a table that has current and
# previous balance on each year

yearly_balance
.shift(1, fill_value=0)
.rename(columns={'balance': 'previous_balance'})

).apply(
# Our growth formula goes here:

func=lambda year: year['balance']/year['previous_balance']-1,
axis=1

).rename('balance_growth')

The result is just a Series, but a very meaningful one:

2015-12-31         inf
2016-12-31 1.700000
2017-12-31 -0.259259
2018-12-31 1.400000
2019-12-31 2.468750
2020-12-31 -0.048048
Freq: A-DEC, Name: balance_growth, dtype: float64

We read this like:

  • Balance on year end 2016 (270) increased 170% from year end 2015 (100)
  • Balance of year end 2017 (200) has shrinked 26% when compared to previous year (270)
  • And so on

To compute growth rate for monthly or weekly regular periods (or any other period) you just start with a different resampling. Lets do it in one block for a weekly view. The original output will have most of its balance_growth==0 because balance doesn't change too frequently in this 5-year dataset, so we'll filter to show only the weeks that actually have growth using query('balance_growth != 0'). Additionally, we'll put growth side by side with the weekly balance with a second join().

weekly_balance=balance_from_ledger(
ledger,
column='value',
frequency='1w'
)
The account balance on a weekly basis along with the growth rate when compared to previous period

Pandas is an outrageously powerful library and I'm glad it is free and available to anyone to use.

I'm an experienced hacker focusing lately on Data Science and Data Engineering. I've also done a lot of customer-facing IT activities, programming, teaching etc