mildbyte's notes tagged 'tech'

(notes) / (tags) / (streams)

View: compact / full
Time: backward / forward
Source: internal / external / all
Public RSS / JSON

(show contents)

Introducing Splitgraph: a diary of a data scientist

@mildbyte (link) 6 months, 4 weeks ago | programming | work | tech | splitgraph |

It's been two years since project Morrowind (which apparently now has been made an official speedrun category). During that time, I've been working on another exciting project and it's time to finally announce it.

TL;DR / executive summary

Today, we are delighted to launch Splitgraph, a tool to build, extend, query and share datasets that works on top of PostgreSQL and integrates seamlessly with anything that uses PostgreSQL. It brings the best parts of Git and Docker, tools well-known and loved by developers, to data science and data engineering, and allows users to build and manipulate datasets directly on their database using familiar commands and paradigms.

Splitgraph launches with first-class support for multiple data analytics tools and access to over 40000 open government datasets on the Socrata platform. Analyze coronavirus data with Jupyter and scikit-learn, plot nearby marijuana dispensaries with Metabase and PostGIS or just explore Chicago open data with DBeaver and do so from the comfort of a battle-tested RDBMS with a mature feature set and a rich ecosystem of integrations.

Feel free to check out our introductory blog post the frequently asked questions section!

A diary of a data scientist

Let's consider the life of a more and more prominent type of worker in the industry: a data scientist/engineer. Data is the new oil and this person is the new oil rig worker, plumber, manager, owner and operator. This is partially based on my own professional experience and partially on over-exaggerated horror stories from the Internet, just like a good analogy should be.

Day 1

Came to work to a small crisis: a dashboard that our marketing team uses to help them direct their door hinge (yes, we sell door hinges. It's a very niche but a very lucrative business) sales efforts is outputting weird numbers. Obviously, they're not happy. I had better things to do today but oh well. I look at the data that populates the dashboard and start going up the chain through a few dozen of random ETL jobs and processes that people before me wrote.

By lunchtime, I trace this issue down to a fault in one of our data vendors (that we buy timber price data from: apparently it's a great predictor of door hinge sales): overnight, they decided to change their conventions and publish values of 999999 where they used to push out a NULL. I raise a support ticket and wait for it to be answered. In the meantime, I enlist a few other colleagues and we manage to repair the damage, rerunning parts of the pipeline and patching data where needed.

Day 2

Support ticket still unanswered (well, they acknowledged it but said they are dealing with a sudden influx of support tickets, I wonder why) but at least we have a temporary fix.

In the meantime, I start work on a project that I wanted to do yesterday. I read a paper recently that showed that another predictor of door hinge sales is council planning permissions. The author had scraped some data from a few council websites and made the dataset available on his Web page as a CSV dump. Great! I download it and, well, it's pretty much what I expected it to be: no explanation of what each column means and what its ranges are. But I've seen worse. I fire up my trusty Pandas toolchain and get to work.

By the evening, there's nothing left of the old dataset: I did some data patching and interpolation, removed some columns and combined some other ones. I also combined the data with our own historical data for door hinge sales in given postcodes. In conjunction with this data, the planning permission dataset indeed gives an amazing prediction accuracy. I send the results to my boss and go home happy.

This is the happiest I'll be this week.

Day 3

The timber sales data vendor has answered our support ticket. In fact, our query made them inspect the data closer at which point they realised they had some historical errors in the data which they decided to rectify. The problem was that they couldn't send us just the rows that were changed and instead linked us to an SQL dump of the whole dataset.

I spend the rest of the day downloading it (turns out, there's a lot of timber around) and then hand-crafting SQL queries to backfill the data into our store as well as all the downstream components.

In the meantime, marketing together with my boss has reviewed my results and is really excited about council planning permission data. They would like to put it into production as soon as possible.

Day 4

I send some e-mails to the author of the paper to find out how they generated the data and if they would be interested in sharing their software, whilst also trying to figure out how to plumb it into our pipeline so that the projections can make it into their daily reports.

Boss is also unhappy about our current timber data vendor and is wondering if I could try out a dataset provided by another vendor. Easier said than done, as now I have to somehow reproduce the current pipeline on my machine, swap the new dataset in and rerun all our historical predictions to see how they would have fared.

Day 5

The council planning permission data project is probably not happening. Firstly, it's because the per-postcode sales data that I used in my research is in a completely different database engine that we can't directly import into our prediction pipeline. But in worse news, the author of the paper doesn't really remember how he produced the data and whether his scraping software still works.

After a whole day of searching, I did manage to find a data vendor that seems to be doing similar things, with no pricing data, no nothing. I drop them an e-mail and go home.

A diary of a software engineer

Day 1

Come to work to learn about an overnight production issue that the operators managed to mitigate but now I have to actually fix. Oh well. I get the tag of the container we had running in production and do a docker pull. I fire it up locally and use a debugger (and the container's Dockerfile) to locate the issue: it's a bug in an open-source library that we're using. I do a quick scan of GitHub issues to see if it's been reported before. Nope. I raise an issue and also submit a pull request that I think should fix it.

In the meantime, the tests I run locally for that library pass with my fix so I change the Dockerfile to build the image from my patched fork. I do a git push on the Dockerfile, our CI system builds it and pushes the image out to staging. We redirect some real-world traffic to staging and it works. We do a rolling upgrade of the prod service. It works.

I spend the rest of the day reading Reddit.

Day 2

Github issue still unanswered, but we didn't have any problems overnight anyway.

I have some more exciting things to do: caching. Some guys from Hooli have open-sourced this pretty cool load-balancing and caching proxy that they wrote in Go and it fits our use case perfectly for an internal service that has always had performance issues.

They provide a Docker container for their proxy, so I quickly get the docker-compose.yml file for our service, add the image to the stack and fiddle around with its configuration (exposed via environment variables) to wire it up to the service workers. I run the whole stack up locally and rerun the integration tests to hit the proxy instead. They pass, so I push the whole thing out to staging. We redirect some requests to hit the staging service in order to compare the performance and correctness.

I spend the rest of the day reading Reddit.

Day 3

The Github issue has been answered and my PR has been accepted. The developer also found a couple of other bugs that my fix exposed which have also been fixed now. I change our service to build against the latest tag, build on CI, tests pass.

I look at the dashboards to see how my version of the service did overnight: turns out, the caching proxy reduced the request latency by about a half. We agree to push it to prod.

I spend the rest of the week reading Reddit.

State of the art in software engineering

There is a lot of tools, workflows and frameworks in software engineering that made developers' lives easier and that paradoxically haven't been applied to the problem of data processing.

Version control and images

In software, you do a git pull and bring the source code up to date by having a series of diffs delivered to you. This ability to treat new versions as patches on top of old versions has opened up more opportunities like rebasing, pull requests and branching, as well as inspecting history and merge conflict resolution.

None of this exists in the world of data. Updating a local copy of the dataset involves downloading the whole image again, which is crazy. Proposing patches to datasets, having them applied and merging several branches is unspoken of and yet is a common workflow in data science: why can't I maintain a fork of data from a vendor with my own fixes on top and then do an occasional git pull --rebase to have my fork up to date?

In software, we have learned to use unique identifiers to refer to various artifacts, be it Git commit hashes, Docker image hashes or library version numbers. When someone says "there's a bug in version 3.14 (commit 6ff3e105) of this library", we know exactly which codebase they refer to and how we can get and inspect it.

This doesn't happen with data pipelines: most of the time we hear "the data we downloaded last night was faulty but we overwrote chunks of it and it's propagated downstream so I've no idea what it looks like now". It would be cool to be able to refer to datasets as single, self-contained images and for any ETL job to be just a function between images: if it's given the same input image, then it will produce the same output image.

Bringing development and production closer

To expand on that, Docker has made this "image" abstraction even more robust by packaging all of the dependencies of a service together with that service. This means that this container can be run from anywhere: on a developer's machine, on a CI server, or in production. By giving the developers tools that make replicating the production experience easier, we have decreased the distance between development and production.

I used to work in quant trading and one insight I got from that is that getting a cool dataset and finding out that it can predict the returns on some asset is only half of the job. The other half, less talked about and much more tedious, is productionizing your findings: setting up batch jobs to import this dataset and clean it, making sure the operators are familiar with the import process (and can override it if it goes wrong), writing monitoring tools. There's the ongoing overhead of supporting it.

And despite that, there is still a large distance between research and production in data science. Preparing data for research involves cleaning it, importing it into say Pandas, figuring out what every column means, potentially hand-crafting some patches. This is very similar to old-school service set up: do a sudo apt-get install of the service, spend time setting up its configuration files, spend time installing other libraries and by the end of the day don't remember exactly what you did and how to reproduce it.

Docker made this easier by isolating every service and mandating that all of its dependencies (be it Linux packages, configuration files or any other binaries) are specified explicitly in a Dockerfile. It's a painful process to begin with but it results in something very useful: everyone now knows exactly how an image is made and its configuration can be experimented on. One can swap out a couple of apt-get statements in a Dockerfile to install, say, an experimental version of libc and get another version of the same service that they can compare against the current one.

In an ideal world, that's what would happen with data: I would write a Dockerfile that grabs some data from a few upstream repositories, runs an SQL JOIN on tables and produces a new image. Even better, I should be able to have this new image kept up to date and rebase itself on any new changes in the upstream. I should be able to rerun this image against other sources and then feed it into a locally-run version of my pipeline to compare, say, the prediction performance of the different source datasets.

Faster prototyping

We are slowly coming to a set of standards on how to distribute software which has reduced onboarding friction and allowed people to quickly prototype their ideas. One can do a docker pull, add an extra service to their software stack and run everything locally to see how it behaves within minutes. One can search for some software on GitHub and git clone it, knowing that it probably has fairly reproducible build instructions. Most operating systems now have package managers which provide an index of software that can be installed on that system as well as allow the administrator to keep those packages up to date.

There is a ton of open data out there, with a lot of potential hidden value, and most of it is unindexed: there's no way to find out what it is, where it is, who maintains it, how often it's updated and what does each column mean. In addition, all of it is in various ad hoc formats, from CSVs to SQL dumps, from HDF5 files to unscrapeable PDF documents. For each one of these datasets, an importer has to be written. This raises the friction of onboarding new datasets and innovating.

Less opinionated tools

One thing that Git and Docker are popular for is that they're unopinionated: they don't care about what is actually being versioned or run inside of the container. If Git only worked with a certain folder structure or required one to execute system calls to perform checkouts or commits, it would never have taken off. That is, git doesn't care whether what it's versioning is written in Go, Java, Rust, Python or is just a text file.

Similarly with Docker, if it only worked on artifacts produced by a certain programming language or required every program to be rewritten to use Docker, that would slow down adoption a lot if not outright kill the tool.

Both of these tools build up on an abstraction that has been around for a while and that other tools use: the filesystem. Git enhances tools that use the filesystem (such as the IDE or the compiler) by adding versioning to the source code. Docker enhances the applications that use the filesystem (that is, all of them) by isolating them and presenting each one with its own version of reality.

Such an abstraction also exists in the world of data: it's SQL. A lot of software is built on top of it and a lot of people, including non-technical ones, understand SQL and can write it. And yet most tools around want users to learn their own custom query language.

Towards composable, maintainable, reproducible data science

All these anecdotes and comparisons show that there are a lot of practices that data scientists can borrow from software engineering. They can be combined into three core concepts:

  • Composability: Much like with Docker, where it's easy to take existing containers and extend them or compose multiple containers into a single service, it should be straightforward to extend or apply patches to datasets or create derivative datasets. Most of the hidden value in data comes from joining multiple, sometimes seemingly unrelated datasets together.
  • Maintainability: Coming up with a predictive model by doing some exploratory work in a Jupyter notebook is only half of the battle. Maintaining the data pipeline, keeping the derivative data up to date and being able to quickly locate, fix and propagate fixes to issues in upstream data should be made easier. What-if analyses should be a matter of changing several lines of configuration, not manually tracking changes through several layers of ETL jobs. Updating data should be done by git pull, not by downloading a new data dump and rerunning one's import scripts.
  • Reproducibility: The same Dockerfile and the same build context will result in the same Docker container which will behave the same no matter where it is. One should always know how to rebuild a data artifact from scratch by only relying on its dependencies and a build recipe and sharing datasets should be as easy as pushing out a new Docker image or a new set of Git commits.

Over the past two years, I and Miles Richardson have been building something in line with this philosophy.

Splitgraph is a data management tool and a sharing platform that is inspired by Docker and Git. It currently is based on PostgreSQL and allows users to create, share and extend SQL schema images. In particular:

  • It supports basic git-like operations, including commit, checkout, push, pull etc to produce and inspect new commits to a database. Whilst an image is checked out into a schema, it's no different from a set of ordinary PostgreSQL tables: any other tool that speaks SQL can interact with it, with changes captured and then packaged up into new images.
  • It uses a Dockerfile-like format for defining images, called Splitfiles. Much like Dockerfiles, it uses command-based image hashing so that if the execution results in an image that already exists, the image will be checked out instead of being recreated. Furthermore, the Splitfile executor does provenance tracking so that any image created by it has its dependencies recorded and can be updated when the dependencies update.
  • Data ingestion is done by either writing to the checked-out image (since it's just another PostgreSQL schema) or using Postgres Foreign Data Wrappers that allow to mount any database (currently we include the open-source PostgreSQL, MySQL and MongoDB FDWs in our engine as well as an FDW for the Socrata Open Data platform) as a set of local SQL tables.
  • There's more exciting features designed to improve data ingestion, storage, research, transformation and [].

We have already done a couple of talks about Splitgraph: a short one at a local Docker meetup in Cambridge (slides) talking about parallels between Docker and Splitgraph and a longer one at a quantitative hedge fund AHL (slides) discussing the philosophy and the implementation of Splitgraph in-depth. A lot has changed since then but it still is a good introduction to our philosophy.

Interested? Head on to our quickstart guide or check out or the frequently asked questions section to learn more!

An e-mail template for entrepreneurs investigating an industry

@mildbyte 2 years, 9 months ago | thoughts | tech | entrepreneurship |

This is a quick template for a cold e-mail that can be used for initial reconnaissance and information gathering for a software engineer/entrepreneur that would like to break into a new industry and doesn't know where to start. Feel free to use it and alter it as you see fit!

Dear (name of CEO),

My name is (name) and I'm an entrepreneur based in (city) with (n) years of experience crafting software products for (Google/Facebook/Apple/Amazon/Microsoft).

As someone with the gift of an analytical mind that leaves nothing unexamined, I believe I am uniquely positioned to quickly get up to speed with the domain knowledge intrinsic to a given field that would take normal people years to acquire. I have hence been interested in creating a business that solves some of the unique problems that (industry) faces. I was wondering if you had a few minutes to answer some questions so that I can find out what these problems actually are?

Firstly, I'm interested in the business processes in your day-to-day work that have the potential to be replaced with a CRUD application. Do you use an Excel spreadsheet for any of your operating procedures or business intelligence? Do you think there is scope for migrating some of those in-house systems to a software-as-a-service platform so that you can focus on your business' competitive advantages?

Secondly, I am a big proponent of distributed ledger technology as an alternative to single-point-of-failure classical databases in this increasingly more trustless society. Would you consider replacing part of your data storage, inventory tracking or other business operations with a custom-tailored blockchain-based solution that leverages smart contracts to ensure a more robust experience for all stakeholders?

Finally, do you think running your organisation could be made easier with a bespoke shared economy service that empowers workers to flexible time and lowers your human resources overhead while allowing you to tap into an immensely larger pool of workforce? This is an innovation that has successfully added value to taxis, hotels and food delivery and I firmly believe that there is a unique proposition in applying it to (industry).

I look forward to hearing from you.
(my name)

Sent from my (device)

Yes, this is satire. But if you remove the over-the-top buzzword soup, the messiah complex and flavour-of-the-month technology, it really doesn't seem like there's much a person without any connections or experience in an industry can do besides cold-emailing people and asking them "what do you use Excel for?"

You're probably not going to get rich in the stock market

@mildbyte 2 years, 9 months ago | python | thoughts | tech | finance |

source: XKCD

Abstract: I examine inflation-adjusted historical returns of the S&P 500 since the 1870s with and without dividends and backtest a couple of famous investment strategies for retirement. I also deploy some personal/philosophical arguments against the whole live-frugally-and-then-live-frugally-off-of-investments idea.

Disclaimer: I'm not (any more) a finance industry professional and I'm not trying to sell you investment advice. Please consult with somebody qualified or somebody who has your best interests at heart before taking any action based on what some guy said on the Internet.

The code I used to produce most of the following plots and process data is in an IPython notebook on my GitHub.


Early retirement is simple, right? Just live frugally, stop drinking Starbucks lattes, save a large fraction of your paycheck, invest it into a mixture of stocks and bonds and you, too, will be on the road towards a life of work-free luxury and idleness driven by compound interest!

What if there's a stock market crash just before I retire, you ask? The personal finance gurus will calm you down by saying that it's fine and the magic of altering your bond position based on your age as well as dollar cost averaging, together with the fact that the stock market returns 7% on average, will save you.

As sweet as that would be, there is something off about this sort of advice. Are you saying that I really can consistently make life-changing amounts of money without doing any work? This advice also handwaves around the downside risks of investing into the stock market, including the volatility of returns.

I wanted to simulate the investment strategies proposed by personal finance and early retirement folks and actually quantify whether putting my "nest egg" into the stock market is worth it.

This piece of writing was mostly inspired by NY Times' "In Investing, It's When You Start And When You Finish" that shows a harrowing heatmap of inflation-adjusted returns based on the time an investment was made and withdrawn, originally created by Crestmont Research. They maintain this heatmap for every year here.

This post is in two parts: in the first one, I will backtest a few strategies in order to determine what sort of returns and risks at what timescales one should expect. In the second one, I will try to explain why I personally don't feel like investing my money into the public stock market is a good idea.

Simulating common stock market investment strategies


The data I used here is the S&P 500 index. and I'm assuming one can invest into the index directly. This is not strictly true, but index tracker funds (like Vanguard's VOO ETF) nowadays are pretty good and pretty cheap.

A friend pointed me to a paper that has an interesting argument: using the US equity markets for financial research has an implicit survivorship bias in it. Someone in the 1870s had multiple countries' markets to choose from and had no way of knowing that it was the US the would later become a global superpower, large amounts of equity gains owing to that.

As a first step, I downloaded Robert Shiller's data that he used in his book, "Irrational Exuberance", and then used it to create an inflation-adjusted total return index: essentially, the evolution of the purchasing power of our portfolio that also assumes we reinvest dividends we receive from the companies in the index. Since the companies in the index are large-cap "blue chip" stocks, dividends form a large part of the return.

I compared the series I got, before adjusting for inflation, with the total return index from Yahoo! Finance and they seem to closely match the Yahoo! data from the 1990s onwards.

The effect of dividends being reinvested changes the returns dramatically. Here's a comparison of the series I got with the one without dividends (and one without inflation):

The average annual return, after inflation, and assuming dividends are reinvested, is about 7.5%. Interestingly, this seems to contradict Crestmont Research's charts where the average annual pre-tax (my charts assume there are no taxes on capital gains or dividends) return starting from 1900 is about 5%.

On the other hand, the return with dividends reinvested does seem to make sense: without dividends, the annualized return is about 4.25%, which implies a dividend yield close to the actual observed values of about 4.4%.

Another observation is that the returns are not normal (their distribution has statistically significant differences in kurtosis and skewness).

Common strategies

Lump sum investing

First, I wanted to plot the annualized returns from investing in the S&P 500 at a given date and holding the investment for a certain number of years.

The result kind of confirms the common wisdom that the stock market is a long term investment and is fairly volatile in the short term. If one invested in the late 1990's, say, and withdrew their investment in 5 or even 10 years, they would have lost about 4% every year after inflation. Only with investment horizons of 20 years do the returns finally stabilise.

Here are the distributions of returns from investing a lump sum into the S&P 500. This time they were taken from 10000 simulations of paths a given portfolio would follow (by randomly sampling monthly returns from the empirical returns' distribution). I also plotted the inflation-adjusted returns from holding cash (since it depreciates with inflation) for comparison.

What can be seen from this is that in 20 or 30 years, it seems possible to double or quadruple one's purchasing power.

I also plotted a set of "hazard curves" from those sampled distributions. Those are the simulated probabilities of getting less than a given return depending on the investment horizon. For example, there's a 30% chance of getting a return of less than 0% after inflation (losing money) for a 1 year investment and this chance drops down to about 5% for a 20 year investment. Conversely, there's a 100% chance of getting a return of less than 300% (quadrupling the investment), but after 20 years there's a 50% chance of doing so.

Dollar cost averaging

DCA is basically investing the same amount of money at given intervals of time. Naturally, such an investment technique results in one purchasing more of a stock when it's "cheap" and less when it's "expensive", but "cheap" and "expensive" are relative and kind of meaningless in this case.

DCA is usually considered an alternative to lump sum investing, but for a person who is investing, say, a fraction of their paycheck every month it's basically the default option.

I did some similar simulations of dollar cost averaging over a given horizon against investing the same sum instantly.

Unsurprisingly, DCA returns less than lump sum investment in this case. This is because the uninvested cash depreciates with inflation, as well as because the average return of the stock market is positive and hence most of the cash that's invested later misses out on those gains.

DCA would do better in a declining market (since, conversely, most cash would miss out on stock market losses), but if one can consistently predict whether the market is going to rise or decline, they can probably use that skill for more than just dollar cost averaging.

In my tests, dollar cost averaging over 20 years gave a very similar distribution to investing a lump sum for 9 years. Essentially, returns of DCA follow those of investing a lump sum for a shorter period of time.

Finally, here are the "hazard curves" for dollar cost averaging.

After a year of monthly investing we'd have an almost 40% chance of losing money after inflation and even after 20 years we still have a 10% chance. After 20 years, doubling our purchasing power is still a coin toss.

Is it worth it?

Difference in expected utility

What we have gleaned from this is that the long-term yearly return from the stock market is about 7% after inflation (but before taxes), assuming one invests over 20-30 years. Compounded, that maybe results in quadrupling of one's purchasing power (less if one invests a fraction monthly, and even less with dividend and capital gains taxes).

While doubling or trebling one's purchasing power definitely sounds impressive (especially whilst doing only a few hours of work a year!), it doesn't really feel big if you look into it. Let's say you're about to retire and have saved up $1 million (adjusted for inflation) that you weren't investing. If you were putting a fraction of that money monthly into the stock market instead you would have had $2-3 million, say. But you can live as comfortably off of the interest on $1 million as you would on $2-3 million.

And on the contrary, if you have only saved $100k, dollar-cost averaging would have yielded you $300k instead, the interest on both amounts (and in retirement it has to be risk-free or almost-risk free) being fairly small.

One could argue that every little bit helps, but what I'm saying here is that the utility of wealth is non-linear. It's kind of sigmoidal, in a way. I like this graph from A Smart Bear:

source: A Smart Bear

As long as one has enough money to get up that "utility cliff" beyond which they can live off of their investments in retirement, it doesn't matter how much is it. Conversely, saving by investing into the stock market is worth it only if one is very sure that that's the thing that will get them over the line.

Flexibility, predicting the future and barbell investing

This possible climb up the cliff comes at a cost of locking in one's capital for a large amount of time (as short-term volatility in the stock market makes it infeasible to invest only for a few years). This lock-in leaves one completely inflexible in terms of pursuing other opportunities.

One's basically treating the stock market as some sort of a very long-term bond where they put the money in at one end and it comes out on the other side, slightly inflated. There was also an implicit assumption in all these simulations that the future follows the same pattern as the past.

Returns only become consistent after 30 years with dollar cost averaging. Someone who started investing a fraction of their savings into the general stock market 30 years ago would have gotten a 5-7% annualized return. Could they have predicted this? Probably. But could they have predicted the rise of the Web, the smartphones, the dotcom boom and crash? Probably not.

I'm not saying that whatever extra money one has should be invested into Bitcoin, a friend's startup or something else. Money can also be used to buy time off (to get better acquainted with some new development or technology) or education. Is using it to buy chunks of large corporations really the best we can do?

I also like the idea of "barbell investing", coined by Nassim Nicholas Taleb: someone should have two classes of investments. The first one is low-risk and low-return, like bonds or even cash. The second one is "moonshots", aggressive high-risk, low-downside, high-upside investments. Things like the stock market are considered to be neither here nor there, mediocre-return investments that might bear some large hidden downside risks. that you can do what you really love?

There's an argument that one should still save up excessive amounts of money (as investments or just cash) whilst living extremely frugally so that after several years of hard work they "never have to work again" and can retire, doing what they really would have loved to do this whole time.

One of Paul Graham's essays kind of sums up what I think about it:

Conversely, the extreme version of the two-job route is dangerous because it teaches you so little about what you like. If you work hard at being a bond trader for ten years, thinking that you'll quit and write novels when you have enough money, what happens when you quit and then discover that you don't actually like writing novels?

Paul Graham, "How To Do What You Love"

Here I am, young and retired. What do I do next? Anything? How do I know what I like? Do I have any contacts that I can rely upon to help me do that "anything"? I don't feel that toiling away somewhere, being bored for a couple of decades, so that I can then quit and be bored anyway (since I hadn't learned what makes me tick), is a useful life strategy.

Warren Buffett?

What about all those people who did become rich investing into the stock market? Warren Buffett is one of them, probably one of the most famous investors in the world. But he made his first couple of million (in 2018 dollars) working for Benjamin Graham, in salary and bonuses. In essence, if he wanted to, he could have retired right there and then.

Only then did he proceed to increase his wealth by investing (first running an investment partnership and so working with the partnership's money and presumably charging a performance fee, then through Berkshire Hathaway). None of these methods are available to someone with a personal investment account.


Essentially, I think that the stock market is a wealth preservation, not a wealth generation engine. Publicly-listed companies are large and stable, paying consistent and healthy dividends with the whole investment yielding a solid, inflation-beating return.

But for me? I think it's important to stay flexible and keep my options open. If someone offers me an opportunity somewhere, I don't want to say "Sorry, I have a mortgage and the rest of my money is invested in healthy companies with solid, inflation-beating returns that I can't currently sell because it's in drawdown and would you look at the tax advantages". I want to be able to say "Sure, let me give a month's notice to my landlord. Where are we going?"

I've never felt less in control of my own hardware

@mildbyte 2 years, 11 months ago | thoughts | tech | 1 comment


Once upon a time I bought a car. I drove it straight out of the dealership and parked it on my driveway. This is a thought experiment, as youth in the UK can't afford neither a car nor a driveway and I can't drive, but bear with me.

I bought a car and I drove it around for a few months or so and life was great. It did everything I wanted it to do, it never stalled and never broke down. One day, I got an e-mail from the dealership saying that they'd need to update my car. Sure, I thought, and drove there to get it "updated". They didn't let me look at what they were doing with the car, saying that they were performing some simple maintenance. Finally, I got my car back, without seeing any visible changes, and drove it back home.

This happened a few more times over the next several months. I'd get an e-mail from the dealer, drive the car back to them, have a cup of coffee or three while they were working, and then get the car back. Sometimes I'd get it back damaged somehow, with a massive dent on the side, and the dealership staff would just shrug and say that those were the manufacturer's instructions. Then a few days later, I'd get summoned again and after another update cycle the dent would be gone.

At some point, I stopped bothering and after a few missed calls from the dealership my car stopped working completely. I phoned the dealership again and sure, they said that the car wouldn't work until another update. Great. I had the car towed to the dealership and drove back without any issues.

One night I got woken up by a weird noise outside my house. I looked out the window and saw that some people in dark overalls were around my car, doing something to it. I ran out, threatening to call the police. They laughed and produced IDs from the dealership, telling me that since people were frustrated with having to update their cars so often, they'd do it for them in order to bother them less.

I sighed and went back to sleep. This continued for quite some time, with the mechanics working on my car every week or so. I'd invite them for tea and they would refuse, quoting terms of service and all that sort of thing.

One day I woke up to a car covered in ads that seemed to be related to what I was browsing last night. There wasn't anything controversial, thankfully, but it was still a bit unsettling. The dealership support staff said it was to offer me products relevant to my interests. I asked if I could take them down and got told that it was a part of the whole offering and the vehicle wouldn't work without them. So I had to drive to work surrounded by more pictures of the Trivago woman that I was comfortable with.

After a year or so, the manufacturer had innovated some more. When I got into the car and turned the ignition key, a bunch of mechanics appeared seemingly out of nowhere, and picked the car up, ready to drag it as per my instructions. Turns out, the night before they had removed the engine and all the useful parts from the vehicle, turning it into a "thin client". It was supposed to make sure that when there was an issue with the car, they could debug it centrally and not bother all their customers.

Finally, one morning I got into the car and nothing happened. Turns out, the manufacturer was acquired by a larger company last night and their service got shut down. I sat at the wheel, dreading being late for work again, and suddenly woke up.

Meanwhile in the real world

Once upon a time I opened a new tab in Firefox on my Android phone to find out that besides a list of my most visited pages to choose from, there also was a list of things "suggested by Pocket". What the hell was Pocket, why was it suggesting me things and, more importantly, how the hell did it get into my Firefox?

I remember when pushing code to the user's device was a big deal. You'd go to the vendor's website, download an executable, launch it, go through an InstallShield (or a NullSoft) install wizard if you were lucky and only then you would get to enjoy the new software. You'd go through the changelogs and get excited about the bugfixes or new features. And if you weren't excited, you wouldn't install the new version at all.

I remember when I went through my Linux phase and was really loving package managers. It was a breath of fresh air back then, a single apt-get upgrade updating all my software at once. And then Android and Apple smartphones came around and they had exactly the same idea of a centralized software repository. How cool was that?

I'm not sure when mobile devices started installing updates by default. I think around 2014 my Android phone would still meekly ask to update all apps and that was an opportunity for me to reestablish my power over my device and go through all the things I wasn't using anymore and delete them. In 2016, when I got a new phone, the default setting changed and I would just wake up to my device stating, "Tinder has been updated. Deal with it".

And it's been the case on the desktop too. I realised that Firefox and Chrome hadn't been pestering me about updates for quite a while now and, sure, they've been updating themselves. I like Mr Robot, but I like it in my media player, not in my browser.

It's not even about automatic updates. I could disable them, sure, but I would quickly fall behind the curve, with services that I'm accessing stopping supporting my client software. In fact, it's not even about browsers. I'm pretty sure I wouldn't be able to use Skype 4 (the last decent version), that is, if I could find where to download it. As another example, I recently launched OpenIV, a GTA modding tool, at which point it told me "There's a new version available. I won't start until you update". Uuh, excuse me? Sure, I could find a way around this, but still, it's not pleasant being told that what's essentially a glorified WinRAR that was fine the night before can't run without me updating it.

(as an aside, WinRAR now seems to be monetized by having ads on its "please buy WinRAR" dialog window.)

If I go to a Web page, gone are the days of the server sending me some HTML and my browser rendering it the way I wanted. No, now the browser gets told "here's some JavaScript, go run it. Oh, here's a CDN, go download some JavaScript from there and run it, too. Oh, here a DoubleClick ad server, go get some pictures that look like Download buttons from over there and put them over here. Also, here's the CSRF token. If you don't quote at it at me the next time, I'll tell you to go away. Oh yeah, also set this cookie. Oh, and append this hexadecimal string to the URL so that we can track who shared this link and where. The HTML? Here's your HTML. But the JavaScript is supposed to change the DOM around, so go run it. It changes your scrolling behaviour, too, so you get a cool parallax effect as you move around the page. You'll love it."

As a developer, I love web applications. If a customer is experiencing an issue, I don't need to get them to send me logs or establish which version of the software they're running. I just go to my own server logs, see what request they're making and now I am able to reproduce the issue in my own environment. And with updates now getting pushed out to users' devices automatically, there are fewer and fewer support issues which can be resolved by saying "update to the newest version" and instead I can spend time on better pursuits. Finally, I don't need to battle with WinAPI or Qt or Swing or any other GUI framework: given enough CSS and a frontend JavaScript framework du jour, things can look roughly the same on all devices.

However, that leaves users at the mercy of the vendor. What used to be code running on their own hardware is now code running on the vendor's hardware or code that the vendor tells their hardware to run. So when they end up in a place with no Internet connection or the vendor goes out of business, the service won't work at all instead of working poorly.

By the way, here's an idea on how to come up with new business ideas. Look at the most popular Google services and have a quick think about how you would write a replacement for them. When they inevitably get sunsetted, do so and reap kudos. For example, I'm currently writing a browser addon that replaces the "View Image" button from Google Image search results.

In fact, it's not just an issue with applications. Once upon a time in early 2017, I came home from work to find my laptop, that had been on standby, decided to turn itself on and install a Windows 10 update. The way I found that out was because the login screen had changed and it was using a different time format. And then things became even weirder as I realised that all the shortcuts on my desktop and in the Start Menu were missing. In fact, it was almost as if someone broke into my house and reimaged my whole hard drive. Strangely enough, all the software and the data was still there, tucked away in C:\Program Files and its friends, it's just that Windows wasn't aware of it. Thankfully, running a System Restore worked and the next update didn't have these problems, but since then I stopped allowing automatic updates. Except there's no way I can figure out what a given update does anyway.


I'm really scared of my phone right now. Here it is, lying by my side, and I've no idea what's going on in its mind and what sort of information it's transferring and to where. When I was a kid and got gifted my father's old Nokia 6600, I was excited about having a fully fledged computer in my pocket, something with a real multitasking operating system that I could easily write software for. Now, I'm no longer so sure.

Copyright © 2017–2018 Artjoms Iškovs ( Questions? Comments? Suggestions? Contact