1. Your database is probably not designed for the workload
Most databases are designed for a high-read, low-write workload; the exceptions tend not to be traditional databases and are not the focus of this article. If you are using something like MySQL, Postgres, Oracle, or similar, your DB is optimized to support many reads and comparatively few writes. Logging to this type of database is an unnecessary performance hit.
2. Keeping infrequently accessed data in a database is expensive
Either the database is hosted on owned or rented hardware or it is deployed via a cloud service provider. Any way you slice it, putting a lot of infrequently accessed data into the system will balloon costs for storage, RAM, and CPU/IO to support the writes. Other systems can support storing this data more cheaply – look, for example, at Athena/S3 or Redshift Spectrum.
3. Once other systems rely on the location of the data it is hard to move
Analysts will want to build graphs, charts, reports, and make business decisions based on user behavior. If user behavior is primarily recorded in the production database, that is where they will go to find the information. Once a substantial number of reports leverage the data, it will be increasingly difficult to move or remove the data source, and it will require more and more buy-in from various teams. Technically, moving a large amount of semi-structured data to a new warehouse can also be challenging.
4. Security concerns
Most web frameworks support log filtering quite well. Most systems that write logs or events to the database circumvent these utilities. It’s far too easy for passwords, credit card numbers, and other information that shouldn’t be stored in the clear to be written to the database by mistake. Once that information finds its way in, it can become impossible to remove due to backup approaches and the simple dynamics of large tables. Speaking of that…
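This is the kind of safeguard the frameworks give you for free and ad-hoc "log to the DB" code tends to skip. A minimal sketch of the idea in Python, using the standard library's `logging` filters (the field names and regex are illustrative, not any framework's actual filter list):

```python
import logging
import re

# Illustrative pattern: redact values of a few sensitive key=value fields.
SENSITIVE = re.compile(r'(password|credit_card|ssn)=\S+', re.IGNORECASE)

class RedactingFilter(logging.Filter):
    """Scrub sensitive fields from a record before any handler sees it."""
    def filter(self, record):
        record.msg = SENSITIVE.sub(r'\1=[FILTERED]', str(record.msg))
        return True  # keep the record, just scrubbed

logger = logging.getLogger("app")
logger.addFilter(RedactingFilter())
logger.warning("login failed: user=bob password=hunter2")
# handlers receive "login failed: user=bob password=[FILTERED]"
```

A pipeline that inserts raw event rows straight into a table bypasses this entire layer, which is how secrets end up at rest in the clear.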
5. It can be hard to delete a large volume of data from a database
Large deletes can lock tables – tables that break production if they aren’t writable.
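The usual workaround is to delete in small batches so no single transaction holds locks for long, which is slow and tedious. A toy sketch of the pattern, using an in-memory SQLite database (the `events` table and its columns are hypothetical):

```python
import sqlite3

def delete_in_batches(conn, cutoff, batch_size=10_000):
    """Delete old rows from a hypothetical `events` table in small
    batches, committing between batches so locks are released quickly."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # each batch is its own short transaction
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total

# Demo: 100 rows, purge everything older than "day 90".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at INTEGER)")
conn.executemany("INSERT INTO events (created_at) VALUES (?)",
                 [(day,) for day in range(100)])
deleted = delete_in_batches(conn, cutoff=90, batch_size=25)
# deleted == 90; only the 10 newest rows remain
```

Multiply the row counts by a few orders of magnitude and this loop runs for hours, churning replication and the buffer pool the whole time.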
6. Data retention policies are harder to implement in the context of a transactional DB
Because of Point #5, it can be difficult to enforce a data retention policy. If the schema is really badly designed, it can be impossible. Consider the difficulty also of clearing data from historical database backups – this is far from trivial.
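Systems built for this data make retention cheap: with time-partitioned storage (e.g. Postgres declarative partitioning, or an S3 prefix per month), expiring old data is a fast drop of a whole partition rather than a massive DELETE. A sketch of the bookkeeping, assuming a hypothetical `events_YYYY_MM` monthly naming scheme:

```python
from datetime import date

def partitions_to_drop(existing, today, retention_months=6):
    """Given monthly partition names like 'events_2024_03' (a
    hypothetical naming scheme), return those older than the retention
    window. Dropping a partition is a quick metadata operation, unlike
    deleting millions of rows in place."""
    # Oldest (year, month) we must keep, counting months as integers.
    total = today.year * 12 + (today.month - 1) - retention_months
    cutoff = (total // 12, total % 12 + 1)
    drop = []
    for name in existing:
        _, y, m = name.rsplit("_", 2)
        if (int(y), int(m)) < cutoff:
            drop.append(name)
    return sorted(drop)

parts = [f"events_2024_{m:02d}" for m in range(1, 13)]
print(partitions_to_drop(parts, today=date(2024, 12, 15)))
# → ['events_2024_01', 'events_2024_02', 'events_2024_03',
#    'events_2024_04', 'events_2024_05']
```

Retrofitting this onto an unpartitioned production table that other systems already query is the hard part, which is why it pays to keep the data out of that table in the first place.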
7. Mixing analytical queries with application queries can lead to brownouts or downtime
A large analytical query that joins across a number of tables can starve production workloads of resources, leading to brownouts. Running those on a system ETL’d out of your production DB? Well, then I guess you don’t need that log data in the production DB!
8. Migrations of the table you’re logging to become untenable quickly
Migrations on massive tables are potentially dangerous and generally slow, and a high write throughput makes the situation much worse. With solutions that copy data into a new version of the table to avoid locking, it is also easy to run out of disk space on your database cluster.
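That disk-space problem comes from the copy itself: online schema-change tools (pt-online-schema-change, gh-ost) stream rows into a second copy of the table with the new schema, so both copies exist on disk at once. A toy sketch of the copy phase, using SQLite and hypothetical table names:

```python
import sqlite3

def copy_in_batches(conn, batch_size=1000):
    """Toy version of the copy phase behind online schema changes:
    stream rows from `events` into `events_new` (the altered schema)
    in keyset-paginated batches. While this runs, the table exists
    twice -- hence the roughly 2x disk requirement."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "INSERT INTO events_new (id, payload) VALUES (?, ?)", rows)
        conn.commit()
        last_id = rows[-1][0]  # resume after the last copied id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE events_new (id INTEGER PRIMARY KEY, payload TEXT, "
             "extra TEXT DEFAULT '')")  # the post-migration schema
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("e",)] * 2500)
copy_in_batches(conn, batch_size=1000)
```

Now imagine `events` is the biggest table in the cluster because it doubles as your log sink, and it keeps growing while the copy runs.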
9. You will have a lot of garbage data to sort through to find what you need
Most of what’s captured in logs is useless. Why store it in your expensive, highly-available production database?
10. There are better tools for the job
See: Kafka, Kinesis, S3, Athena, BigQuery, Apache Druid, Presto, Snowflake, Elasticsearch, DataDog, Amplitude, Mixpanel, Loggly, Hive, Segment.io, and ClickHouse, to name just a few technologies, off the top of my head, that can help you solve this problem.