Mon, Sep 13 2021

DipDup v3.0.0 release candidate introduces hooks, better scalability, and stability improvements

Our full-stack dapp development framework reaches another significant milestone.

Lev

DipDup is a framework for building selective indexers for Tezos DApps. It helps to reduce boilerplate code and lets developers focus on what's important: the business logic. It works on top of the TzKT API, which provides normalized and humanized blockchain data via REST and WebSocket endpoints. This article will guide you through the recent DipDup changes with code snippets and demo samples.

As time goes by, more cool projects on the Tezos blockchain choose DipDup as a backend solution. Besides being a joyful event for us, this also reveals new challenges for the framework.

Today we are proud to introduce the next major DipDup version. This time it's marked as a pre-release, which means we will continue to support the 2.0 branch until the release of the stable version. If you're asking yourself, "Should I upgrade now?" the answer is simple. There are three reasons not to wait for the stable 3.0 release:

You are experiencing issues when using index factories (processing originations matched by source/similar_to fields)
You need to conveniently execute lots of SQL scripts and scheduled jobs
You just want to be an early adopter and provide some valuable feedback 😃

In any case, think twice before using this version in production environments. Since almost every change in this version breaks backward compatibility, there will be no separate "Breaking Changes" paragraph this time. Instead, look for the warning ⚠ emoji in a section header to see whether you need to take action to perform the migration.

# New entity: Hooks

Before version 3.0.0-rc1, every project had two handlers called "default": on_configure fired before indexing starts and on_rollback fired when TzKT Datasource receives the reorg message. In addition, arbitrary SQL scripts from sql/on_restart and sql/on_reindex project directories could be executed on restart and reindex, respectively.

Later we realized there are some flaws in this approach:

"Default handlers" are not exactly handlers since they are not linked to any index.
Adding new events when needed could be painful.
There is no way to invoke SQL scripts from handlers and jobs.
Jobs are very similar to default handlers and SQL scripts: arbitrary code executed on a specific event (a schedule, in this case).

To solve these problems, we decided to significantly redesign this part of the framework and introduce hooks. Hooks are user-defined callbacks called either via the ctx.fire_hook method or by the scheduler (the jobs config section, we'll return to this topic later).

Let's assume we want to calculate some statistics on-demand to avoid blocking an indexer with heavy computations. Add the following lines to DipDup config:

hooks:
  calculate_stats:
    callback: calculate_stats
    atomic: False
    args:
      major: bool
      depth: int

A couple of things here to pay attention to:

The atomic option defines whether the hook callback is wrapped in a single SQL transaction. If this option is set to true, the main indexing loop is blocked until hook execution completes. Some statements, like REFRESH MATERIALIZED VIEW, do not need to be wrapped in a transaction, so setting atomic to false for such hooks can decrease the time needed to perform initial indexing.
Values of the args mapping are used as type hints in the signature of the generated callback. We will return to this topic later in this article.

Now it's time to call dipdup init. The following files will be created in the project's root:

├── hooks
│   └── calculate_stats.py
└── sql
    └── calculate_stats
        └── .keep

Content of the generated callback stub:

from dipdup.context import HookContext

async def calculate_stats(
    ctx: HookContext,
    major: bool,
    depth: int,
) -> None:
    await ctx.execute_sql('calculate_stats')

By default, hooks execute SQL scripts from the corresponding subdirectory of sql. Remove or comment out the execute_sql call to prevent this. This way, both Python and SQL code may be executed in a single hook if needed.

# ⚠ Default handlers require manual migration

Now it's time to get rid of deprecated "default handlers". Here's a mapping of old and new callbacks for internal DipDup events:

| `handlers` (old) | `sql` | `hooks` (new) |
| --- | --- | --- |
| `on_configure` | `on_restart` | `on_restart` |
| | `on_reindex` | `on_reindex` |
| `on_rollback` | | `on_rollback` |

Perform the following actions:

If you have any custom logic implemented in default handlers, move it to corresponding hooks using the table above to find the right destination.
Remove default handlers from the project's handlers directory.
The sql directory can be left as is.

Like in previous releases, unprocessed rollback leads to reindexing. Other events have no default action.

# ⚠ `jobs` become schedules for hooks

Since we already have an entity for user-defined callbacks (both Python and SQL ones), jobs can refer to hooks without having their own callbacks.

jobs:
  daily_cron_stats:
    hook: calculate_stats
    crontab: "0 0 * * *"
    args:
      major: True
      depth: 9000
  leet_interval_stats:
    hook: calculate_stats
    interval: 1337
    args:
      major: False
      depth: 1

If you already had job callbacks implemented in your project before updating to 3.0.0, you should convert those callbacks to hooks manually:

Comment out the jobs section in config. Add new items to the hooks section.
Call dipdup init to update project structure and generate callback stubs.
Move code from old job callbacks to new hook callbacks.
Remove the jobs directory from your project's root.
Restore the jobs section in config describing schedules for freshly created hooks as in an example above.

# Arguments typechecking

DipDup ensures that arguments passed to hooks have the correct types when possible; otherwise, a CallbackTypeError exception is raised. Values of the args mapping in a hook config should be either built-in types or the __qualname__ of an external type, like decimal.Decimal. Generic types are not supported: hints like Optional[int] = None are parsed correctly during codegen but ignored during type checking.
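To make the idea more tangible, here is a minimal sketch of how such a check could work. It is not DipDup's actual implementation: `resolve_type` and `check_hook_args` are hypothetical helpers, `CallbackTypeError` is redefined locally as a stand-in for the exception named above, and the `fee` argument is invented for illustration.

```python
import builtins
import importlib
from decimal import Decimal
from typing import Any, Dict


class CallbackTypeError(Exception):
    """Local stand-in for the exception raised on a type mismatch."""


def resolve_type(name: str) -> type:
    """Turn a name like 'int' or 'decimal.Decimal' into the type it refers to."""
    module_name, _, attr = name.rpartition('.')
    # Names without a dot are looked up among the built-ins.
    module = importlib.import_module(module_name) if module_name else builtins
    return getattr(module, attr)


def check_hook_args(declared: Dict[str, str], passed: Dict[str, Any]) -> None:
    """Raise CallbackTypeError if a passed value does not match its declared type."""
    for arg, type_name in declared.items():
        expected = resolve_type(type_name)
        if not isinstance(passed[arg], expected):
            raise CallbackTypeError(
                f'{arg}: expected {type_name}, got {type(passed[arg]).__name__}'
            )


# `fee` is a hypothetical argument that demonstrates an external type.
declared = {'major': 'bool', 'depth': 'int', 'fee': 'decimal.Decimal'}
check_hook_args(declared, {'major': True, 'depth': 1, 'fee': Decimal('0.5')})  # passes silently
```

Passing `depth='deep'` to the same call would raise `CallbackTypeError` in this sketch.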

# Context (`ctx`)

Here is a brief reminder of what a context is. The first argument of every callback in a DipDup project is called a context. Hook and handler callbacks receive instances of dipdup.context.HookContext and dipdup.context.HandlerContext, respectively. For now, these classes mostly share the same helper methods.

# ⚠ `add_contract` and `add_index` methods return coroutines

This change aims to save contracts and indexes spawned from within factories as soon as possible and thus correctly maintain the state of index factories.
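In practice, this means both calls have to be awaited. The sketch below uses `FakeContext`, a hypothetical stand-in for the real handler context, so that the snippet runs without DipDup installed. The parameter names (`name`, `address`, `typename`, `template`, `values`), the contract address, and the template name are assumptions made for illustration.

```python
import asyncio
from typing import Any, Dict, List, Tuple


class FakeContext:
    """Hypothetical stand-in for the handler context; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    async def add_contract(self, name: str, address: str, typename: str) -> None:
        self.calls.append(('add_contract', {'name': name, 'address': address, 'typename': typename}))

    async def add_index(self, name: str, template: str, values: Dict[str, Any]) -> None:
        self.calls.append(('add_index', {'name': name, 'template': template, 'values': values}))


async def on_factory_origination(ctx: FakeContext, address: str) -> None:
    """Register a freshly originated contract and spawn an index for it."""
    name = f'pool_{address}'
    # Without `await`, these calls would only create coroutine objects and do nothing.
    await ctx.add_contract(name=name, address=address, typename='pool')
    await ctx.add_index(name=f'{name}_index', template='pool_template', values={'contract': name})


ctx = FakeContext()
asyncio.run(on_factory_origination(ctx, 'KT1_example_address'))
```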

# ⚠ `commit` and `reset` methods removed

Those methods were used to notify DipDup that the config had been modified during callback execution and that it was time to spawn missing indexes. Now the only correct way to add a new index at runtime is to call the add_index method. Be careful! Modifying the config via ctx.config is not explicitly forbidden (this requirement is hard to enforce without extra CPU ticks), but adding a new item to the indexes section will have no effect.

# New methods: `fire_hook`, `execute_sql`

You can trigger hook execution either from a handler callback or by a job schedule. Or even from another hook if you're brave enough.

await ctx.fire_hook('calculate_stats', major=True, depth=1)

The same applies to the execute_sql method.

await ctx.execute_sql('calculate_stats')

The execute_sql argument can be either the name of a file or directory inside the project's sql directory, or an absolute or relative path. If the path is a directory, all scripts with the .sql extension within it are executed in alphabetical order.
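The lookup described above can be sketched as follows. `collect_sql_scripts` is a hypothetical helper, not part of DipDup. The order in which the two interpretations are tried, and the choice not to descend into nested directories, are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List


def collect_sql_scripts(sql_root: Path, name_or_path: str) -> List[Path]:
    """Resolve an execute_sql-style argument to an ordered list of script files."""
    candidate = Path(name_or_path)
    # Assumption: first try the argument as a name inside the project's sql
    # directory, then fall back to treating it as an absolute or relative path.
    path = sql_root / candidate if (sql_root / candidate).exists() else candidate
    if path.is_dir():
        # Only *.sql files directly inside the directory, in alphabetical order.
        return sorted(p for p in path.iterdir() if p.is_file() and p.suffix == '.sql')
    return [path]


with tempfile.TemporaryDirectory() as tmp:
    sql_root = Path(tmp) / 'sql'
    hook_dir = sql_root / 'calculate_stats'
    hook_dir.mkdir(parents=True)
    # File names are made up; `.keep` mirrors the placeholder created by `dipdup init`.
    for filename in ('02_refresh.sql', '01_prepare.sql', '.keep'):
        (hook_dir / filename).write_text('')
    scripts = [p.name for p in collect_sql_scripts(sql_root, 'calculate_stats')]
    print(scripts)  # ['01_prepare.sql', '02_refresh.sql']
```

Prefixing file names with numbers, as in this example, is one simple way to control execution order.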

# Hasura

# ⚠ Hasura integration requires schema_name to be `public`

The current version of Hasura GraphQL Engine treats public and other schemas differently. Table schema.customer becomes the schema_customer root field (or schemaCustomer if the camel_case option is enabled in the DipDup config). Table public.customer becomes the customer field, without a schema prefix. There's no way to remove this prefix for now. You can track the related issue on Hasura's GitHub to know when the situation changes. Since 3.0.0-rc1, DipDup enforces the public schema to avoid ambiguity and issues with the GenQL library. You can still use any schema name if Hasura integration is not enabled.
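The naming rule can be illustrated with a few lines of Python. `hasura_root_field` is a hypothetical helper that reproduces only the behavior described in this section; it is a simplification, not Hasura's actual naming code.

```python
def hasura_root_field(schema: str, table: str, camel_case: bool = False) -> str:
    """Approximate the root field name exposed for `schema.table`."""
    # Tables in the public schema get no prefix; other schema names are prepended.
    name = table if schema == 'public' else f'{schema}_{table}'
    if camel_case:
        head, *tail = name.split('_')
        name = head + ''.join(part.capitalize() for part in tail)
    return name


print(hasura_root_field('public', 'customer'))                   # customer
print(hasura_root_field('schema', 'customer'))                   # schema_customer
print(hasura_root_field('schema', 'customer', camel_case=True))  # schemaCustomer
```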

# Internal models

The internal table dipdup_state, which DipDup used to keep track of its own state, has been removed. Four new models replace it:

| model | table | description |
| --- | --- | --- |
| `dipdup.models.Schema` | `dipdup_schema` | Hash of the database schema, used to detect changes that require reindexing. |
| `dipdup.models.Index` | `dipdup_index` | Indexing status, level of the latest processed block, template, and template values if applicable. Relates to `Head` when status is `REALTIME` (see `dipdup.models.IndexStatus` for possible values of the `status` field). |
| `dipdup.models.Head` | `dipdup_head` | The latest block received by a datasource from a WebSocket connection. |
| `dipdup.models.Contract` | `dipdup_contract` | Nothing useful for us humans. It helps DipDup to keep track of dynamically spawned contracts. A contract with the same name from the config takes priority over one from this table, if one exists. |

With the help of these tables, you can set up monitoring of a DipDup deployment to know when something goes wrong:

SELECT NOW() - timestamp FROM dipdup_head;

# Index factories

# ⚠ `stateless` config option is removed

Index factories are now processed the same way as regular indexes. DipDup applies the following logic while restoring the states of indexes on restart:

Regular index: verify config hash and continue indexing
Templated index: recreate index config from the template using saved values, verify config hash
Templated index, but a template is missing: reindex
Regular index, but missing in config: ignore (maybe it's just commented out for a while)
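The four cases above can be expressed as a small decision function. This is only a sketch of the list, not DipDup's code: `Action` and `restore_action` are hypothetical names, and what happens when the config hash check fails is deliberately left out because it is not covered here.

```python
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    CONTINUE = 'verify config hash and continue indexing'
    RECREATE = 'recreate config from template, then verify config hash'
    REINDEX = 'reindex'
    IGNORE = 'ignore'


def restore_action(
    index_name: str,
    template: Optional[str],
    config_indexes: Set[str],
    config_templates: Set[str],
) -> Action:
    """Pick what to do with an index state found in the database on restart."""
    if template is not None:
        # Templated index: its config is rebuilt from the template and saved values.
        return Action.RECREATE if template in config_templates else Action.REINDEX
    if index_name in config_indexes:
        return Action.CONTINUE
    # A regular index missing from the config may just be commented out for a while.
    return Action.IGNORE


# Index and template names below are made up for the example.
print(restore_action('registry', None, {'registry'}, {'pool_template'}))   # Action.CONTINUE
print(restore_action('pool_1', 'pool_template', set(), {'pool_template'})) # Action.RECREATE
print(restore_action('pool_1', 'pool_template', set(), set()))             # Action.REINDEX
print(restore_action('old_index', None, {'registry'}, set()))              # Action.IGNORE
```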

# Miscellaneous

⚠ first_block/last_block fields were renamed to first_level and last_level respectively (used with --oneshot CLI flag only).
⚠ init command does not overwrite typeclasses that have been already generated. Use the --overwrite-types flag if it's not the desired behavior.
A long-awaited fix for a graceful shutdown. No more ugly stack traces on SIGTERM 🎉
SQL scripts are executed with one transaction per statement. Queries that must be executed in a single transaction can now be placed in the same file.
Exceptions raised during job callback execution are now considered critical and cause DipDup to crash.
Fixed an issue when views and some other database entities survive reindexing.
If callback execution takes longer than one second, a warning will be printed. Increase level of dipdup.callbacks logger to print it every time.

# ⚠ Known issues

Multiple issues related to the WebSocket connection have been reported, and TzKT outages are not processed correctly. We are aware of these issues and will try to fix them as soon as possible. DipDup crashes caused by WebSocket issues do not corrupt data already indexed, so a simple restart of the application is enough.

# What's next?

The most critical task is the ability to subscribe to operations by an entrypoint rather than by specific addresses. This change should drastically reduce the load on the TzKT server for index factories with hundreds of originations.
Rollbacks of more than one block are infrequent but inevitable. We are going to implement the hotswap of database schemas to preserve data processed before rollback until reindexing is complete.
Support streaming replication to make DipDup more scalable.
Support sending transactions from DipDup in addition to indexing them. This is not a 20-minute adventure, so no ETA yet.

DipDup is a free, open-source project driven by the needs of fellow Tezos developers like you. Let us know what you think about the recent changes and our further plans! Come join the Baking Bad Telegram group, the #baking-bad channel in the tezos-dev Slack workspace, and our Discord server.
