Cancellation Shields and Scoped Lifestyle Resource Management

Unlike unstructured concurrency, structured concurrency propagates cancellation automatically from a task to its children. Canceling a task signals the entire task tree below it through a shared mechanism. This mechanism is cooperative: cancellation only takes effect, and is only reported, at the points where the running code checks for it.
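To make the cooperative part concrete, here is a minimal sketch. The function countCooperatively is hypothetical and not part of any library. It stops only because it checks Task.isCancelled itself, and it chooses to return the partial result instead of discarding it.

```swift
// Hypothetical worker: counts up to `limit`, one step at a time.
func countCooperatively(upTo limit: Int) async -> Int {
    var count = 0
    while count < limit {
        // The cooperative check: without it, cancel() would have no effect here.
        if Task.isCancelled { break }
        count += 1
        // Suspension point, so other tasks get a chance to run.
        await Task.yield()
    }
    // The author decides what to report: here, the partial result.
    return count
}

let countingTask = Task { await countCooperatively(upTo: 1_000_000) }
countingTask.cancel()
let partialCount = await countingTask.value
print("counted \(partialCount) steps before stopping")
```

Replacing the isCancelled check with try Task.checkCancellation() would be the alternative design: the function would then throw CancellationError and discard the partial count.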

Library and application authors are free to choose how their types react to and report cancellation. Depending on the type of work, the author decides where to put the cancellation check and whether to return or discard the partial result. The Swift standard library checks for cancellation automatically in some of its async APIs, for example Task.sleep(nanoseconds:), which we'll use in the examples below.

The approach to handling cancellation varies based on the type of work being done. In most cases, we simply want to stop the request or command. However, in certain situations, you may need to perform clean-up on a resource regardless of whether the task completes successfully or is canceled due to an external or internal failure. But how would you handle this if a parent node cancels the entire task tree while a child node is still performing the necessary clean-up?

The solution is to protect the asynchronous work from being canceled. This is known as a "cancellation shield" or "async shield." In Swift, this involves wrapping the asynchronous function in an unstructured Task, which does not inherit the cancellation of the task that created it, and immediately awaiting its value:

await Task {
    await work()
}.value

A shielded task can be grouped with other cancellable tasks inside a parent task. When the parent task is canceled, the cancellation will successfully stop the cancellable child tasks, but the shielded async task will continue to run uninterrupted.

I'd love to dive into an example right away, but bear with me a bit longer so I can present it more clearly. To set things up, let's first discuss scoped resource lifestyle management.

Scoped lifestyle management and with-style methods

When defining "lifestyles" for a resource, a scoped lifestyle refers to creating a single instance of a resource within a clearly defined scope. When the scope ends, so does the resource's lifetime. In Swift, this scoped lifestyle often appears in the form of "with-style" methods.

"with-style" functions provide access to one instance of a resource and clean the resource up when the underlying scope ends. There are many examples of these methods in the standard library, such as the with(Throwing/Discarding)TaskGroup family and array.withUnsafeBytes, to name a few.

// All child tasks are awaited before the group's scope returns.
await withTaskGroup(...) { group in
    // Add tasks to the group
}

try await withThrowingTaskGroup(...) { group in
    // Add tasks to the group
}

await withDiscardingTaskGroup(...) { group in
    // Add tasks to the group
}

[1,2,3].withUnsafeBytes { pointer in
    // The pointer argument is valid only for the duration of the function's execution.
}

In the server-side ecosystem we find postgresClient.withConnection from PostgresNIO.

try await postgresClient.withConnection { connection in
    // Leases a connection for the provided closure's lifetime
}

In all of these methods, it is wrong to store the resource (pointer, connection or TaskGroup) for later use.

With-style methods allow us to hide the implementation details of how the resource is cleaned up. For instance, imagine a resource that is not going to cross isolation domains and has a synchronous clean-up method we need to call. Then we can write a simple with-style method as

struct Resource {
    func interact() {
        // ...
    }
    
    func shutdown() {
        // ...
    }
}

func withResource(
    _ body: (Resource) -> Void
) {
    let resource = Resource()
    defer { resource.shutdown() }
    body(resource)
}

withResource { resource in
    resource.interact()
}
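As a small extension of the same idea, the sketch below lets the closure return a value and throw. TrackedResource, InteractionError and withTrackedResource are hypothetical names introduced only for illustration. Because the clean-up sits in a defer statement, it also runs when the body throws.

```swift
// Hypothetical resource that records whether it has been shut down.
final class TrackedResource {
    private(set) var isShutDown = false

    func interact(failing: Bool) throws -> Int {
        if failing { throw InteractionError() }
        return 42
    }

    func shutdown() {
        isShutDown = true
    }
}

// Hypothetical error type used by the sketch.
struct InteractionError: Error {}

func withTrackedResource<Result>(
    _ body: (TrackedResource) throws -> Result
) rethrows -> Result {
    let resource = TrackedResource()
    // Runs on both the returning and the throwing path.
    defer { resource.shutdown() }
    return try body(resource)
}

let answer = try withTrackedResource { resource in
    try resource.interact(failing: false)
}
print(answer)
```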

Things become more interesting when the shutdown process takes a significant amount of time. In such cases, we can use an asynchronous clean-up method. At the time of writing, a defer statement cannot contain asynchronous calls, so the clean-up has to be written out explicitly. Additionally, we want to ensure that the shutdown always runs to completion, which brings us back to the topic of cancellation shields. Let's explore how this would work in a simple example.

Example

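Consider the following Scanner, whose scanning and shutdown operations are both asynchronous. The .async property on sequences comes from the swift-async-algorithms package:

import AsyncAlgorithms

final class Scanner {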
    func shutdown() async {
        var iterator = Array(1...5).reversed().async.makeAsyncIterator()

        while let i = await iterator.next() {
            try? await Task.sleep(nanoseconds: 1_000_000_000)
            print("Shutting down in \(i) seconds...")
        }
    }

    func scan() async {
        var iterator = (1...3).async.makeAsyncIterator()

        while let _ = await iterator.next() {
            try? await Task.sleep(nanoseconds: 1_000_000_000)
            print("scanning...")
        }
        print("finished scanning")
    }
}

The scanner scans for 3 seconds before finishing, and it takes 5 seconds to shut down. We introduce a with-style function, withScanner, that lets us interact with the Scanner resource and takes care of the shutdown. This way the shutdown is always called after the scanner is used, without having to write it explicitly each time. Furthermore, we allow this scoped function to be called from arbitrary actor-isolated contexts.

func withScanner(
    isolation: isolated (any Actor)? = #isolation,
    _ execute: (inout sending Scanner) async -> Void
) async throws {
    var scanner = Scanner()
    await execute(&scanner)
    // Cancellation shield
    await Task {
        await scanner.shutdown()
    }.value
}

Notice how we've added a cancellation shield to the asynchronous shutdown to prevent resource leaks. With this in place, imagine starting a scan and then canceling the operation midway through:

let task = Task {
    try? await withScanner { scanner in
        await scanner.scan()
    }
}

try? await Task.sleep(nanoseconds: 1_000_000_000)

task.cancel()
print("task canceled")

try? await Task.sleep(nanoseconds: 6_000_000_000)

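As expected, we get the output below. The scan loop ends early because the .async sequence stops producing elements once its task is canceled, while the shielded shutdown still runs its full countdown: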

scanning...
task canceled
finished scanning
Shutting down in 5 seconds...
Shutting down in 4 seconds...
Shutting down in 3 seconds...
Shutting down in 2 seconds...
Shutting down in 1 seconds...

What if we cancel the parent work in the middle of the shutdown?

let task = Task {
    try? await withScanner { scanner in
        await scanner.scan()
    }
}

try? await Task.sleep(nanoseconds: 5_000_000_000)

task.cancel()
print("task canceled")

// Keep the test alive until the shielded shutdown has finished
try? await Task.sleep(nanoseconds: 5_000_000_000)

The shutdown still runs to completion!

scanning...
scanning...
scanning...
finished scanning
Shutting down in 5 seconds...
task canceled
Shutting down in 4 seconds...
Shutting down in 3 seconds...
Shutting down in 2 seconds...
Shutting down in 1 seconds...
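The shield used in withScanner can be factored into a small helper. The sketch below is one possible shape, with shielded and observeCancellation as hypothetical names. It checks the key property directly: the surrounding task is canceled, yet the code inside the shield does not see the cancellation.

```swift
// Hypothetical helper: runs `operation` in an unstructured task, which does not
// inherit cancellation from the task that calls `shielded`.
func shielded<T: Sendable>(
    _ operation: @escaping @Sendable () async -> T
) async -> T {
    await Task { await operation() }.value
}

// Reports whether cancellation is visible outside and inside the shield.
func observeCancellation() async -> (outside: Bool, inside: Bool) {
    let parent = Task { () -> (Bool, Bool) in
        // Wait until the cancellation issued below has arrived.
        while !Task.isCancelled { await Task.yield() }
        let inside = await shielded { Task.isCancelled }
        return (Task.isCancelled, inside)
    }
    parent.cancel()
    let observed = await parent.value
    return (outside: observed.0, inside: observed.1)
}

let observation = await observeCancellation()
print(observation) // (outside: true, inside: false)
```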

This can be helpful for services that use async-http-client, where the shutdown must be triggered manually.

The following variant can be useful when the interaction with the resource, or its shutdown, may throw errors.

func withThrowingScanner<ReturnType: Sendable>(
    isolation: isolated (any Actor)? = #isolation,
    _ execute: (inout sending ThrowingScanner) async throws -> ReturnType
) async throws -> ReturnType {
    var scanner = ThrowingScanner()
    let result: ReturnType
    do {
        result = try await execute(&scanner)
    } catch {
        // Cancellation shield. If shutdown() throws here, its error replaces the original one.
        try await Task {
            try await scanner.shutdown()
        }.value
        throw error
    }

    // Cancellation shield for the success path
    try await Task {
        try await scanner.shutdown()
    }.value

    return result
}
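One way to avoid writing the shielded shutdown twice is to capture the outcome of the closure in a Result first. The sketch below assumes a hypothetical ThrowingScanner, written as an actor to keep the example short, together with a hypothetical ScanError; the post does not show either type, so treat both as stand-ins.

```swift
// Hypothetical error and scanner types, for illustration only.
enum ScanError: Error { case paperJam }

actor ThrowingScanner {
    private(set) var isShutDown = false

    func scan(failing: Bool) throws -> String {
        if failing { throw ScanError.paperJam }
        return "page"
    }

    func shutdown() async throws {
        isShutDown = true
    }
}

func withThrowingScannerSketch<ReturnType: Sendable>(
    isolation: isolated (any Actor)? = #isolation,
    _ execute: (ThrowingScanner) async throws -> ReturnType
) async throws -> ReturnType {
    let scanner = ThrowingScanner()
    // Capture success or failure so the shutdown is written only once.
    let outcome: Result<ReturnType, any Error>
    do {
        outcome = .success(try await execute(scanner))
    } catch {
        outcome = .failure(error)
    }
    // Cancellation shield, shared by both paths.
    try await Task {
        try await scanner.shutdown()
    }.value
    // Returns the value or rethrows the error from `execute`.
    return try outcome.get()
}

let scannedPage = try await withThrowingScannerSketch { scanner in
    try await scanner.scan(failing: false)
}
print(scannedPage)
```

As in the version above, an error thrown by shutdown() takes precedence over an error thrown by the closure.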

For a further look at the isolation aspect of with-style functions, I'd like to refer you to Franz Busch's talk "Leveraging structured concurrency in your applications" at Server Side Swift Conference 2024. This talk inspired the writing of this post and the exploration of cancellation shields.

Other Options

The standard library provides a related function with a cancellation handler, withTaskCancellationHandler(operation:onCancel:isolation:), but its semantics are different. A common pattern is to share an atomic flag (or any other value that is safe to access concurrently) between the operation and the handler, and to check it during execution. This differs from the cooperative mechanism based on checkCancellation(), because the onCancel closure is always called immediately when the task is canceled. The purpose of the onCancel closure here is to flip the flag so that the operation stops running. If cancellation occurs while the operation is in progress, the cancellation handler runs concurrently with the operation.

In this case, we reduce the scan() method to a single step and introduce a flag that drives the scanning loop.

import AsyncAlgorithms
import Atomics

final class AtomicScanner {
    func shutdown() async {
        var iterator = Array(1...5).reversed().async.makeAsyncIterator()
        while let i = await iterator.next() {
            try? await Task.sleep(nanoseconds: 1_000_000_000)
            print("Shutting down in \(i) seconds...")
        }
    }

    func scan() async {
        try? await Task.sleep(nanoseconds: 1_000_000_000)
        print("scanning...")
    }
}

func withScannerHandler(
    isolation: isolated (any Actor)? = #isolation,
    _ execute: (AtomicScanner) async -> Void
) async throws -> Void {
    let condition = ManagedAtomic<Bool>(true)
    await withTaskCancellationHandler {
        let scanner = AtomicScanner()
        while condition.load(ordering: .relaxed) {
            await execute(scanner)
        }
        print("finished scanning")
        await scanner.shutdown()
        print("exit")
    } onCancel: {
        condition.store(false, ordering: .relaxed)
    }
}

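On cancellation, the operation observes the flipped flag, leaves the loop, and calls shutdown. The task is already canceled at that point, however, so the countdown inside shutdown is skipped: its .async iterator stops as soon as it sees the canceled task. We can test that with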

let task = Task {
    try await withScannerHandler { scanner in
        await scanner.scan()
    }
}

try? await Task.sleep(nanoseconds: 2_000_000_000)

task.cancel()
print("task canceled")

try? await Task.sleep(nanoseconds: 6_000_000_000)

where we get

scanning...
task canceled
scanning...
finished scanning
exit

By adding the cancellation shield to the scanner.shutdown() call, we can observe the shutdown countdown occurring as expected.
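A minimal sketch of that shielded variant follows. To keep it free of package dependencies it makes a few assumptions: LockedFlag is a hypothetical lock-based stand-in for ManagedAtomic<Bool>, and CountingScanner is a hypothetical scanner whose shutdown stops early when its task is canceled, mimicking the countdown above. The function returns the number of completed shutdown steps so the effect of the shield can be checked.

```swift
import Foundation

// Hypothetical stand-in for ManagedAtomic<Bool>, guarded by a lock.
final class LockedFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var value: Bool
    init(_ value: Bool) { self.value = value }
    func load() -> Bool {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
    func store(_ newValue: Bool) {
        lock.lock()
        defer { lock.unlock() }
        value = newValue
    }
}

// Hypothetical scanner with a cancellation-sensitive shutdown.
actor CountingScanner {
    private(set) var shutdownSteps = 0
    func scan() async { await Task.yield() }
    func shutdown() async {
        for _ in 1...5 {
            // Stops early if the task running the shutdown is canceled.
            guard !Task.isCancelled else { return }
            shutdownSteps += 1
            await Task.yield()
        }
    }
}

func withShieldedScannerHandler(
    isolation: isolated (any Actor)? = #isolation,
    _ execute: (CountingScanner) async -> Void
) async -> Int {
    let condition = LockedFlag(true)
    return await withTaskCancellationHandler {
        let scanner = CountingScanner()
        while condition.load() {
            await execute(scanner)
        }
        // Cancellation shield: the unstructured task does not see the cancellation.
        await Task { await scanner.shutdown() }.value
        return await scanner.shutdownSteps
    } onCancel: {
        condition.store(false)
    }
}
```

Called from a task that is later canceled, the function is expected to report all five shutdown steps; without the Task wrapper around shutdown(), the guard would stop the countdown immediately.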

Conclusion