Migrating Tasking Manager: From Flask to FastAPI and psycopg2 to asyncpg

In the ever-evolving world of software development, performance and maintainability are crucial. Tasking Manager, an open-source project, is no exception. As part of our continuous efforts to improve our codebase and address technical debt, we recently undertook a significant migration: transitioning from Flask to FastAPI for our web framework and from psycopg2 to asyncpg for our database interactions. This post will walk you through our journey, the rationale behind these changes, and the benefits we anticipate.

Why FastAPI and asyncpg?

FastAPI:

  • Performance

FastAPI is built on top of Starlette for the web parts and Pydantic for the data parts, making it incredibly fast and efficient. It leverages asynchronous programming using Python’s async and await syntax, allowing for non-blocking operations. This means FastAPI can handle many requests simultaneously, resulting in high performance, especially in I/O-bound operations.
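As a minimal illustration (not the actual Tasking Manager code), an async path operation lets the event loop serve other requests while this one is waiting on I/O:

```python
import asyncio

from fastapi import FastAPI

app = FastAPI()

@app.get("/projects/{project_id}")
async def get_project(project_id: int):
    # Stand-in for a non-blocking I/O call such as a database query;
    # while we await, the event loop is free to handle other requests.
    await asyncio.sleep(0.1)
    return {"project_id": project_id}
```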

  • Ease of Use

FastAPI simplifies API development by automatically generating interactive API documentation using Swagger UI and ReDoc. This feature is invaluable for developers as it provides a visual interface to explore and test API endpoints, making both development and API consumption easier and more intuitive.

  • Modern Features

FastAPI embraces modern Python features, including type hints. Type hints improve code readability by clearly specifying the expected types of function arguments and return values. They also reduce errors by enabling automatic validation and providing better support for IDEs and type checkers.
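For example, here is a small sketch (a hypothetical model, not the real Tasking Manager schema) of how type hints drive both validation and the generated documentation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CampaignIn(BaseModel):
    # The type hints double as validation rules: a request body with a
    # non-integer "id" or a missing "name" is rejected with a 422 error
    # before our handler code ever runs.
    id: int
    name: str
    description: str | None = None

@app.post("/campaigns/")
async def create_campaign(campaign: CampaignIn) -> CampaignIn:
    return campaign
```

The same annotations also feed the Swagger UI / ReDoc pages mentioned above, so the request and response schemas are documented for free.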

asyncpg:

  • Asynchronous Support

asyncpg is designed specifically for asynchronous programming. It enables the handling of multiple database queries concurrently, making it possible to perform several operations at once without blocking the execution of the program. This concurrent handling significantly improves the throughput of database interactions.
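A hedged sketch of what concurrent queries look like with asyncpg (the connection string and table names below are placeholders):

```python
import asyncio

import asyncpg

async def main():
    # Placeholder DSN; substitute your own database credentials.
    pool = await asyncpg.create_pool("postgresql://user:pass@localhost/taskingmanager")

    async def count_rows(table: str) -> int:
        # Table names here are fixed constants, never user input.
        async with pool.acquire() as conn:
            return await conn.fetchval(f"SELECT count(*) FROM {table}")

    # Both queries are in flight at the same time instead of one after the other.
    projects, tasks = await asyncio.gather(count_rows("projects"), count_rows("tasks"))
    print(projects, tasks)

    await pool.close()

asyncio.run(main())
```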

  • Performance

asyncpg is highly optimized and often outperforms other PostgreSQL drivers. Its focus on efficiency and speed makes it a top choice for applications that require fast, reliable database operations. The driver takes advantage of asynchronous programming to minimize latency and maximize performance.

The chart below shows the geometric mean of benchmarks obtained with PostgreSQL client driver benchmarking toolbench in June 2023.

To understand synchronous and asynchronous execution better, we first need to distinguish between subroutines and co-routines. A subroutine is a block of code that can be invoked as needed, transferring control of the program to it and returning to the main program once its task is complete. Subroutines run until they finish and cannot be paused and resumed.

Conversely, a co-routine is a special type of function that allows its execution to be paused and resumed, maintaining its state between pauses. This capability makes co-routines ideal for tasks that involve waiting, such as I/O operations, database calls, and HTTP requests. The term “co-routine” combines “co” (together) and “routine,” suggesting routines that can run cooperatively.

In a typical single-threaded application, all code and subroutines run sequentially, which is simple but can be inefficient. To optimize resource utilization, we use concurrency and parallelism. Concurrency allows the start and stop times of multiple co-routines to overlap, while parallelism enables different threads to execute simultaneously.

Now, let’s see how to implement concurrency in Python using the async and await keywords and the asyncio module. We’ll create two functions: fry_egg and make_toast. The fry_egg function will use the sleep function to simulate a 4-second task, while make_toast will simulate a 3-second task. Running these functions synchronously would take 7 seconds, but there’s no need to wait for the egg to finish frying before starting to make the toast. We can make this process more efficient using co-routines.

Synchronous Code:
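A minimal sketch of the synchronous version (the exact listing from the original post is not reproduced here; time.sleep stands in for the blocking work):

```python
import time

def fry_egg():
    print("Frying egg...")
    time.sleep(4)  # simulate a 4-second blocking task
    print("Egg is ready")

def make_toast():
    print("Making toast...")
    time.sleep(3)  # simulate a 3-second blocking task
    print("Toast is ready")

start = time.perf_counter()
fry_egg()
make_toast()
print(f"Breakfast ready in {time.perf_counter() - start:.0f} seconds")  # ~7 seconds
```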

Asynchronous Code:
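And a sketch of the asynchronous version, where asyncio.gather lets both co-routines wait at the same time:

```python
import asyncio
import time

async def fry_egg():
    print("Frying egg...")
    await asyncio.sleep(4)  # simulate a 4-second task without blocking the event loop
    print("Egg is ready")

async def make_toast():
    print("Making toast...")
    await asyncio.sleep(3)  # simulate a 3-second task without blocking the event loop
    print("Toast is ready")

async def main():
    start = time.perf_counter()
    await asyncio.gather(fry_egg(), make_toast())  # run both co-routines concurrently
    print(f"Breakfast ready in {time.perf_counter() - start:.0f} seconds")  # ~4 seconds

asyncio.run(main())
```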

By using asyncio, the total time is reduced to 4 seconds, showcasing the efficiency of asynchronous programming. This demonstrates how co-routines can help manage tasks that can run concurrently, making better use of computing resources and reducing overall execution time.

The Migration Process

We began our migration by focusing on the most critical API: the endpoint for getting projects. This API was essential as it interacted with many parts of the application. Successfully migrating this endpoint gave us confidence and a deeper understanding of how to proceed with the rest of the project.

Next, we decided to migrate the codebase module by module. We compiled a comprehensive list of all the APIs that needed migration and documented them by module. This structured approach ensured that we could track progress and tackle each part of the application systematically.

We started with the campaigns and organizations modules, as these were less complex and provided a good testing ground for our new asynchronous setup. Completing these modules allowed us to identify and resolve potential issues early on, minimizing complications when moving on to more complex APIs.

Currently, we have completed the campaigns and organizations modules, including all the CRUD operations and related functionalities. This approach ensured that we covered various aspects of asynchronous programming and avoided common pitfalls.

The teams module and parts of the project module are also in progress. Specifically, project list and retrieve functionalities have been successfully migrated, and work on other parts of the project module is ongoing. This incremental and modular approach has allowed us to manage the complexity of the migration effectively and ensure each part is thoroughly tested before moving on.

To align with the architectural patterns encouraged by FastAPI, we are also exploring refactors like incorporating Pydantic field validators to streamline variable assembly and also investigating libraries like ‘databases’ to effectively utilize SQLAlchemy Core expressions. These efforts aim to make the codebase more manageable and to ensure a robust and maintainable application structure.
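As an illustration of this direction (hypothetical table and column names, not the actual Tasking Manager schema), the ‘databases’ library lets SQLAlchemy Core expressions run through an async driver:

```python
import asyncio

import sqlalchemy as sa
from databases import Database

# Placeholder connection string for illustration only.
database = Database("postgresql+asyncpg://user:pass@localhost/taskingmanager")

metadata = sa.MetaData()
projects = sa.Table(
    "projects",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("status", sa.String),
)

async def list_published_project_ids() -> list[int]:
    # A SQLAlchemy Core expression executed without blocking the event loop.
    query = sa.select(projects.c.id).where(projects.c.status == "PUBLISHED")
    rows = await database.fetch_all(query)
    return [row["id"] for row in rows]

async def main():
    await database.connect()
    print(await list_published_project_ids())
    await database.disconnect()

asyncio.run(main())
```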

Challenges

  • Learning Curve: Transitioning to asynchronous programming can be challenging for developers unfamiliar with the concept.
  • Refactoring Code: Migrating from Flask to FastAPI and shifting from synchronous to asynchronous database interactions requires substantial refactoring:
      ◦ Updating each CRUD function and library to ensure compatibility with the async framework.
      ◦ Debugging asynchronous code, which introduces additional challenges due to concurrency and potential race conditions; proper error handling and propagation must be maintained.
      ◦ Rewriting Flask routes as FastAPI endpoints while ensuring that all functionality is preserved and that clients consuming the API are informed of any significant changes.
      ◦ Managing concurrent requests and shared state carefully in an asynchronous environment.
      ◦ Updating deployment configurations and ensuring Docker containers are correctly set up for FastAPI and asyncpg, including environment variables and volume mounting, which is crucial for a smooth transition.
  • Libraries and Third-Party Integration: Deciding whether to use libraries like ‘databases’ or async SQLAlchemy involves evaluating their suitability for your specific needs and getting familiar with their functionalities. This decision impacts how non-blocking database operations are managed in your application. Additionally, integrating these libraries with third-party services can be challenging, as not all Flask extensions or services have direct equivalents for FastAPI and asyncpg. This may require finding suitable alternatives, writing custom solutions, or ensuring that existing integrations function correctly with the new asynchronous framework.
  • Lazy and Eager Loading: Properly managing lazy and eager loading of relationships can be challenging in an asynchronous environment. In traditional synchronous frameworks like Flask with SQLAlchemy, you may have well-established patterns for loading related data. However, in FastAPI with asyncpg, you need to adapt these patterns to work efficiently with async database operations. This involves ensuring that related data is loaded appropriately without causing performance issues or excessive database queries, which may require rethinking how you handle relationships and optimize queries in the new setup (see the sketch after this list).
  • Codebase Refactoring: Organizing the codebase to fit the architectural patterns encouraged by FastAPI, which differ from those used in Flask.
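To make the lazy/eager loading point concrete, here is a hedged sketch using SQLAlchemy’s async ORM (which runs on top of asyncpg) with hypothetical Project/Task models; the real Tasking Manager models differ. An implicit lazy load inside an async session would fail, so the relationship is loaded eagerly:

```python
from sqlalchemy import ForeignKey, select
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    mapped_column,
    relationship,
    selectinload,
)

class Base(DeclarativeBase):
    pass

class Project(Base):
    __tablename__ = "projects"
    id: Mapped[int] = mapped_column(primary_key=True)
    tasks: Mapped[list["Task"]] = relationship(back_populates="project")

class Task(Base):
    __tablename__ = "tasks"
    id: Mapped[int] = mapped_column(primary_key=True)
    project_id: Mapped[int] = mapped_column(ForeignKey("projects.id"))
    project: Mapped["Project"] = relationship(back_populates="tasks")

# Placeholder connection string for illustration only.
engine = create_async_engine("postgresql+asyncpg://user:pass@localhost/taskingmanager")
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)

async def get_project_with_tasks(project_id: int) -> Project | None:
    async with SessionLocal() as session:
        # selectinload fetches the related tasks up front in a second query,
        # so touching project.tasks later needs no further I/O.
        result = await session.execute(
            select(Project)
            .options(selectinload(Project.tasks))
            .where(Project.id == project_id)
        )
        return result.scalar_one_or_none()
```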

Migrating Tasking Manager from Flask to FastAPI and from psycopg2 to asyncpg is an ongoing journey. While we are still in the process, we are optimistic about the potential improvements in performance and scalability. For any open-source projects or applications looking to enhance performance and scalability, adopting FastAPI and asyncpg is a promising step. We hope our experience so far provides valuable insights for the community and encourages others to explore these tools.

Feel free to contribute to Tasking Manager or reach out with any suggestions. Your feedback, suggestions, and collaboration have been instrumental in driving this project forward. Here’s to continued innovation and improvement in the world of open-source software.


Thank you for the post. But let me be a little bit skeptical, because I am always skeptical. These are my reasons:

  • There is no pull request, so I went looking for branches. There are fastapi-develop and tasking-manager-fastapi, which introduce 30 and 51 new commits respectively. Both are 243 commits behind develop, i.e., they are based on the code from March. This usually complicates merging the new feature into the current code base.

  • Looking over the changes in fastapi-develop, they mostly consist of changes in imports and minimal other changes to make the actual spaghetti code work with FastAPI instead of Flask. That makes the migration much harder than it would be to first refactor the code to get rid of unused and/or bad things, and only then migrate to FastAPI.

  • The most critical API is that of tasks. This is where mappers and validators ask HOT TM to help them with their work. Looking at the changes in backend/api/tasks/actions.py, sure, it became a FastAPI route, but I can see no await within the code. Thus, there is no advantage from the concurrency yet.

  • I do not see any performance analysis, profiling, or load testing to understand the bottlenecks. I do not see any design considerations.

  • Last, let me make a note on race conditions. The race conditions in HOT TM arise because many people use it, not because of complicated data flow. The flow of the data is pretty straightforward: from the API endpoint, through small rearrangements, into the database, and back. That’s it. I am saying this because I think that HOT TM developers sometimes forget it. The complexity arises from many people using a simple data flow, and from an unfortunate (politely said) code base. Therefore, I am afraid that while currently the race conditions can be observed at the edges of procedures (or functions or methods), introducing async could surface code base difficulties in places no one expects, increasing the number of failures. I hope I am wrong.


Thanks for the candid feedback!

  • I 100% agree about regular rebasing / keeping branches up to date :+1:

  • While I agree with you that a redesign prior to migration would have been easier, we are really lacking in available time and resources! I guess the fear is that we don’t want to completely redesign the system and risk having to spend a lot of time debugging things. Currently only bug fixes and minor improvements are ongoing for TM. The new design is definitely being incorporated into FMTM and Drone TM however :smile:

  • I think we are migrating to an async database driver, to derive some benefit from the concurrency - work ongoing!

  • I did this brief performance assessment for some database queries: Tasking manager indexing vs partitioning research · GitHub. It was a very brief start as I didn’t have much time, but please feel free to comment!

  • You raise a good point about race conditions and async potentially introducing more issues - time and testing will tell! Do you feel the design you proposed would eliminate the race conditions from multiple concurrent users?

Thanks again!


Thank you for the reply. I would like to elaborate on some parts:

While I agree with you that a redesign prior to migration would have been easier, we are really lacking in available time and resources! I guess the fear is that we don’t want to completely redesign the system and risk having to spend a lot of time debugging things. Currently only bug fixes and minor improvements are ongoing for TM. The new design is definitely being incorporated into FMTM and Drone TM however :smile:

I do understand these concerns. My note was more about unused code, like zoom level, as pointed out in Dive into the HOT Tasking Manager codebase.

One part of code maintenance, as I understand it, is removing stuff that’s not being used anymore. Doing so helps with further refactoring, like switching the API framework, and, in my opinion, it is very neglected.

I did this brief performance assessment for some database queries: Tasking manager indexing vs partitioning research · GitHub. It was a very brief start as I didn’t have much time, but please feel free to comment!

In my opinion, it’s crucial to have a performance analysis of the whole HOT TM to find out its bottlenecks, and I have a feeling that performance analysis is underestimated. What I can understand from the published assessment is that indexes make sense, which is true for sure. However, I believe there is plenty of room for improvement in the performance analysis.

You raise a good point about race conditions and async potentially introducing more issues - time and testing will tell! Do you feel the design you proposed would eliminate the race conditions from multiple concurrent users?

Instead of eliminating them, I would say that a good design will help to minimize and localize race conditions. A made-up example is when two mappers simultaneously ask the application to give them a random task to map.

In a poorly designed application, each request from a mapper triggers the following steps:

  1. Receive request with project identifier.
  2. Get all the project’s tasks to map from the database and store them in a list.
  3. Choose randomly one of the tasks in the list.
  4. Update status of that task in the database.
  5. Respond to the mapper with the task identifier.

We can see some warning signs here:

  • Even if the communication with the database is serialized, the state of the database can change between the two accesses. In other words, if we performed (2) again at the point of (4), the two lists of “project’s tasks to map” could differ.

  • The computation in (3) may return the same number for multiple users. This is a different problem than the previous one, and because of the birthday problem, it is not insignificant.

  • The flow of the data is complicated, which introduces more locations where a problem may occur: mapper → application (receive request) → database (request tasks to map) → application (select random task) → database (update task’s status) → application (respond to the mapper) → mapper.

I believe we can extract some principles from the above:

  • Simplify data flow to: mapper → application → database → application → mapper. Ideally, there is a single database query per request.

  • It may be impossible to have a single database query per request. Then, put the multiple database requests within a single transaction.

  • The above requires the database-related code to be localized within the application. It is nearly impossible to serialize database requests that are scattered all over the code base. In particular, HOT TM is not a good use case for a CRUD application.

  • Eliminate computation in the application and use smart database queries instead. This approach helps to localize possible problems. In particular, in the example above, instead of (2), (3), and (4), use a combination of WITH ... AS (...), ORDER BY RANDOM(), and INSERT INTO ... or UPDATE ... in a single serialized database query (see the sketch below).
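For illustration, a hedged sketch of such a combined query executed through asyncpg; the table and column names are hypothetical, and FOR UPDATE SKIP LOCKED is one way to keep two concurrent requests from picking the same row:

```python
import asyncpg

# Hypothetical schema: a tasks table with project_id and status columns.
LOCK_RANDOM_TASK_SQL = """
WITH candidate AS (
    SELECT id
    FROM tasks
    WHERE project_id = $1 AND status = 'READY'
    ORDER BY RANDOM()
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE tasks
SET status = 'LOCKED_FOR_MAPPING'
FROM candidate
WHERE tasks.id = candidate.id
RETURNING tasks.id
"""

async def lock_random_task(conn: asyncpg.Connection, project_id: int) -> int | None:
    # Selection and status update happen in one statement inside one transaction,
    # so there is no application-side window between steps (2), (3), and (4).
    async with conn.transaction():
        return await conn.fetchval(LOCK_RANDOM_TASK_SQL, project_id)
```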

Please note this is not a design. However, I believe that keeping to these principles is a good start for coming up with a decent design.


Thanks again!

We have incorporated your comments into our design for the following:

Tasking Manager

Improvements such as de-duplicating the task_history table, making the database design more event-driven, utilising raw SQL queries with CTEs, etc. will all definitely come in time, but after the current FastAPI / async db refactor we have undertaken.

Unfortunately, we have already factored the FastAPI / async db driver updates into our roadmap and only have a specific amount of developer time allocated to it. You make a valid point that refactoring the codebase prior to the migration would have been wise, but we are where we are now!

As for performance assessment, we are doing both performance and load testing of the resulting APIs after our initial refactor. Results will be published somewhere public once done :+1: (limited resources and timeline)

FMTM & DroneTM

As these projects are heavily inspired by the Tasking Manager schema and are a lot less well established, the refactoring has less risk involved.

We have already incorporated much of your feedback into the design - I’ll just reiterate how thankful we are for your input here, as it’s been very useful to reflect on! :smile: