[SIP-35] Proposal for Improving Superset’s Python Code Organization
See original GitHub issue[SIP] Proposal for Improving Superset’s Python Code Organization
Motivation
As I was in the weeds fixing a bunch of Pylint failures, Max and I started going back and forth on this PR: https://github.com/apache/incubator-superset/pull/8777, which we ultimately closed. The root cause of that was a lack of shared understanding on the best code structure for Superset going forward. Talking with others at Preset, I realized that the issue was larger than just a new contributor not understanding project practices. Without a shared understanding, we are lacking a cohesive approach towards refactoring Superset - we need a technical North Star for project structure.
Preset’s developers met and identified a number of pain points in the existing code base. Here they are, with a bit of color to make the meaning clear:
- Lack of separation of concerns
- There are a number of modules that have grown organically over time to truly massive size. Often these modules contain utility functions, classes of multiple types, and a broad array of other functionality. These modules are very hard to work with.
- Business logic is frequently intermixed with display logic, making code reuse more difficult.
- The model layer contains a large amount of business logic which has nothing to do with data storage.
- Dependency Confusion
- Circular dependency issues crop up regularly with the existing module design
- Many modules have implicit dependencies on other modules which requires a specific load order
- Lack of Composability
- There’s a substantial amount of app-specific code in shared modules
- Views are currently loaded in all contexts, which makes light-weight command line usage (for example) impossible
- Deep Class Inheritance
- In some instances, particularly where there is pre-existing inheritance in Flask-AppBuilder, the inheritance chain can be over 10 classes deep. This can make understanding some classes very challenging.
Proposed Change
In order to address these concerns, we’d like to propose a few guiding principles for Python code structure going forward:
- Prefer small, functionally-homogeneous modules with only classes and pure functions in module scope
- Separate business logic from display logic
- Keep the view layer lightweight and delegate to business logic objects
- To support this work, we’d like to recommend a Command pattern - more on that below
- Prefer thin ORM objects
- To support this work, we’d like to recommend introducing Data Access Objects or DAOs to Superset to hold complicated query logic. More on that below.
- Keep
__init__.py
files empty- This will help cut down on implicit dependencies
- Prefer composition over inheritance
- Collaborator objects can be simpler to understand than a deeply-nested class hierarchy.
- Organize modules function and back-end objects rather than products
- Prefer:
- Dashboard
- Chart
- Query
- Database
- Models
- api
- Avoid:
- SQL Lab
- Explore
- Prefer:
New patterns to introduce:
Command:
There are multiple patterns named Command, and the one we reference here is most similar to an Alembic migration. Commands perform a single cohesive business operation, have a very small public API, and either entirely succeed or entirely fail. In practice, this requires that database transaction behavior be controlled at the Command level. When possible, commands should be modeled such that they perform a single database transaction. In the case of failure, they raise an exception or return an error code, and it is the responsibility of the caller to translate that into a correct response for a user.
Example command:
# This is obviously not a full finished command. The point
# is to illustrate the structure of a command as
# encapsulating a business operation, and illustrate
# a likely public API.
import DashboardDAO from daos.dashboard
class MissingDashboardError(CommandError):
pass
class UpdateFailedError(CommandError):
pass
class InvalidDashboardError(CommandError):
pass
class UpdateDashboardCommand < Command:
def __init__(current_user, dashboard_id, dashboard_properties):
self.actor = current_user
self.dashboard_id = dashboard_id
self.dashboard_properties = dashboard_properties
def run():
self.dashboard = DashboardDAO.find_dashboard(actor, dashboard_id)
if self.dashboard IS NONE:
raise MissingDashboardError("Dashboard could not be found")
if !self._is_dashboard_valid():
raise InvalidDashboardError("Dashboard parameters are invalid.")
update_result = DashboardDAO.update(dashboard, dashboard_properties)
if update_result != True:
raise UpdateFailedError("Dashboard could not be updated.")
def _is_dashboard_valid():
# Check validity of dashboard data before save
DAO (Data Access Object):
A DAO in the context of Superset would be an object that would manage the lifecycle of SQLAlchemy models. Custom queries would be implemented in DAOs and then called from Command objects. In comparison to Command objects, DAOs have relatively broad public interfaces. DAOs should not depend on other DAOs to avoid circular dependencies. If results are needed cross-DAO, that should be orchestrated in the Command. Here’s a sample simplified DAO for illustrative purposes:
# This is not a fully finished DAO. The intention
# is to illustrate the public interface of a
# Data Access Object, and how it might interface
# with a SQLAlchemy model.
from superset import db
import DataAccessObject from daos.base
import Dashboard from models.dashboard
class DashboardDAO(DataAccessObject):
@staticmethod
def find_for_update(actor, dashboard_id):
dashboard = db.session.query(Dashboard).filter(id=dashboard_id).one_or_none()
if dashboard
if actor in dashboard.owners:
return dashboard
else
return None
else
return None
@staticmethod
def update(dashboard, dashboard_params):
for key, value in dashboard_params.iteritems():
setattr(dashboard, key, value)
try:
db.session.merge(dashboard)
db.session.commit()
except SQLAlchemyError as e:
logger.error('Failed to update dashboard: ?', e.message)
db.session.rollback()
return False
return True
Proposed example package structure that follows the above principles:
superset/dashboards/api.py # Contains FAB and Flask endpoints related to the public API
superset/dashboards/schemas.py # Contains marshmallow schemas used by both the api and view endpoints
superset/dashboards/boot.py # Contains initialization and configuration code for API endpoints, invoked by `create_app()`
superset/dashboards/views.py # Contains FAB and Flask endpoints related to page-refresh views.
superset/dashboards/dao.py
superset/dashboards/commands/add_user.py
superset/dashboards/commands/create.py
superset/dashboards/helpers.py
superset/charts/api.py
superset/charts/daos.py
superset/charts/boot.py
superset/charts/schemas.py
superset/charts/commands/add_to_dashboard.py
superset/charts/commands/create.py
superset/charts/helpers.py
superset/models/chart.py # Models are thin SQLAlchemy objects - they do not implement custom queries or filters. That functionality will be found in the DAO, which manipulates the underlying SQLAlchemy model.
superset/models/dashboard.py
In this design, all systems related to a specific back-end resource have been grouped under a top-level folder. __init__.py
files should be left empty to enable only pulling in the portions of the system necessary for a specific entrypoint (Celery shouldn’t need api.py
or views.py
for instance)
New or Changed Public Interfaces
Over time, the internals of Superset will evolve towards the new structure. Public HTTP interfaces will not be likely to change as a result of the above proposal, but code will move and alter to conform. This will impact organizations that apply their own customizations to Superset.
New dependencies
None
Migration Plan and Compatibility
Introduce refactors to existing code at a manageable pace to allow organizations relying on Superset internals time to adapt.
Rejected Alternatives
Preset discussed Service Objects as an alternative to both Commands and DAOs. We felt that Commands provided easier entrypoints for the ports of our application (API endpoints, views, command line invocations, Celery tasks) than Service Objects, and that introducing DAOs as well helped further break down concerns.
We also considered structuring top-level folders by function (api, models, etc.) but found this resulted in drastically more Python modules overall without substantially simplifying the question of where code should live.
Individuals consulted in creating this SIP
@mistercrunch @craig-rueda @dpgaspar @robdiciuccio @suddjian @nytai
Issue Analytics
- State:
- Created 4 years ago
- Reactions:11
- Comments:17 (15 by maintainers)
Top GitHub Comments
I agree with many issues/concerns raised in this SIP though I’m not overly familiar with commands and DAOs. Are there examples of other apps (preferably Flask) which use these constructs?
Note at Airbnb we have a number of Flask apps which follow best practices, i.e., are modular, have empty
__init__.py
files, use Flask blueprints, etc. which are fairly easy to understand and structured in a way that result in no circular dependencies.I’m all for refactoring the Python codebase (leaning more on blueprints), though I’m unsure whether need to restructure the logic at this time to leverage commands and DAOs. A full restructure (as opposed to a refactor) seems like a considerable amount of work and may be somewhat overkill to address the concerns raised in this SIP.
I really this we need to untangle the
app
from the various components. This is the partially to blame for the spaghetti code, circular dependency issues we’re currently facing.All these modules, components, etc. need to be app agnostic and should be implemented as blueprints. If there are config or similar aspects these modules need leverage one can use the
current_app
proxy. The loading and registering (binding) of blueprints should occur withinapp.py
.