question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[SIP-35] Proposal for Improving Superset’s Python Code Organization

See original GitHub issue

[SIP] Proposal for Improving Superset’s Python Code Organization

Motivation

As I was in the weeds fixing a bunch of Pylint failures, Max and I started going back and forth on this PR: https://github.com/apache/incubator-superset/pull/8777, which we ultimately closed. The root cause of that was a lack of shared understanding on the best code structure for Superset going forward. Talking with others at Preset, I realized that the issue was larger than just a new contributor not understanding project practices. Without a shared understanding, we are lacking a cohesive approach towards refactoring Superset - we need a technical North Star for project structure.

Preset’s developers met and identified a number of pain points in the existing code base. Here they are, with a bit of color to make the meaning clear:

  • Lack of separation of concerns
    • There are a number of modules that have grown organically over time to truly massive size. Often these modules contain utility functions, classes of multiple types, and a broad array of other functionality. These modules are very hard to work with.
    • Business logic is frequently intermixed with display logic, making code reuse more difficult.
    • The model layer contains a large amount of business logic which has nothing to do with data storage.
  • Dependency Confusion
    • Circular dependency issues crop up regularly with the existing module design
    • Many modules have implicit dependencies on other modules which requires a specific load order
  • Lack of Composability
    • There’s a substantial amount of app-specific code in shared modules
    • Views are currently loaded in all contexts, which makes light-weight command line usage (for example) impossible
  • Deep Class Inheritance
    • In some instances, particularly where there is pre-existing inheritance in Flask-AppBuilder, the inheritance chain can be over 10 classes deep. This can make understanding some classes very challenging.

Proposed Change

In order to address these concerns, we’d like to propose a few guiding principles for Python code structure going forward:

  • Prefer small, functionally-homogeneous modules with only classes and pure functions in module scope
  • Separate business logic from display logic
    • Keep the view layer lightweight and delegate to business logic objects
    • To support this work, we’d like to recommend a Command pattern - more on that below
  • Prefer thin ORM objects
    • To support this work, we’d like to recommend introducing Data Access Objects or DAOs to Superset to hold complicated query logic. More on that below.
  • Keep __init__.py files empty
    • This will help cut down on implicit dependencies
  • Prefer composition over inheritance
    • Collaborator objects can be simpler to understand than a deeply-nested class hierarchy.
  • Organize modules function and back-end objects rather than products
    • Prefer:
      • Dashboard
      • Chart
      • Query
      • Database
      • Models
      • api
    • Avoid:
      • SQL Lab
      • Explore

New patterns to introduce:

Command:

There are multiple patterns named Command, and the one we reference here is most similar to an Alembic migration. Commands perform a single cohesive business operation, have a very small public API, and either entirely succeed or entirely fail. In practice, this requires that database transaction behavior be controlled at the Command level. When possible, commands should be modeled such that they perform a single database transaction. In the case of failure, they raise an exception or return an error code, and it is the responsibility of the caller to translate that into a correct response for a user.

Example command:

# This is obviously not a full finished command. The point
# is to illustrate the structure of a command as
# encapsulating a business operation, and illustrate
# a likely public API.

import DashboardDAO from daos.dashboard

class MissingDashboardError(CommandError):
  pass

class UpdateFailedError(CommandError):
  pass

class InvalidDashboardError(CommandError):
  pass

class UpdateDashboardCommand < Command:
  def __init__(current_user, dashboard_id, dashboard_properties):
    self.actor = current_user
    self.dashboard_id = dashboard_id
    self.dashboard_properties = dashboard_properties
    
  def run():
    self.dashboard = DashboardDAO.find_dashboard(actor, dashboard_id)
    if self.dashboard IS NONE:
      raise MissingDashboardError("Dashboard could not be found")
      
    if !self._is_dashboard_valid():
      raise InvalidDashboardError("Dashboard parameters are invalid.")
    
    update_result = DashboardDAO.update(dashboard, dashboard_properties)
    
    if update_result != True:
      raise UpdateFailedError("Dashboard could not be updated.")
      
  
  def _is_dashboard_valid():
    # Check validity of dashboard data before save
DAO (Data Access Object):

A DAO in the context of Superset would be an object that would manage the lifecycle of SQLAlchemy models. Custom queries would be implemented in DAOs and then called from Command objects. In comparison to Command objects, DAOs have relatively broad public interfaces. DAOs should not depend on other DAOs to avoid circular dependencies. If results are needed cross-DAO, that should be orchestrated in the Command. Here’s a sample simplified DAO for illustrative purposes:

# This is not a fully finished DAO. The intention
# is to illustrate the public interface of a 
# Data Access Object, and how it might interface
# with a SQLAlchemy model.

from superset import db
import DataAccessObject from daos.base
import Dashboard from models.dashboard

class DashboardDAO(DataAccessObject):
  
  @staticmethod
  def find_for_update(actor, dashboard_id):
    dashboard = db.session.query(Dashboard).filter(id=dashboard_id).one_or_none()
    if dashboard
      if actor in dashboard.owners:
        return dashboard
      else
        return None
    else
      return None
  
  @staticmethod
  def update(dashboard, dashboard_params):
    for key, value in dashboard_params.iteritems():
      setattr(dashboard, key, value)
    try:
      db.session.merge(dashboard)
      db.session.commit()
    except SQLAlchemyError as e:
      logger.error('Failed to update dashboard: ?', e.message)
      db.session.rollback()
      return False
      
    return True

Proposed example package structure that follows the above principles:

superset/dashboards/api.py # Contains FAB and Flask endpoints related to the public API
superset/dashboards/schemas.py # Contains marshmallow schemas used by both the api and view endpoints
superset/dashboards/boot.py # Contains initialization and configuration code for API endpoints, invoked by `create_app()`
superset/dashboards/views.py # Contains FAB and Flask endpoints related to page-refresh views.
superset/dashboards/dao.py
superset/dashboards/commands/add_user.py
superset/dashboards/commands/create.py
superset/dashboards/helpers.py
superset/charts/api.py
superset/charts/daos.py
superset/charts/boot.py
superset/charts/schemas.py
superset/charts/commands/add_to_dashboard.py
superset/charts/commands/create.py
superset/charts/helpers.py
superset/models/chart.py # Models are thin SQLAlchemy objects - they do not implement custom queries or filters. That functionality will be found in the DAO, which manipulates the underlying SQLAlchemy model.
superset/models/dashboard.py

In this design, all systems related to a specific back-end resource have been grouped under a top-level folder. __init__.py files should be left empty to enable only pulling in the portions of the system necessary for a specific entrypoint (Celery shouldn’t need api.py or views.py for instance)

New or Changed Public Interfaces

Over time, the internals of Superset will evolve towards the new structure. Public HTTP interfaces will not be likely to change as a result of the above proposal, but code will move and alter to conform. This will impact organizations that apply their own customizations to Superset.

New dependencies

None

Migration Plan and Compatibility

Introduce refactors to existing code at a manageable pace to allow organizations relying on Superset internals time to adapt.

Rejected Alternatives

Preset discussed Service Objects as an alternative to both Commands and DAOs. We felt that Commands provided easier entrypoints for the ports of our application (API endpoints, views, command line invocations, Celery tasks) than Service Objects, and that introducing DAOs as well helped further break down concerns.

We also considered structuring top-level folders by function (api, models, etc.) but found this resulted in drastically more Python modules overall without substantially simplifying the question of where code should live.

Individuals consulted in creating this SIP

@mistercrunch @craig-rueda @dpgaspar @robdiciuccio @suddjian @nytai

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Reactions:11
  • Comments:17 (15 by maintainers)

github_iconTop GitHub Comments

3reactions
john-bodleycommented, Feb 14, 2020

I agree with many issues/concerns raised in this SIP though I’m not overly familiar with commands and DAOs. Are there examples of other apps (preferably Flask) which use these constructs?

Note at Airbnb we have a number of Flask apps which follow best practices, i.e., are modular, have empty __init__.py files, use Flask blueprints, etc. which are fairly easy to understand and structured in a way that result in no circular dependencies.

I’m all for refactoring the Python codebase (leaning more on blueprints), though I’m unsure whether need to restructure the logic at this time to leverage commands and DAOs. A full restructure (as opposed to a refactor) seems like a considerable amount of work and may be somewhat overkill to address the concerns raised in this SIP.

2reactions
john-bodleycommented, Feb 15, 2020

I really this we need to untangle the app from the various components. This is the partially to blame for the spaghetti code, circular dependency issues we’re currently facing.

All these modules, components, etc. need to be app agnostic and should be implemented as blueprints. If there are config or similar aspects these modules need leverage one can use the current_app proxy. The loading and registering (binding) of blueprints should occur within app.py.

Read more comments on GitHub >

github_iconTop Results From Across the Web

[SIP-35] Proposal for Improving Superset's Python Code ...
[SIP] Proposal for Improving Superset's Python Code Organization ... we'd like to propose a few guiding principles for Python code structure ...
Read more >
Comprehensive Tutorial for Contributing Code to Apache ...
This tutorial post will cover all of the steps needed to make your first code contribution to the Apache Superset project.
Read more >
Advanced Superset for Engineers API's, Version ... - YouTube
Presented by Srini Kadamati at Airflow Summit 2021.Apache Superset is a modern, open- source data exploration & visualization platform ...
Read more >
Superset - Confluence Mobile - Apache Software Foundation
Proposal. Superset is business intelligence (BI) software that helps modern organizations visualize and interact with their data.
Read more >
PEP 474 – Creating forge.python.org
Proposal. This PEP proposes that an instance of the self-hosted Kallithea code repository management system be deployed as “forge.python.org”.
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found