Maintaining Oppia's Core Data with Backend Validation Checks
Introduction
Background
Oppia’s data storage is frequently updated to support the most recent features. To ensure a smooth user and developer experience, we need to have checks in place which ensure the integrity of the data currently being stored, as well as the integrity of future data.
These checks should exist in both the frontend and the backend. Oppia’s users interact with Oppia’s data in the GUI (frontend), so we need frontend validation checks that stop the user from inputting anything invalid. In the case where the user does manage to do something bad, the inputs get routed to our Python backend, and that’s where we need backend validation checks. These backend validation checks stop the bad inputs from reaching Oppia’s data storage and are the final line of defense.
This starter task focuses on the backend half of the validation checks.
Getting Started
Review the example and instructions below on how to add a backend validation check. Then leave a message on the issue thread asking to be assigned to a validation check. We’ll assign you to the check by adding your username next to the item. If you have any questions, send a message on the issue thread, and we’ll help you out!
Example
Context
Let’s say we want to guarantee that all explorations (lessons) have titles with no more than 36 characters. We previously did not have a limit like this, but want to add it so that titles display nicely on Android phones. Since we did not have this check before, there may be explorations in Oppia’s data storage that have titles with more than 36 characters.
Phase 1
We need to figure out which explorations (if any) violate this new validation! We do this with a Beam Job. Beam Jobs audit (search through) Oppia’s data storage and find data that violates the validation check that we want to add. After the Beam Job is run, if we are lucky and find that no exploration titles violate the check, then we can move on to Phase 2. Otherwise, we need to figure out what to do with those violating explorations. Maybe we need to raise the limit from 36 to 72? Maybe we need to cut off all of the characters after the 36th? That decision is up to you to discuss with other contributors!
Phase 2
Once we know that no exploration titles that are already stored have more than 36 characters, we need to add a backend validation check to Oppia’s code that will stop any new explorations from having violating titles. Recently, we added a frontend validation check in the UI that should hopefully stop the user from creating any invalid exploration titles. However, since we want to be extra secure, we still add backend validation checks as the last line of defense. That backend validation check is Phase 2.
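In code terms, both phases revolve around the same rule. Here is a minimal sketch, assuming the 36-character limit from this example (the constant and function names are just illustrative, not Oppia's actual code):

```python
# The invariant for this example: an exploration title is valid only if it is
# at most 36 characters long. Phase 1 audits stored data for violations of
# this rule; Phase 2 rejects any new data that would violate it.
MAX_TITLE_LENGTH = 36


def exploration_title_is_valid(title: str) -> bool:
    return len(title) <= MAX_TITLE_LENGTH
```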
How to Add Backend Validation Checks for Core Models
Phase 1: Write a Beam Job
Idea: Write a script that audits all of Oppia’s existing data and finds any data that does not follow the backend validation check that we want to add. Sample PR: #14343. NOTE: don’t make separate PRs for Phase 1 and Phase 2; just modify the same PR.
- Firstly, confirm that we don’t already have a backend validation implemented for this check. You can find the relevant files in the `core/domain` folder. (For example, we would look in `exp_domain.py` for the example above.) If a backend validation already exists, then we probably don’t need a Beam Job, and you can take up another available check.
- Write the validation job, following the documentation on Beam Jobs (a minimal sketch of such a job appears after this list). Your Beam Job file should be `core.jobs.batch_jobs.<model_type>_validation_jobs.py`.
- Test your Beam Job locally as a Release Coordinator (see instructions).
- Run the Beam Job unit tests: `python -m scripts.run_backend_tests --test_target=core.jobs.batch_jobs.<model_type>_validation_jobs_test`, and ensure they all pass.
- Create a PR and wait for the review!
- After you’ve been given the OK on the PR, submit a request for your Beam Job to be tested on a production server using this form. Your Beam Job is an audit job. You can optionally read more about this here. For an example of what the Beam job instructions look like, see this Google Doc.
- Wait for an Oppia admin to send you the results of your Beam Job.
- If you receive errors, do the following:
- Check whether any of the errors correspond to the curated lessons; if so, record them in this spreadsheet. (You’ll find a list of curated exploration IDs in the spreadsheet.)
- Update the job tracker spreadsheet to keep track of the Beam job results and decisions.
Note: We follow a particular template for our Beam job results; make sure to use this template so that it is easier for us to keep track of the results.
Template for Beam job errors: `The id of {{tag}} is {{id}} and its {{field that's being validated}} is {{current data}}`
Example: `The id of exploration is 10 and its category is Test`
(reference)
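As a rough illustration of what a Phase 1 audit job boils down to, here is a minimal sketch using plain Apache Beam. This is not Oppia's actual job framework (real jobs subclass the base job classes described in the Beam Jobs wiki and read models through the framework's own I/O helpers); the 36-character limit comes from the title example above, and the data and helper names are hypothetical stand-ins.

```python
# Illustrative sketch only: Oppia's real validation jobs are built on its own
# Beam job framework (see the Beam Jobs wiki page) and read models from the
# datastore; this uses plain Apache Beam with an in-memory list so the shape
# of an audit job is easy to see.
import apache_beam as beam

MAX_TITLE_LENGTH = 36  # Hypothetical constant for the title-length example.


def report_error(exp):
    # Formats the error using the results template from the note above.
    return 'The id of exploration is %s and its title is %s' % (
        exp['id'], exp['title'])


with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        # A real job would read exploration models from the datastore here.
        | beam.Create([
            {'id': '10', 'title': 'Short title'},
            {'id': '11', 'title': 'This title is far too long to display nicely on Android'},
        ])
        # Keep only the explorations that violate the proposed check.
        | beam.Filter(lambda exp: len(exp['title']) > MAX_TITLE_LENGTH)
        | beam.Map(report_error)
        | beam.Map(print)
    )
```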
Phase 2: Add the Backend Validation Check
Idea: Add a check that stops any invalid data from entering Oppia’s storage in the future. Sample PR: #14962. NOTE: don’t make separate PRs for Phase 1 and Phase 2; just modify the same PR.
- Add a backend validation to guarantee that no new data violates the issue we are working on. In our case, there should be no exploration with a title whose length is greater than 36 characters.
- You will be adding the backend validation in the `validate()` method of the relevant object class in the domain files (see the `core/domain/` folder); a minimal sketch appears after this list. This layer validates the data before finally storing it. In the example above, we will be adding the validation in `exp_domain.py`.
- Find the appropriate class which contains the field for which you will be applying the validation. In our case, it will be `class Exploration`, since that class contains the `title` field. Add your validation check to that class, and raise a validation error in case of violation.
- Add a test in the test file associated with the domain file. In our case, it’s the `exp_domain_test.py` file.
- Now, after implementing the backend validation, we need to conduct a small investigation to check whether our changes break anything. You can take a reference here. If some errors occur while doing this, make sure to add a frontend validation which handles your validation error, so that the user can fix the error before it reaches the backend.
- Once you’re done with the above, raise the backend validation PR and you are good to go!
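For orientation, here is a minimal sketch of what such a Phase 2 check and its unit test might look like. The class, constant, and error names are simplified stand-ins (the real `Exploration.validate()` lives in `core/domain/exp_domain.py`, contains many more checks, and raises Oppia's own validation error type), so treat this as an outline rather than the actual implementation.

```python
# Simplified sketch of a domain-layer validation and its unit test; names
# below are hypothetical stand-ins for the real Oppia classes.
import unittest

MAX_TITLE_LENGTH = 36  # Hypothetical constant for the title-length example.


class ValidationError(Exception):
    """Stand-in for Oppia's validation error class."""


class Exploration:
    """Greatly reduced stand-in for the real domain class."""

    def __init__(self, title):
        self.title = title

    def validate(self):
        # The Phase 2 check: reject titles longer than the allowed maximum.
        if len(self.title) > MAX_TITLE_LENGTH:
            raise ValidationError(
                'Exploration title should be at most %d characters, '
                'received: %s' % (MAX_TITLE_LENGTH, self.title))


class ExplorationValidationTest(unittest.TestCase):
    """In Oppia, a test like this would go into exp_domain_test.py."""

    def test_overlong_title_is_rejected(self):
        with self.assertRaises(ValidationError):
            Exploration('x' * (MAX_TITLE_LENGTH + 1)).validate()

    def test_short_title_is_accepted(self):
        Exploration('A valid title').validate()


if __name__ == '__main__':
    unittest.main()
```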
Validation Checks
Available Checks
In order of priority:
🏷️ General State Validation (for `Question`)
- `labelled_as_correct` should not be True if destination ID is (try again). 🏷️ `Outcome`
- The answer group should have at least one rule spec. 🏷️ `AnswerGroup`
- The default outcome should have a valid destination node. 🏷️ `DefaultOutcome`
- Answer specified in interaction should actually be a correct answer. 🏷️ `Solution`
- `destination_id` should be non-empty and match the ID of a state in the exploration. 🏷️ `Outcome`
🏷️ Core Model Validation
- `AnswerGroup`’s tagged `skill misconception IDs` should be a list of misconception IDs attached to one of the skills pointed to by the question’s `linked skill IDs`. 🏷️ `Question`: `Question.state`
- Exploration’s `title`, `category`, `objective`, `language_code`, and `tags` should all match those of the corresponding exploration summary (can use `get_exploration_summary_by_id` to find the corresponding exploration summary). 🏷️ `Exploration` and `ExplorationSummary`
- Question summary’s `interaction_id` should be a valid ID, should match the `interaction_id` of the corresponding Question’s `InteractionInstance`, and should be contained within the list of Android-allowed interactions excluding `Continue` and `EndExploration`. 🏷️ `Question` and `QuestionSummary`
- `inapplicable_skill_misconception_ids` should be a (not necessarily strict) subset of the optional misconceptions associated with the linked skills. `inapplicable_skill_misconception_ids` should not intersect with the tagged skill misconception IDs for the answer groups, but their union should be all of the misconception IDs of all the linked skills. 🏷️ `Question` and `Skill`
- Question summary’s `misconception_ids` should be the union of all `misconception_ids` for all of the corresponding question’s linked skills (can use `get_question_by_id` and `get_skill_by_id`). 🏷️ `Question` and `QuestionSummary` and `Skill`
🏷️ Low Priority
- Subtopic `skillIds` should be a list of unique strings where each string represents an existing `skill_id`. 🏷️ `Topic`
- Topic `canonical_name` should be the lowercase version of the `topic_name`. 🏷️ `Topic` <-- only needs backend validation
- Topic `practice_tab_is_displayed` is `true` only when there are at least `10` practice questions in the topic. 🏷️ `Topic` and `Question`
- Story `corresponding_topic_id` should be valid, and that topic should contain this story. 🏷️ `Topic` and `Story`
Claimed Checks
🏷️ Curated Lessons (lessons in a topic) @soumyo123-prog
- State classifier `model_id` should be `None` for curated lessons. 🏷️ `State`
- Outcome `param_changes` should be empty for curated lessons. 🏷️ `Outcome`
- Outcome `refresher_exploration_id` should be `None` for curated lessons. 🏷️ `Outcome`
- Outcome `missing_prerequisite_skill_id` should be `None` or the ID of a skill. 🏷️ `Outcome`
- Exploration `param_specs` and `param_changes` should be empty for curated lessons. 🏷️ `Exploration`
- Training data should be empty for curated lessons. 🏷️ `AnswerGroup`
🏷️ General State Validation (for `Exploration`) @lkbhitesh07
- `labelled_as_correct` should not be True if destination ID is (try again). 🏷️ `Outcome`
- The answer group should have at least one rule spec. 🏷️ `AnswerGroup`
- The default outcome should have a valid destination node. 🏷️ `DefaultOutcome`
- Answer specified in interaction should actually be a correct answer. 🏷️ `Solution`
- `destination_id` should be non-empty and match the ID of a state in the exploration. 🏷️ `Outcome`
🏷️ Core Model Validation
- Exploration title should have a max length of 36. 🏷️ `Exploration` @lkbhitesh07
- Exploration tags should be a list of at most 10 non-empty strings without duplicates, where each tag has a max length of 30. 🏷️ `Exploration` @sahiljoster32 #15086
- `AnswerGroup.tagged_skill_misconception_id` should be `None`. 🏷️ `Exploration`: `Exploration.state` @lkbhitesh07
🏷️ Low Priority
- Rubric explanations should be a list of at most 10 strings of 300 characters each. 🏷️ `Skill` @soumyo123-prog #15173
- Chapter `thumbnail` should have background color of `#F8BF74`, `#D68F78`, `#8EBBB6`, or `#B3D8F1`. 🏷️ `Story` @soumyo123-prog
- Story `notes` should have at most `5000` characters. 🏷️ `Story` @gopivaibhav #15324
- `story_is_published` should be a boolean. 🏷️ `Topic` @gopivaibhav <-- only needs backend validation
Completed Checks
- Exploration user rights (`owner_ids`, `editor_ids`, `voice_artist_ids`, `viewer_ids`) should not have any user IDs in common. @EricZLou
- Story description should have at most 1000 characters. @soumyo123-prog #15038
- Subtopic `thumbnail` should have background color of `#FFFFFF`. @Lawful2002
- `Misconception ID` should be an integer >= 0. @Lawful2002 #15039
- Topic `abbreviated_name` should have at most `39` characters. @Lawful2002 #15094
- Question state data schema version should be >= 27. @sahiljoster32 #15264
- Topic `page_title_fragment_for_web` should be non-empty, with min-length `5` and max-length `50`.
- There must be at least one explanation for the `Medium rubric`. @lkbhitesh07 #15235
- Exploration `scaled_average_rating` should be a non-negative float between 0 and 5, inclusive. 🏷️ `Exploration` @Lawful2002 #14995
- Subtopic `url fragment` should be non-empty and match the RegEx `^[a-z]+(-[a-z]+)*$` with at most 25 characters. 🏷️ `Topic` @Lawful2002 #15500
- Story `thumbnail` should have background color of `#F8BF74`, `#D68F78`, `#8EBBB6`, or `#B3D8F1`. 🏷️ `Story` @soumyo123-prog #15137
- Exploration `category` should be one of the fixed list of categories defined by `ALL_CATEGORIES` in `constants.ts`. 🏷️ `Exploration` @Lawful2002 #15342
Top GitHub Comments
@lkbhitesh07 Thank you for asking. I am working on some other issues involving discussion docs that need to be fixed quickly; I will start working on this task soon.
@gopivaibhav Thanks for the reply and for clearing the ambiguity!!