Forking into the future

Albert Du albertdu@cs.washington.edu

The Fork/Join parallelism autograder lags behind the rest. How can we fix that?

Introduction

CSE 332 at the University of Washington (UW) is now in its 16th year (since 10sp, when it replaced CSE 326) as the required Data Structures class for the university's Computer Science and Computer Engineering majors. In those ~40 quarters, CSE 332, "Data Structures and Parallelism," has incorporated a 3-week foray into the world of shared-memory parallelism and concurrency [1] [2].

As of spring 2026, CSE 332 dedicates 2 (of 13) "exercises" to parallelism and concurrency: one exercise for parallelism, and one for concurrency. Concurrency is still a "pen and paper" assignment with relatively similar mechanics to the original 332. The parallelism exercise has students implement 3 different parallel algorithms with Java's ForkJoin framework (java.util.concurrent), centered around divide and conquer algorithms [2]. The ForkJoin framework has been used since the very first quarter of CSE 332, then offered under the title "Data Abstractions" [3].

It took 6 quarters for the first 250 CSE students to take 332 [1]. We had over 260 students in winter '26. With larger class sizes (and the other challenges of teaching today), exercises must keep up.

Parallelism exercise(s) since 24su

Focusing on the course after the summer 2024 exercise reshuffle, the parallelism exercise is typically assigned around the 6th week of the quarter.

It was known as Ex9 in 24au, 25wi, 25au, and 26wi; as Ex10 in 24su, 25su, and 26sp; and it was split between Ex10 and Ex11 in 25sp.

Students submit a .java file for each part:

  1. A reduction problem (dot product)
  2. A mapping problem (matrix multiply)
  3. Parallel packing (remove empty strings; hyperplanar coordinate slicing in 26wi and 26sp)
  4. Parallel suffix (only in 25sp)

Observations (as a student and as a teaching assistant)

For better or worse, 332 students are first concerned with passing the autograder. Our autograders are comprehensive (but far from perfect). Most runtime requirements in 332 are fulfilled naturally by using the instructed data structures. Students usually don't look too closely at their own code once they're finished.

For other assignments, TAs usually trust the autograder to catch most hard-to-notice mistakes. Manual inspection in 332 usually reduces to checking runtime requirements (and the extra import here and there), but not much else. It is not common to deduct points when grading "manual inspection."

ForkJoin presents many new challenges for grading, key among them span. Before ForkJoin, it was hard to transpose two lines of code and lose manual verification points. Consider the following two pieces of code; one is correct, the other has a critical issue:

// correct
SomeRecursiveAction left = new SomeRecursiveAction(a, b, lo, mid);
left.fork();
SomeRecursiveAction right = new SomeRecursiveAction(a, b, mid, hi);
right.compute();
left.join();

// incorrect
SomeRecursiveAction left = new SomeRecursiveAction(a, b, lo, mid);
left.fork();
SomeRecursiveAction right = new SomeRecursiveAction(a, b, mid, hi);
left.join();
right.compute();

The bottom code is incorrect because the left task is joined before the right task starts. This reduces to sequential execution, eliminating all ForkJoin benefit.

This case is relatively easy to catch, but the autograder has no mechanism to catch it. A simple autograder can only check the output of the program, not its runtime efficiency. TAs can usually catch these bugs, but not always reliably.

I've personally known good students (on top of it, submitting everything early) who took the class with me and lost points on an assignment for code they wrote nearly a month earlier. Assuming the assignment is out for 10 days, with 2 days of grace and a 7-day grading turnaround, a student who completes the assignment on day one waits three weeks (assuming they check their grade immediately) to learn they've been writing unparallelized code.

The following logic for a base case cutoff is nearly correct, and was thought to be correct for over a year.

//...
if (top - bottom <= CUTOFF || right - left <= CUTOFF) { // base case
    //...
} else { // recursion
//...

This was submitted by a student who is now a TA (of 6+ quarters), and is reproduced with permission.

This code looks correct; to human eyes it looks like basically every other submission, but it's fundamentally flawed. This snippet's base case is trivially true, so there was no dividing, and no conquering.

At least 4 TAs (myself, our infrastructure team, plus the original grader) and the course's professor looked at this snippet, and all agreed it was probably correct. Our new autograder (which we at first assumed was mistaken) contested that.

For students who don't catch these and similar issues when submitting their homework, losing points at manual verification time is frustrating and confusing. For students whose mistakes also slip past TA grading, losing points on an exam for following a "correct" pattern hurts even more.

The DAG

In 25au we set out to overhaul the ForkJoin autograder.

Reliable performance testing is a challenge. Profiling or timing code has been attempted in the past, notably in the hashing autograder, but with ForkJoin the speedup is harder to see. The code students write isn't massively faster than a sequential version; some quick math puts memory bandwidth as the main limitation for dot product beyond roughly 2x speedup. In an autograding setting, restricted to the virtual machine instances Gradescope allocates us, performance is even less predictable. Hashing stress tests work because an O(1) vs. O(n) runtime makes a meaningful difference; measuring sub-second differences in running time, amid JVM JIT and GC noise, does not.

Instead of setting a performance target, we can look for structure. One can represent a parallel computation as a directed acyclic graph (DAG) consisting of nodes and edges. Each node represents a constant amount of work. Each edge is a dependency from one piece of work to another. Forking creates a divergence in the graph; joining brings the paths back together. The critical path, or longest directed path through the graph, gives us the span, the bottleneck for speeding up with multiple processors [4]. Any 332 student can tell you that much.
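In the standard work/span notation (this is the usual formulation from the course material, not anything specific to our autograder), with T_1 the total work, T_infinity the span, and P processors:

    T_P \ \ge\ \max\!\left(\frac{T_1}{P},\ T_\infty\right), \qquad \text{speedup} \ =\ \frac{T_1}{T_P} \ \le\ \frac{T_1}{T_\infty}

No matter how many processors we add, the span bounds the achievable speedup, which is exactly the property we want to check.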

Around the same time as the first offering of 332, the Cilkview Scalability Analyzer introduced the notion of a burdened DAG to diagnose insufficient parallelization [5]. Cilk and Cilk++ are dialects of C and C++ that enable parallel fork-join code [6]. We decided to stick to a simpler DAG model for the 332 autograder. Cilkview proved to us that such an analyzer was possible, but Cilk was designed from the ground up for parallelization; we would have to work with existing Java compilers. Cilkview offers a tangible way to benchmark and optimize performance, but we were more concerned with producing a deterministic result than with pure speed. In fact, the final design for the new 332 autograder is entirely single threaded.

Graphing code

The DAG representation is nice for theory, but we don't immediately have access to that structure in ForkJoin Java code. So, we set out to create the DAG ourselves.

We considered shipping alternative task and pool classes to students, but instead opted to keep the assignment structures identical to past iterations. We believe such a similar assignment delivers more "real-world" value to students.

All we really need to do is record when a task starts, when it forks a new task, when it joins that task, and when it computes another task directly. Simple, or maybe not so simple.

The plan was quite elegant: instead of java.util.concurrent.RecursiveAction, you were now calling edu.washington.cse332.concurrent.RecursiveAction. None of the actual parallelism, with all of the (relevant) APIs. We rewrote several of the Java base library's classes (RecursiveAction, RecursiveTask, ForkJoinPool, etc.) to include a logging step before and after critical sections. Most of the logging falls out of plain object-oriented overriding; the exception is intercepting a call to a task's compute(), which took a bit more work. Ignoring that for now, a simple regex could swap all instances of java.util.concurrent for edu.washington.cse332.concurrent where appropriate.
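A minimal sketch of the idea (the real library records richer, structured events and mirrors more of the ForkJoin API; everything below apart from the package name and __impl_compute is illustrative):

package edu.washington.cse332.concurrent;

import java.util.ArrayList;
import java.util.List;

// Drop-in stand-in for java.util.concurrent.RecursiveAction that logs events
// instead of running anything in parallel (single threaded on purpose).
public abstract class RecursiveAction {
    // Simplified global event log; the real autograder stores structured events.
    public static final List<String> EVENTS = new ArrayList<>();

    private boolean computed = false;

    // The student's compute body, renamed by the rewriting script (see below).
    protected abstract void __impl_compute();

    // Every call to compute(), including the student's recursive ones,
    // now goes through this logging wrapper.
    public final void compute() {
        EVENTS.add("compute-start " + this);
        __impl_compute();
        computed = true;
        EVENTS.add("compute-end " + this);
    }

    // No thread is spawned; we only remember that the fork happened.
    public final void fork() {
        EVENTS.add("fork " + this);
    }

    // Joining a task that was never computed runs it now, sequentially.
    public final void join() {
        EVENTS.add("join " + this);
        if (!computed) {
            compute();
        }
    }
}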

Once the execution finished, we could take each event and combine them into a graph. That graph gave us all of the statistics and structure we needed for autograding, and converting it to the DOT language and rendering it makes for a useful visual for the students.
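As a rough illustration of the DOT step (the edge list here is a stand-in for whatever the event log actually yields; class and field names are made up):

import java.util.List;

// Toy DOT emitter: each recorded dependency becomes a directed edge.
// The real autograder attaches labels, statistics, and styling.
final class DotEmitter {
    record Edge(String from, String to) {}

    static String toDot(List<Edge> edges) {
        StringBuilder sb = new StringBuilder("digraph forkjoin {\n");
        for (Edge e : edges) {
            sb.append("  \"").append(e.from()).append("\" -> \"")
              .append(e.to()).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical fragment: a root task forks left, computes right, then joins.
        System.out.println(toDot(List.of(
                new Edge("root", "left"), new Edge("root", "right"),
                new Edge("left", "root-after-join"), new Edge("right", "root-after-join"))));
    }
}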

Our library code is available (under an MIT License) on GitHub.

Mutilating code

In order to build the complete graph, we need to know when one task's code calls another task's .compute(). The issue: we asked the student to write .compute(), which recursively calls itself. A few solutions were considered: bytecode rewriting, strange third-party technologies, and more. We settled on a script that rewrites the method declaration of .compute() to .__impl_compute(), and included a .compute() method in RecursiveTask and RecursiveAction that would, after recording a log entry, call the student-implemented .__impl_compute().
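A toy version of that rewriting step (the class name, the exact regex, and the one-file-at-a-time structure are ours for illustration; the real script handles comments, annotations, and stranger whitespace):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Rewrites a student file so the logging superclass can intercept compute().
public class ComputeRewriter {
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        String src = Files.readString(file);

        // Point the imports at the logging library instead of the JDK classes.
        src = src.replace("java.util.concurrent", "edu.washington.cse332.concurrent");

        // Rename only the *declaration* of compute() to __impl_compute().
        // Calls like right.compute() are deliberately left alone: they must go
        // through the logging compute() in the replacement superclass.
        src = Pattern.compile("(public|protected)\\s+(\\S+)\\s+compute\\s*\\(\\s*\\)")
                .matcher(src)
                .replaceAll("$1 $2 __impl_compute()");

        Files.writeString(file, src);
    }
}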

It did take several iterations to get the rewriter script working correctly, but we have not had to fix it at all in 26wi, and made only 2 fixes in 25au after the assignment's initial release. We have also developed an alternative based on an AST parser, but we have never attempted to use it in "production," as line numbers would no longer be stable.

How does one even get an image out of a textual autograder?

The graph visual is rendered with the Graphviz Linux package inside the autograder container. Rendering the image is relatively simple. Getting the image out of the autograder is hard.

Gradescope's autograder interface allows HTML text output (great!) but offers no mechanism for getting files out (uh oh). We have to encode the image as part of the HTML itself. SVG tags offered the best quality, and come natively out of Graphviz, but the Gradescope HTML sanitizer prevents us from rendering sufficiently complex SVG diagrams. That left us with base64 encoding, which hits similar issues with images above 100KB. The winning combination, which we continue to use, is to render an extremely cut-down graph (arrows and circles, no labels or fancy colors like in our initial tests), heavily compressed into a grayscale WEBP and base64 encoded.
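The embedding itself is the easy part. A sketch of that last step (the file name and the exact size check are illustrative), assuming the cut-down WEBP has already been rendered:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Inline a rendered image into the HTML the autograder hands back to Gradescope.
public class InlineImage {
    public static void main(String[] args) throws Exception {
        byte[] webp = Files.readAllBytes(Path.of("graph.webp"));
        String b64 = Base64.getEncoder().encodeToString(webp);
        // Keep the payload small; large images run into the same sanitizer issues.
        if (b64.length() > 100_000) {
            throw new IllegalStateException("image too large to inline");
        }
        System.out.println("<img src=\"data:image/webp;base64," + b64 + "\"/>");
    }
}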

Our later experiments with tessellating larger images have not seen any success.

The pipeline (simplified):

graph TD
    A1[Run student code on test case] --> A
    A1 --> A2[Compare output against sequential implementation]

    A[Gather events from running code] --> B[Build a computational graph]
    B --> C[Encode as DOT language]
    C --> D[Render with Graphviz]
    D --> E[base64 encode]
    E --> F[Render HTML]
    F --> GR
    B --> G[Compute Span]
    G --> H[Compute Speedup]
    A --> I[Estimate cutoff usage from task count]
    I --> Z[Calculate points]
    H --> Z
    Z --> GR[Output to Gradescope]
    A2 --> Z

Testing

The most basic check we make is output correctness; this is unchanged.

We then check for "Thread Minimization." Students should write their compute bodies so that they recursively call compute exactly once (rather than making one extra fork and a join). This is done by counting calls to fork and compute, and ensuring that the correct fraction of these calls are computed directly. For the matrix multiplication part, this additionally shows whether the student divides into 2 rather than 4 cases.
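To give a feel for the arithmetic (the counts below assume a binary split, and the threshold is invented for this sketch; the real check uses the recorded events and per-problem tolerances):

// Illustrative thread-minimization check for a binary split.
// With L leaf tasks, a "fork one child, compute the other" solution creates
// 2L - 1 tasks and calls fork() L - 1 times, so about half the tasks are forked.
// Forking both children instead gives roughly 2(L - 1) forks: nearly one per task.
final class ThreadMinimizationCheck {
    static boolean looksMinimized(long tasksCreated, long forkCalls) {
        return forkCalls <= 0.6 * tasksCreated; // slack factor invented for this sketch
    }
}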

Next, we may check for "cutoff correctness." Technically, this is checking the number of RecursiveTask/RecursiveAction objects that are created. For a given test, we expect approximately the same recursive structure, and therefore the same number of objects created. This implicitly also tests for a correct recurrence relation, catching a chip-and-conquer algorithm, or one that doesn't divide at all.
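For intuition (binary split, sizes only approximate): with input size n and cutoff c,

    \#\text{tasks} \ \approx\ \lceil n/c \rceil \ \text{(leaves)} \ +\ \big(\lceil n/c \rceil - 1\big) \ \text{(internal)} \ =\ 2\lceil n/c \rceil - 1

so a trivially true base case shows up as a single task, and an algorithm that only chips one element off per call shows up as roughly n tasks; either way the count gives it away.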

A new rubric

We stuck with the same point distribution, designing the rubric around it.

In addition, our method of measuring the cutoff (counting the number of tasks) also helps verify divide-and-conquer runtime correctness: if subproblems are not divided properly, the task count will be higher than expected. Cutoff therefore implicitly tests for recursive runtime.

We designed our new autograding rubric to target the same point totals as the old one, now with points split between more test cases. The first few test cases just check output correctness. We then test for the other properties: cutoff correctness, thread minimization, and so on. Finally, for the last few tests, we add speedup checks: first for a little speedup, then for a proper logarithmic span.
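The span side of those checks reduces to a longest-path computation over the recorded DAG. A minimal memoized version (the adjacency representation and names are ours for illustration; the real graph nodes carry more information) looks roughly like:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Span as the longest path (counted in nodes) through an acyclic dependency graph.
final class SpanEstimate {
    // deps maps each node to the nodes it depends on (its predecessors).
    static int span(Map<String, List<String>> deps) {
        Map<String, Integer> memo = new HashMap<>();
        int best = 0;
        for (String node : deps.keySet()) {
            best = Math.max(best, longest(node, deps, memo));
        }
        return best;
    }

    private static int longest(String node, Map<String, List<String>> deps,
                               Map<String, Integer> memo) {
        Integer cached = memo.get(node);
        if (cached != null) return cached;
        int best = 0;
        for (String dep : deps.getOrDefault(node, List.of())) {
            best = Math.max(best, longest(dep, deps, memo));
        }
        memo.put(node, best + 1); // this node's constant chunk of work
        return best + 1;
    }
}

Dividing the total node count (the work) by this span gives an idealized speedup figure of the kind the rubric checks against.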

Looking at the past

We ran past submissions, along with TA- and machine-generated submissions, through the new autograder to evaluate the rubric's and autograder's effectiveness on a wide variety of optimal and non-optimal solutions. We checked the most extreme syntax cases we and past students could come up with: we adversarially added whitespace, removed whitespace, duplicated names, and more, trying to break the rewriting scripts.

The first year

25au

We planned, wrote, and deployed this autograder in the autumn quarter 2025.

After launch, we carefully monitored each submission as they came in to ensure everything worked as expected.

Some more creative code style decisions we hadn't seen before required a few minor modifications to the rewriting script. We compared all grades before and after each deployment to ensure no regression. Only 2 students in 25au (out of 200+) required a manual point adjustment (scoring 57/60). These students wrote code for the parallel pack operation that would skip later steps, and task creation, if the output would be empty anyway. We adjusted the autograder for subsequent quarters.

26wi

The next quarter, we continued to monitor each submission as they arrived.

Out of 260+ students, we had only one submission where we issued a 3-point correction. In this case, it was a more optimized matrix multiply that avoided creating DotProductTasks that would not parallelize.

Winter 2026 also saw the introduction of the SliceFilter problem and the retirement of FilterEmpty. This didn't meaningfully change the autograder; one filter operation is about the same as another.

The work and span material moved to after the exercise's due date this quarter, so all references to span were removed from the assignment specification.

We've had a couple (~3) cases where students had code that failed locally but passed the autograder. We have not been able to reproduce these cases reliably.

26sp

No changes are planned for the ForkJoin autograder going into spring 2026, apart from more flexible autograding time limits.

Acknowledgements

Special thanks to my fellow 25au infrastructure TAs Jacklyn Cui and Kabir Rajkotia, and our ever-supportive professor who empowered us to build this, Ruth Anderson.

Many thanks to the TAs of 25au and beyond who contributed sample submissions, stress tested the autograder, and fielded questions during office hours and section about what those strange graphics in Gradescope were.

Thank you to Nathan Brunelle for continuing to support the new forkjoin autograder and thank you to all of the students who have helped us test and iron out the final bugs.

References

[1] Dan Grossman and Ruth E. Anderson. 2012. Introducing parallelism and concurrency in the data structures course. In Proceedings of the 43rd ACM technical symposium on Computer Science Education (SIGCSE '12). Association for Computing Machinery, New York, NY, USA, 505–510. https://doi.org/10.1145/2157136.2157285

[2] Nathan Brunelle. 2026. CSE 332 - Tasks. University of Washington. Last updated May 11, 2026. Released under CC-BY-NC-SA 4.0. Retrieved May 12, 2026. https://courses.cs.washington.edu/courses/cse332/26sp/tasks.html

[3] Dan Grossman. 2010. CSE332: Data Abstractions. University of Washington. Retrieved May 11, 2026. https://courses.cs.washington.edu/courses/cse332/10sp/

[4] Dan Grossman. 2023. A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency. Version of August 14, 2023. Retrieved May 12, 2026. https://homes.cs.washington.edu/~djg/teachingMaterials/spac/sophomoricParallelismAndConcurrency.pdf

[5] Yuxiong He, Charles E. Leiserson, and William M. Leiserson. 2010. The Cilkview scalability analyzer. In Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures (SPAA '10). Association for Computing Machinery, New York, NY, USA, 145–156. https://doi.org/10.1145/1810479.1810509

[6] OpenCilk. 2026. OpenCilk. Retrieved May 12, 2026. https://www.opencilk.org/