From the Onsets and Frames paper:
"We also use the thresholded output of the onset detector during the inference process, similar to concurrent research described in [24]. An activation from the frame detector is only allowed to start a note if the onset detector agrees that an onset is present in that frame."
From the referenced paper [24]:
"Finally, we peak pick the two-channel activation matrix to convert the framewise piano roll to a list of note events. Per note, we step through each time frame and place an onset at positions where the articulation channel is above a set threshold, and then include all frames onward until the sustain channel is under another fixed threshold, at which point we output an offset. If a new articulation is found during an active note event we simply fragment it by outputting additional offsets and onsets."
Here, the articulation channel is the parallel piano-roll channel in which only the frames corresponding to note onsets are active, so it plays the role of our onset labels (onsets = articulations in the authors' lingo). The sustain channel would be our frame-level predictions, which correspond to the note-level frame labels.
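The two decoding rules quoted above can be sketched roughly as follows. This is a minimal illustration, not the code of either paper: the function names, the (frames, pitches) array layout, the 0.5 default thresholds, and the choice to treat a run of consecutive above-threshold onset frames as a single articulation are all assumptions made here.

```python
import numpy as np


def gate_frames_with_onsets(frame_probs, onset_probs,
                            frame_threshold=0.5, onset_threshold=0.5):
    """Onsets-and-Frames-style rule (sketch): a frame activation may only
    start a note if the onset detector is also active in that frame.

    Both inputs are assumed to have shape (num_frames, num_pitches).
    Returns a boolean piano roll of the same shape.
    """
    frames = np.asarray(frame_probs) > frame_threshold
    onsets = np.asarray(onset_probs) > onset_threshold
    gated = np.zeros_like(frames)
    for pitch in range(frames.shape[1]):
        active = False
        for t in range(frames.shape[0]):
            if not frames[t, pitch]:
                active = False          # frame detector is off: note ends
            elif onsets[t, pitch]:
                active = True           # onset agrees: note may start
            # otherwise keep the state: continue a note, but never start one
            gated[t, pitch] = active
    return gated


def decode_note_events(articulation, sustain,
                       articulation_threshold=0.5, sustain_threshold=0.5):
    """Peak picking in the spirit of [24] (sketch): returns a list of
    (pitch, onset_frame, offset_frame) tuples, offset exclusive.

    Assumption: only the first frame of a run above the articulation
    threshold counts as an onset.
    """
    articulation = np.asarray(articulation)
    sustain = np.asarray(sustain)
    num_frames, num_pitches = articulation.shape
    notes = []
    for pitch in range(num_pitches):
        start = None            # onset frame of the currently open note
        prev_above = False
        for t in range(num_frames):
            above = articulation[t, pitch] > articulation_threshold
            is_onset = above and not prev_above
            prev_above = above
            # Close the open note on a new articulation (fragmentation)
            # or when the sustain channel drops under its threshold.
            if start is not None and (
                    is_onset or sustain[t, pitch] < sustain_threshold):
                notes.append((pitch, start, t))
                start = None
            if is_onset:
                start = t
        if start is not None:   # note still sounding at the end of the roll
            notes.append((pitch, start, num_frames))
    return notes
```

In both sketches the onset (articulation) channel decides where a note may begin and the frame (sustain) channel decides how long it lasts. The main difference is the output: the first returns a cleaned piano roll, the second a list of note events with re-articulated notes split into fragments.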