The seasonal deltas are easy. Converging consistently, not so much. Whether the occupant habitually kept it at 74F with a couple of windows cracked, how often they used the wood stove, or when and how long they left on vacation (and what they'd set the T-stat to) is purely speculative. Working month to month with some reasonable assumptions on background load, it doesn't always converge: you HAVE to know the habits & occupancy rate of the place to narrow it down, which isn't always knowable the way it is at your own place.

My Manual J on my mother's place came in at ~18K, which was (uncharacteristically) at the low end of the 17K-26K range I was coming up with using reasonable guesses based on the prior owner's billing history & HDD data. Third-party company proposals for mini-splits, run on other heat loss calculation software, ranged from 17.6K-20.1K. (Maybe that overheated HW heater in the uninsulated utility closet WAS skewing the result, or the prior owner was a fresh-air freak? We'll never know.) Reality may be as low as 15K with the windows closed, but probably not much lower. Without consistent occupancy and use information there's only so much faith one should put in a fuel use calc for sizing the equipment, unless the data are rock-solid consistent from one billing period to another. With diligence a Manual J approach can still overshoot by 25-50%, but almost never by 100%, despite its many shortcomings.
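For anyone who wants to see the arithmetic behind a fuel use calc, here's a rough sketch in Python. Every number in it is made up for illustration: the billing periods, the 80% burner efficiency, the 65F balance point, the 5F design temperature and the 10 therms per period of background load (hot water, cooking) are all assumptions, not figures from my mother's place. It works out a design load per billing period so you can see how far apart the periods land.

```python
BTU_PER_THERM = 100_000  # 1 therm = 100,000 BTU


def design_load_btuh(therms, hdd, efficiency, base_f, design_f,
                     background_therms=0.0):
    """Estimate design heat load (BTU/hr) from one billing period.

    therms            -- fuel billed for the period
    hdd               -- heating degree-days for the same period
    efficiency        -- assumed steady-state efficiency of the heating appliance
    base_f            -- assumed balance point the HDD data are based on
    design_f          -- outdoor design temperature
    background_therms -- assumed non-heating fuel use (hot water, cooking)
    """
    heating_btu = (therms - background_therms) * BTU_PER_THERM * efficiency
    btu_per_degree_hour = heating_btu / (hdd * 24.0)
    return btu_per_degree_hour * (base_f - design_f)


# Hypothetical billing periods: (therms used, HDD base 65F)
periods = [(95.0, 1100.0), (80.0, 1000.0), (60.0, 640.0)]

loads = [
    design_load_btuh(t, h, efficiency=0.80, base_f=65.0, design_f=5.0,
                     background_therms=10.0)
    for t, h in periods
]

low, high = min(loads), max(loads)
spread = (high - low) / low  # fractional disagreement between periods

for (t, h), load in zip(periods, loads):
    print(f"{t:6.1f} therms, {h:6.0f} HDD -> {load:8.0f} BTU/hr")
print(f"range {low:.0f}-{high:.0f} BTU/hr, spread {spread:.0%}")
```

If the per-period results cluster tightly, the fuel use number is probably worth trusting. If they scatter, the guesses about background load, setpoint or occupancy are off for at least some of the periods, and the code can't tell you which.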

Working from a heat pump's power use is fraught with error, at least on the heating season end when auxiliary resistance strips can cut in, even if it's somewhat consistent for cooling season use. Modeling a heat pump's COP curve against the actual vs. average daily outdoor temperature already has error bars as big as anything in a Manual J. Most mechanical equipment is oversized and will keep up no matter what its state of decrepitude, and even undersized equipment doesn't often fall short by much.
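To put a rough number on the actual vs. average temperature problem, here's a toy sketch in Python. The linear COP curve, the hourly temperatures and the hourly kWh figures are all invented for illustration and don't describe any real heat pump. It also assumes no resistance strip use, which can't normally be separated out of whole-unit power data and would make things worse.

```python
BTU_PER_KWH = 3412.0


def cop(temp_f):
    """Hypothetical linear COP curve, clamped between 1.0 and 4.0."""
    return max(1.0, min(4.0, 1.5 + 0.04 * temp_f))


def heat_from_hourly(kwh_by_hour, temps_f):
    """Heat delivered (BTU), applying the COP hour by hour."""
    return sum(k * cop(t) * BTU_PER_KWH for k, t in zip(kwh_by_hour, temps_f))


def heat_from_daily_average(kwh_by_hour, temps_f):
    """Heat delivered (BTU), applying one COP at the day's mean temperature."""
    avg_temp = sum(temps_f) / len(temps_f)
    return sum(kwh_by_hour) * cop(avg_temp) * BTU_PER_KWH


# Made-up day: 12 cold hours drawing more power, 12 mild hours drawing less
temps_f = [10.0] * 12 + [40.0] * 12
kwh_by_hour = [3.0] * 12 + [1.0] * 12

hourly = heat_from_hourly(kwh_by_hour, temps_f)
daily = heat_from_daily_average(kwh_by_hour, temps_f)
overstatement = daily / hourly - 1.0

print(f"hourly model:        {hourly:9.0f} BTU")
print(f"daily-average model: {daily:9.0f} BTU")
print(f"daily average overstates delivered heat by {overstatement:.1%}")
```

Most of the power gets used in the cold hours when the COP is at its worst, so applying the COP at the daily average temperature credits the machine with heat it never delivered. That's before any uncertainty in the COP curve itself.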