I would dispute that what you intend is the most productive approach, for a number of reasons, but to answer your question: I don’t believe what you are looking for exists, and (speaking from experience) voice notes/comments don’t really work well. It is possible to play back geo-referenced voice notes in JOSM, but afaik they would have to be separate from the photos.
The most productive method is whatever works for you. Personally, I create lots of numbered notes in a file that I then email to myself, merge that with a GPS trace so that the numbered notes become waypoints, and use the resulting GPX in an OSM editor when back at home.
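For anyone wanting to try the notes-to-waypoints merge described above, here is a rough Python sketch. It is not the tooling the poster actually uses: it assumes each note line carries a timestamp in a made-up `<number> <ISO time> <text>` format, and it simply places each note at the track point closest in time. The function names and the note format are hypothetical; only the GPX 1.1 element names (`trkpt`, `time`, `wpt`, `name`) come from the GPX schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

GPX_NS = "http://www.topografix.com/GPX/1/1"


def parse_time(text):
    # GPX times are ISO 8601 in UTC, e.g. 2024-05-01T10:00:00Z
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def read_trackpoints(gpx_text):
    """Return (time, lat, lon) for every track point that has a timestamp."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.iter(f"{{{GPX_NS}}}trkpt"):
        time_el = trkpt.find(f"{{{GPX_NS}}}time")
        if time_el is None or not time_el.text:
            continue
        when = parse_time(time_el.text.strip())
        points.append((when, trkpt.get("lat"), trkpt.get("lon")))
    return points


def read_notes(notes_text):
    """Assumed (hypothetical) note format: '<number> <ISO time> <free text>'."""
    notes = []
    for line in notes_text.splitlines():
        if not line.strip():
            continue
        number, stamp, text = line.split(maxsplit=2)
        notes.append((number, parse_time(stamp), text))
    return notes


def notes_to_waypoints(gpx_text, notes_text):
    """Return a GPX string with one waypoint per note, placed at the
    track point nearest in time to that note."""
    points = read_trackpoints(gpx_text)
    if not points:
        raise ValueError("GPS trace contains no timed track points")
    ET.register_namespace("", GPX_NS)
    gpx = ET.Element(f"{{{GPX_NS}}}gpx",
                     {"version": "1.1", "creator": "notes-merge-sketch"})
    for number, when, text in read_notes(notes_text):
        _, lat, lon = min(
            points, key=lambda p: abs((p[0] - when).total_seconds()))
        wpt = ET.SubElement(gpx, f"{{{GPX_NS}}}wpt", {"lat": lat, "lon": lon})
        ET.SubElement(wpt, f"{{{GPX_NS}}}name").text = f"{number}: {text}"
    return ET.tostring(gpx, encoding="unicode")
```

The returned string can be saved as a `.gpx` file and opened in an editor as a waypoint layer. Snapping to the nearest track point is the simplest choice; interpolating between the two surrounding points would be more precise if the trace is sparse.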
Having recently moved from Android to iPhone, I have to say that what worked well for me on Android does not seem to exist on iPhone.
Concur with SimonPoole that geo-referenced voice notes are not as productive as you might think. When I tried that type of mapping I found that it slowed my walking, that it took a long time listening in JOSM to recover the data, and that some of the data was unusable due to background noise, etc.
The old version of KeypadMapper I self-compiled for Android was really useful: I could pre-key in a house number as I approached the building, then tap the left, right or forward button at the appropriate point without breaking pace. If/when I passed a letter box, fire hydrant, bench or other notable item, a quick switch to OsmTracker, which was also running, could get me a quick geo-referenced note or photo.
If I can find an iPhone equivalent of OsmTracker and/or KeypadMapper then I will be happy. But so far nothing I’ve tried has worked as well for me as I’d like.
I’ll try and find apps for Android and iOS that let users add an audio commentary when taking pictures, and see how it goes. It’s for a “mapping by riding around” outing.
I see that JOSM has an audio menu, but found no tutorial on how to import and work with geo-referenced pics (EXIF) + audio files in JOSM. Do I need some plug-ins for this purpose?
Edit: The following page doesn’t seem to include the solution I mention, i.e. using a GPS-enabled smartphone to take pictures and record audio at the same time (with the smartphone, not a dictaphone) → JOSM will have the GPS coords, a picture, and an audio file to play to the user for more info.