Which sounds are significant? How does the captioner choose which sounds to caption? Are some captions unnecessary? Why isn't it possible to caption every sound in the environment?