Generate 5 thoughts, prune 3, branch, repeat. I think that’s what o1 pro and o3 do.
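
A rough sketch of that loop, under one reading of it: every surviving thought spawns 5 continuations, the 3 lowest-scoring are dropped, and the remaining 2 branch again. `expand` and `score` are hypothetical stand-ins (random numbers instead of a language model and a real scorer), and nothing here is a claim about how o1 pro or o3 actually work.

```python
import random

def expand(thought, k, rng):
    # Hypothetical stand-in for sampling k continuations from a model:
    # each "step" is just a random number in [0, 1).
    return [thought + [rng.random()] for _ in range(k)]

def score(thought):
    # Hypothetical stand-in for a learned scorer: mean of the step values.
    return sum(thought) / len(thought)

def search(depth=3, generate=5, prune=3, seed=0):
    rng = random.Random(seed)
    frontier = [[]]  # start from a single empty thought
    for _ in range(depth):
        next_frontier = []
        for thought in frontier:
            children = expand(thought, generate, rng)
            children.sort(key=score, reverse=True)
            # Drop the `prune` worst children; the survivors branch next round.
            next_frontier.extend(children[: generate - prune])
        frontier = next_frontier
    return frontier

leaves = search()
best = max(leaves, key=score)
print(len(leaves), round(score(best), 3))
```

With the defaults the frontier doubles each round (2 survivors per parent), so three rounds leave 8 candidate chains. Pruning globally across all parents instead of per parent would turn this into an ordinary beam search.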

  • artificialfish@programming.dev (OP)
    3 hours ago

    Well, I think you actually need to train a “discriminator” model on rationality tests. Probably an encoder-only model like BERT, just to assign a score to thoughts. Then you do Monte Carlo tree search.
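
To make the idea concrete, here is a minimal sketch of Monte Carlo tree search where the discriminator's score is used as the reward in place of a random rollout. Everything named here is an assumption for illustration: `score_thought` is a placeholder for the trained BERT-style scorer, `propose_steps` is a placeholder for the generator, and the "thoughts" are just lists of random numbers.

```python
import math
import random

class Node:
    def __init__(self, thought, parent=None):
        self.thought = thought   # list of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0         # sum of rewards backed up through this node

def score_thought(thought):
    # Hypothetical placeholder for the discriminator; returns a value in [0, 1].
    return sum(thought) / len(thought) if thought else 0.0

def propose_steps(thought, k, rng):
    # Hypothetical placeholder for the generator sampling k next steps.
    return [thought + [rng.random()] for _ in range(k)]

def ucb(child, parent_visits, c=1.4):
    # Upper confidence bound: average reward plus an exploration bonus.
    if child.visits == 0:
        return math.inf
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(iterations=200, branching=5, max_depth=3, seed=0):
    rng = random.Random(seed)
    root = Node([])
    for _ in range(iterations):
        node = root
        # Selection: follow the highest-UCB child down to a leaf.
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # Expansion: grow the leaf unless the depth limit is reached.
        if len(node.thought) < max_depth:
            node.children = [Node(t, node) for t in propose_steps(node.thought, branching, rng)]
            node = rng.choice(node.children)
        # Evaluation: the scorer stands in for a rollout.
        reward = score_thought(node.thought)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.thought

print(mcts())
```

The design choice worth noting is the evaluation step: classic MCTS plays a random rollout to the end of the game, whereas here the partial chain is scored directly, which is the role the discriminator would play.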