395.
HN
Autonomous AI Newsroom
A recent arXiv study, "Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought," investigates how reasoning models such as DeepSeek-R1 and GPT-OSS approach problem-solving. The researchers find that these models often settle on their final answers earlier than their chain-of-thought suggests, yet keep generating text past that point, a behavior described as performative reasoning. This points to a disconnect between when a model internally resolves a question and how it outwardly displays its thought process, suggesting that much of the additional text serves purposes other than reaching the answer.
Keywords: #phi4, Answers, Autonomous AI, Chain-of-Thought, DeepSeek-R1, GPT-OSS, Internal confidence, Models, Newsroom, Performative reasoning, Reasoning Theater, Research, Study, Tokens, arXiv
www.simplenews.ai 2 days ago
|
398.
HN
Research Shows Models Know Answers Before Finishing Chain-of-Thought Reasoning
The study "Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought" investigates the phenomenon where reasoning models, such as DeepSeek-R1 671B and GPT-OSS 120B, continue to produce explanations even after forming confident internal conclusions—a behavior termed "reasoning theater." By employing techniques like activation probing, early forced answering, and chain-of-thought monitoring, researchers discovered that on straightforward tasks (MMLU), models finalize answers internally before completing reasoning chains, with subsequent tokens serving more as embellishment than computational necessity. Conversely, for complex questions (GPQA-Diamond), genuine shifts in belief occur during the reasoning process. The research highlights a potential reduction in token usage by up to 80% on simpler tasks and 30% on more challenging ones through probe-guided early exits while maintaining accuracy, suggesting current models expend unnecessary computational resources due to an emphasis on extensive reasoning displays. Activation probing emerges as a crucial method for distinguishing actual reasoning from performative explanation, presenting opportunities for optimizing model deployment by minimizing superfluous computation without affecting accuracy.
Keywords: #phi4, DeepSeek-R1, GPQA-Diamond, GPT-OSS, MMLU questions, Reasoning theater, activation probing, adaptive computation, chain-of-thought reasoning, early forced answering, inference costs, model beliefs, performative reasoning, token reduction
www.simplenews.ai 2 days ago
|
2710.
HN
GPT-OSS Optimizations on Nvidia Blackwell: Pushing the Pareto Frontier
vLLM and NVIDIA, working with the open-source community, have significantly boosted gpt-oss-120b performance on Blackwell GPUs, improving throughput by 38% and interactivity by 13%. The gains come from FlashInfer kernels targeting compute-intensive operations, graph fusions via torch.compile that reduce memory access, and runtime enhancements, such as Async Scheduling and Stream Interval, that ease CPU bottlenecks and host-device synchronization overhead. Recommended deployment configurations aim to maximize Blackwell performance across large language model (LLM) inference workloads and metrics. Ongoing efforts target further throughput gains through GPU disaggregation, improved Data+Expert parallelism, and minimum-latency scenarios. The vLLM community credits these advancements to collaborative effort.
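The recommended deployment configurations mentioned above can be approximated with a launch command along these lines. The checkpoint name `openai/gpt-oss-120b` is the public Hugging Face id, but the flag and environment-variable names shown are assumptions that vary across vLLM versions; verify against `vllm serve --help` and the vLLM blog post before relying on them.

```shell
# Hedged sketch of a gpt-oss-120b launch on a Blackwell node.
# VLLM_ATTENTION_BACKEND selects FlashInfer kernels (name may differ by version).
export VLLM_ATTENTION_BACKEND=FLASHINFER

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 4 \
  --async-scheduling   # overlap CPU scheduling with GPU execution (experimental flag)
```

Tensor-parallel degree, backend selection, and the async-scheduling flag are the knobs most likely to shift the throughput/interactivity trade-off described in the post; tune them per workload rather than copying these values.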
Keywords: #phi4, Async Scheduling, CUDA Graphs, Cutlass backend, Data+Expert parallel, FP4 TensorCores, FlashInfer, GPT-OSS, MoE, Nvidia Blackwell, Pareto Frontier, disaggregation, hardware-software co-design, inference runtime, interactivity, kernel fusion, minimum latency, performance optimization, tensor cores, throughput, torch.compile
blog.vllm.ai 12 days ago
|