Scraper Spider

@dbaman@fosstodon.org

2026-03-09 02:49
gpt-5
gpt-5 stories from the last 14 days
95.  HN Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
The paper "Real Money, Fake Models: Deceptive Model Claims in Shadow APIs" by Yage Zhang and co-authors examines the proliferation of shadow APIs that falsely claim to provide unrestricted access to official large language model (LLM) services such as GPT-5 and Gemini-2.5. These unauthorized APIs have gained traction due to the high costs and regional barriers associated with legitimate services, prompting researchers and developers to seek alternatives. The authors conducted a comprehensive audit comparing outputs from official LLMs and shadow APIs, revealing substantial discrepancies. Their study identified 17 shadow APIs, including one prominently referenced in academic literature. Through detailed evaluations centered on utility, safety, and model verification, the research uncovered deceptive practices among these APIs. Key findings included significant performance divergences (up to 47.21%) from official models, unpredictable safety behaviors, and a high rate of identity verification failures. These discrepancies raise serious concerns about the reliability of research and applications that depend on shadow APIs. The study warns of implications for reproducibility and validity in scientific studies, along with potential risks to users and damage to the reputations of official model providers. Consequently, it stresses the importance of careful scrutiny and caution when using shadow APIs in both research and application development. Keywords: #phi4, Academic Papers, Artificial Intelligence, Citation Analysis, Cryptography, Deceptive Practices, GPT-5, Gemini-2.5, Large Language Models, Model Verification, Performance Divergence, Reproducibility, Safety Behaviors, Security, Shadow APIs, Software Engineering
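The core of such an audit, scoring the same benchmark against an official endpoint and a shadow endpoint and reporting the divergence, can be sketched in a few lines. The two `query_*` functions below are hypothetical stubs standing in for real API calls, and the numbers are illustrative, not the paper's:

```python
# Minimal sketch of a shadow-API audit: run identical benchmark questions
# through both endpoints, score against gold answers, report divergence.

def query_official(question: str) -> str:
    # Hypothetical stand-in for a call to the official API.
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "")

def query_shadow(question: str) -> str:
    # A deceptive shadow API may silently route to a weaker model.
    return {"2+2": "4", "capital of France": "Lyon"}.get(question, "")

def accuracy(answers, gold):
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

benchmark = ["2+2", "capital of France"]
gold = ["4", "Paris"]

official_acc = accuracy([query_official(q) for q in benchmark], gold)
shadow_acc = accuracy([query_shadow(q) for q in benchmark], gold)
divergence = (official_acc - shadow_acc) / official_acc * 100
print(f"divergence: {divergence:.2f}%")
```

A real audit would layer safety probes and identity-verification prompts on top of this utility scoring, as the paper's three evaluation axes suggest.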
    arxiv.org 15 hours ago
266.  HN LLMs Solving a DEF Con CTF Finals Challenge
The author demonstrates how Large Language Models (LLMs), specifically GPT-5, can solve a DEF CON CTF Finals challenge with minimal human input by leveraging tool-calling against an IDA MCP (Model Context Protocol) server. This involved interacting with and extracting data from a binary that had been partially reversed to aid exploit development. Initial attempts at exploiting the "ico" challenge were unsuccessful; however, through iterative refinement of scripts based on outputs and new information, key insights were gained. While direct extraction of the flag was not possible initially, an MD5 hash of the actual flag could be deduced from metadata responses. This led to a revised exploit script that manipulated comment paths within the binary's protocol to extract the plaintext flag. The success hinged on several factors: GPT-5's advanced tool-calling capabilities, the partially reversed state of the challenge, and a straightforward exploit path requiring minimal steps. However, this approach did not generalize to other challenges in the event, highlighting a balance between automated tooling and traditional problem-solving skills in cybersecurity contexts. The author also noted that allowing early Python usage for verification might have further streamlined the process. Despite achieving an efficient solution for one challenge through a single-byte patch without affecting service-level agreements (a method subsequently adopted by their team), the author expressed mixed feelings about relying on LLMs. While impressed with the technological advancements, they valued personal engagement and learning in puzzle-solving over reliance on automated tools. The broader implication is that not all CTF challenges are solvable using LLMs; as competitions evolve, they increasingly resist advanced analysis tools like symbolic executors by introducing more sophisticated challenges.
In conclusion, while LLMs are significantly altering the landscape of CTFs by enabling new strategies and efficiencies, traditional challenge-solving skills remain crucial. The community is expected to continue adapting by developing more complex challenges in response to these technological advancements. Keywords: #phi4, DEF CON CTF, GPT-5, IDA MCP, LLMs, Python, SLA, anti-symbolic execution, automation, binary analysis, challenge, exploit, flag file, metadata extraction, patching, prompt engineering, pwn, reverse engineering, script automation, symbolic executor, tool calls
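The iterative tool-calling loop described above can be caricatured as follows. The server call, the scripted "model" policy, and the paths are all invented for illustration; this is not the author's actual setup:

```python
# Toy sketch of a tool-calling loop: the "model" (a scripted policy here)
# repeatedly issues tool calls against an IDA-MCP-like server until it
# can emit the flag, refining its request after seeing only a hash.

def mcp_read_comment(path: str) -> str:
    # Stand-in for a server call returning metadata for a path.
    store = {"/flag": "md5:9f86d0...", "/flag/../flag": "flag{example}"}
    return store.get(path, "")

def model_step(history):
    # Mimics iterative refinement: probe the direct path first, then
    # manipulate the path when only a hash of the flag comes back.
    if not history:
        return ("tool", "/flag")
    if history[-1].startswith("md5:"):
        return ("tool", "/flag/../flag")
    return ("answer", history[-1])

history = []
while True:
    kind, arg = model_step(history)
    if kind == "answer":
        flag = arg
        break
    history.append(mcp_read_comment(arg))

print(flag)  # flag{example}
```

The real loop differs in every particular (GPT-5 chooses the calls, the server exposes IDA's database), but the shape, observe, refine, re-query, is the same.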
    wilgibbs.com a day ago
412.  HN Show HN: Contexa – Git-inspired context management for LLM agents
Contexa, rebranded as Cortexa, is an open-source initiative that enhances the management of Large Language Model (LLM) agents' context by adopting concepts similar to those in Git. Its primary innovation is a versioned memory system designed to address challenges such as disorganized context handling, loss of reasoning steps, and difficulties in replicating or reverting agent behaviors. Cortexa's functionality includes features reminiscent of Git commands like snapshots, branching, and history tracking. The key components of Cortexa are its OTA Log for continuous observation-thought-action tracing, COMMIT for summarizing older steps into milestones, BRANCH for creating isolated reasoning paths, MERGE for integrating successful branches back into the main trajectory, and CONTEXT for accessing historical information at varying resolutions. These features collectively enhance context management efficiency. Cortexa demonstrates superior performance in benchmarks compared to many existing systems, with findings indicating that focusing on the most recent commits (K=1) maximizes effectiveness. It is implemented across multiple programming languages—Python, TypeScript/JavaScript, Rust, Go, Zig, Lua, and Elixir—with consistent data format outputs using Markdown + YAML for seamless interoperability. The framework provides detailed installation instructions and practical examples of its use, such as workspace initialization, action logging, milestone committing, branching for experimentation, merging results, and context summarization. Cortexa's architecture mirrors Git with components like OTA records and commit metadata, ensuring all data remains in human-readable formats suitable for inspection and debugging. Cortexa is structured into language-specific packages within its repository, each equipped with build tools and tests, and encourages contributions through a defined process described in the CONTRIBUTING.md file. 
It is distributed under the MIT License, and users are encouraged to cite the original paper if used in research. Overall, Cortexa offers a comprehensive solution for managing LLM agent contexts effectively, leveraging Git's proven methodologies. Keywords: #phi4, Claude 4, Contexa, Cortexa, Elixir, GCC, GPT-5, Git-inspired, GitHub, Go, JWT authentication, LLM agents, Lua, MIT License, Markdown, OTA traces, Python, REST API, Rust, SWE-Bench, TypeScript/JavaScript, YAML, Zig, arXiv, architecture, branch, branching, citation, commit, context management, context retrieval, contributing, data models, history, install, memory hierarchy, merge, metadata, milestone summaries, planning artifact, quick start, repository structure, road map, snapshots, user auth, versioned memory, workspace
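A minimal sketch of what a Git-style context API of this shape might look like follows. This is not Cortexa's real API; the class, method names, and record formats are invented to illustrate the OTA-log/commit/branch/merge/context operations the summary lists:

```python
# Invented illustration of a versioned agent memory with Git-like ops:
# log OTA records, commit summaries of older steps, branch for isolated
# reasoning, merge back into main, and read context at a coarse level.

class Workspace:
    def __init__(self):
        self.branches = {"main": []}   # branch name -> list of entries
        self.current = "main"

    def log(self, observation, thought, action):
        self.branches[self.current].append(
            {"type": "ota", "o": observation, "t": thought, "a": action})

    def commit(self, summary):
        # Collapse pending OTA records into one milestone entry.
        log = self.branches[self.current]
        pending = [e for e in log if e["type"] == "ota"]
        kept = [e for e in log if e["type"] != "ota"]
        kept.append({"type": "commit", "summary": summary,
                     "steps": len(pending)})
        self.branches[self.current] = kept

    def branch(self, name):
        self.branches[name] = list(self.branches[self.current])
        self.current = name

    def merge(self, into="main"):
        self.branches[into] = list(self.branches[self.current])
        self.current = into

    def context(self):
        # Coarse view: milestone summaries plus raw actions.
        return [e.get("summary") or e["a"]
                for e in self.branches[self.current]]

ws = Workspace()
ws.log("page loaded", "need login", "click login")
ws.commit("logged in")
ws.branch("experiment")
ws.log("form shown", "try upload", "upload file")
ws.merge()
print(ws.context())  # ['logged in', 'upload file']
```

Cortexa additionally persists everything as Markdown + YAML so traces stay human-readable; the in-memory dicts above are a stand-in for that format.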
    github.com 2 days ago
   https://flompt.dev   a day ago
1151.  HN GPT‑5.3 Instant System Card
GPT-5.3 Instant is an advanced iteration within the GPT-5 series, designed to deliver quicker responses with more relevant context during web searches. Unlike previous versions, it significantly reduces extraneous content such as irrelevant detours and disruptive phrasing in conversations, enhancing clarity and focus. The model retains the safety strategies implemented in its predecessor, GPT-5.2 Instant, ensuring consistent mitigation of potential risks while interacting with users. This improvement aligns with the ongoing evolution of AI models towards more efficient and user-centric interactions by addressing previous limitations related to response coherence and contextual relevance. Keywords: #phi4, Answers, Caveats, Comprehensive Approach, Contextualized, Conversation Flow, Dead Ends, Declarative Phrasing, Faster, GPT-5, GPT-53, Instant, Response, Richer, Safety Mitigation, System Card, Web Search
    openai.com 5 days ago
1376.  HN 45 Thoughts About Agents
The article examines the transformative role of AI agents in enhancing work efficiency, particularly highlighting their impact on coding and integration tasks. Recent advancements have allowed engineers to focus more on high-level design by delegating code generation to AI, signaling a significant evolution from earlier capabilities. AI agents are portrayed as rapidly adaptable tools that can undergo quick updates through incremental improvements based on user feedback, which often leads users to discover innovative applications faster than developers themselves. Despite their ability to automate repetitive tasks and boost productivity, AI agents currently face challenges with high-level decision-making and adapting to unexpected changes in processes. To optimize their use, the article suggests employing a dual-agent system where one agent performs the task while another reviews for errors or improvements. It is crucial for users to set clear success criteria and instructions to prevent unproductive feedback loops. Advanced users have developed strategies for enabling agents to self-check outputs, though these AI models still require human intervention to recognize unstated requirements and ensure robustness. In summary, while AI agents offer significant productivity benefits by handling large-scale tasks with persistence, they also pose integration challenges that demand a thoughtful approach to fully leverage their strengths. Keywords: #phi4, AGI, AI agents, GPT-5, coding, decision making, feedback cycle, high-level design, integration, low-level coding, productivity tools, reliability, success criteria, threshold effects, work nature
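The dual-agent pattern the article suggests (one agent works, a second reviews against explicit success criteria, with a retry budget to prevent unproductive loops) can be sketched with stubbed agents; both functions below are illustrative stand-ins, not real model calls:

```python
# Sketch of a worker/reviewer loop: the worker produces a draft, the
# reviewer checks it against a concrete success criterion, and the loop
# stops on approval or when the retry budget runs out.

def worker(task, feedback):
    # Stub: the worker only fixes what the reviewer flagged.
    draft = "def add(a, b): return a - b"
    if "wrong operator" in feedback:
        draft = "def add(a, b): return a + b"
    return draft

def reviewer(task, draft):
    # Clear, checkable criterion: the code must actually add.
    scope = {}
    exec(draft, scope)
    if scope["add"](2, 3) != 5:
        return False, "wrong operator"
    return True, ""

task = "write add(a, b)"
feedback, approved = "", False
for _ in range(3):  # retry budget prevents endless feedback loops
    draft = worker(task, feedback)
    approved, feedback = reviewer(task, draft)
    if approved:
        break

print(approved)  # True
```

The key design point matches the article's advice: the reviewer's criterion is executable and unambiguous, so the loop converges instead of cycling on vague feedback.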
    secondthoughts.ai 6 days ago
2697.  HN What's so hard about continuous learning?
Continuous learning for AI models, where they update their own weights based on new data post-deployment, is technically feasible yet presents significant challenges in practice. Although adjusting model weights may seem straightforward, ensuring such updates enhance rather than degrade performance requires intricate human oversight due to the complex, non-linear nature of model training. Fine-tuning large language models (LLMs) for specific codebases has proven difficult and unreliable, hindering genuine understanding and complicating continuous learning efforts. Allowing autonomous learning also introduces substantial safety risks, including potential data poisoning through prompt or weight injection that could lead to severe consequences if compromised. Moreover, the adoption of continuous learning presents portability issues; customized models may not seamlessly transfer when new versions are released, akin to needing a new engineer frequently, complicating upgrades and maintenance. Overall, the challenge lies in automating safe and effective learning processes while maintaining model integrity and ease of upgrading. As such, continuous learning remains an ambitious but currently impractical goal for widespread deployment due to these multifaceted challenges. Keywords: #phi4, AGI, Codex, Continuous learning, GPT-5, LLM, LoRA adapter, fine-tuning, local minimum, model weights, prompt injection, runtime, training pipeline
    www.seangoedecke.com 12 days ago
2852.  HN Show HN: Open-source LLM and dataset for sports forecasting (Pro Golf)
The post presents an open-source model fine-tuned for predicting golf outcomes, built on gpt-oss-120b and trained with LoRA via the GRPO method using a Brier score reward. Training used 3,178 questions derived from news articles about golf forecasts, collected via the Lightning Rod SDK. The model outperforms GPT-5 at predicting golf outcomes, as evidenced by better Brier Skill and ECE scores on test data. Both the model and its dataset are available on Hugging Face, with detailed instructions for replicating similar models in other domains using the Lightning Rod SDK framework. Importantly, creating these models does not require domain-specific expertise or manual labeling, broadening their potential applicability across fields. The post also provides resources on implementation details, usage guidelines, and performance metrics, including links to the GitHub repository for the Lightning Rod SDK and the Hugging Face datasets and models for golf forecasting. Keywords: #phi4, Brier score, ECE, GPT-5, GRPO, Hugging Face, LLM, Lightning Rod SDK, LoRA, MoE, Open-source, Pro Golf, RL-Tuning, Tinker, bfloat16, fine-tuned model, gpt-oss-120b, inference, sports forecasting
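The Brier score used as the reward signal is simple to compute: the mean squared error between a predicted probability and the 0/1 outcome, with the Brier Skill Score measured against a base-rate reference. A GRPO-style reward could then be, e.g., the negative Brier score. The numbers below are illustrative, not from the post:

```python
# Brier score and Brier Skill Score for binary forecasts.

def brier(probs, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes.
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill(probs, outcomes):
    # Skill relative to always forecasting the base rate; 1 is perfect,
    # 0 matches the reference, negative is worse than the reference.
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier([base_rate] * len(outcomes), outcomes)
    return 1 - brier(probs, outcomes) / reference

# "Will player X make the cut?" style questions with binary outcomes.
probs = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]
score = brier(probs, outcomes)
skill = brier_skill(probs, outcomes)
print(round(score, 4), round(skill, 4))
```

Because lower Brier scores are better, using it as a reward means negating it (or using 1 minus the score) so the RL objective still maximizes.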
    huggingface.co 12 days ago
3038.  HN What's so hard about continuous learning?
Continuous learning for AI models entails updating their weights autonomously based on new data after deployment, a concept that is technically feasible but fraught with challenges. A primary concern is balancing model improvement against potential degradation since unsupervised training can lead to performance declines without meticulous human oversight. Furthermore, the variability in model outcomes due to different datasets or random seeds suggests that superior results might occasionally stem from luck rather than robust architecture. Specifically, fine-tuning Large Language Models (LLMs) on particular codebases often fails to achieve meaningful improvements in domain-specific understanding. Additionally, continuous learning poses significant safety risks; models are vulnerable to poisoning through malicious data inputs, which could lead to severe security breaches. Another hurdle is the portability of learned information when transitioning between updated and new model architectures, as knowledge transfer can be problematic across different systems. The overarching challenge lies in automating this training process while maintaining safety, ensuring enhancements without manual intervention, and safeguarding against threats to model integrity and security. Keywords: #phi4, AGI, Codex, Continuous learning, GPT-5, LLM, LoRA adapter, fine-tuning, local minimum, model weights, prompt injection, runtime, training pipeline
    www.seangoedecke.com 13 days ago
3174.  HN I analyzed 3 years of my ChatGPT usage (21,948 turns and Oura biometrics)
This project offers an extensive examination of three years' worth of daily interactions with ChatGPT, involving 21,948 conversation turns, alongside biometric data from the Oura Ring. Analyzing 3,662 conversations and 62,775 messages through a comprehensive 14-step pipeline, it employs six cognitive-science-based attention metrics validated by heart rate data to assess user engagement. Key findings reveal that initial assumptions of decreased attention due to faster AI response consumption were misleading; instead, this was an artifact from increased AI verbosity over time. It was observed that actual attention levels either improved or remained consistent when controlling for verbosity changes. The study also highlights the significant impact of users' emotional states on their interaction engagement levels, noting that frustrated users processed more content compared to satisfied ones. Six distinct behavioral modes were identified through clustering techniques based on 20 key performance indicators (KPIs), namely: Work Session, Quick Lookup, Specification-Driven, Death Spiral, Guided Execution, and Heavy Specification. Methodologically, the project utilized a robust 14-step pipeline for data cleaning, classification, embedding, clustering, labeling, enrichment, and analysis of ChatGPT export data. Validation was conducted using cognitive science models and biometric data with tools such as pandas, scikit-learn, UMAP, HDBSCAN, sentence-transformers, and matplotlib. The project is structured into directories containing scripts, notebooks, results, documentation, and configuration files, supporting the analysis of multilingual conversations primarily in English and Portuguese. The dataset spans various GPT model generations with a subset enriched by Oura Ring biometric data. 
Detailed documentation elucidates each pipeline step, analysis outputs, system architecture, and metric definitions, complemented by a quick start guide to facilitate environment setup and execution using Jupyter notebooks. Prerequisites include Python 3.12 and specific tools like Ollama with gemma3:12b. Keywords: #phi4, ChatGPT, GPT-5, HDBSCAN, Ollama, Oura Ring, Python, UMAP clustering, analysis, attention metrics, behavioral modes, biometrics, cognitive science, data pipeline, emotional state, matplotlib, pandas, scikit-learn, sentence-transformers
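The clustering step that derives behavioral modes can be illustrated with a toy, stdlib-only version. The real pipeline runs UMAP + HDBSCAN over 20 KPIs; this sketch just assigns 2-KPI vectors to fixed seed centers, and the KPI values and mode assignments are invented:

```python
# Toy illustration of deriving behavioral modes from KPI vectors:
# one assignment step of k-means against hand-picked seed centers.

def dist2(a, b):
    # Squared Euclidean distance between two KPI vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    # Index of the nearest center for each point.
    return [min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            for p in points]

# Hypothetical KPI vectors: (turns per session, user words per turn)
points = [(50, 5), (48, 6), (2, 3), (3, 2), (40, 80), (45, 90)]
centers = [(49, 5), (2, 2), (42, 85)]  # seeds for three candidate modes

labels = assign(points, centers)
modes = {0: "Work Session", 1: "Quick Lookup", 2: "Heavy Specification"}
print([modes[i] for i in labels])
```

HDBSCAN needs no seed centers and can mark noise points, which matters for messy conversation data; the fixed-center version above only shows the grouping idea.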
    github.com 13 days ago