329.
HN
Show HN: Llama 3.2 3B and Keiro Research achieves 85% on SimpleQA
Llama 3.2 3B paired with Keiro Research's retrieval API scores 85% on the SimpleQA benchmark across 4,326 questions. The result is notable given the model's size: much larger systems such as ROMA (357B) and OpenDeepSearch (671B) score 93.9% and 88.3% respectively. That a 3B model comes this close raises questions about whether much larger models are necessary for tasks like this, and the discussion suggests that small, web-enabled models may deliver comparable or better outcomes, particularly outside coding, at a fraction of the resource cost. Links to a benchmark script and Keiro Research's API documentation are provided for further exploration.
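The scoring side of such a benchmark is simple to sketch. The Python below is not the linked benchmark script; it shows a normalized exact-match scorer as a stand-in (SimpleQA's official grading actually uses an LLM judge, and the model/API call is omitted entirely):

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles -- a crude stand-in
    for SimpleQA's LLM-based answer grading."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions whose normalized form matches the gold answer."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return correct / len(golds)
```

At 85% over 4,326 questions, roughly 3,677 answers would be graded correct.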
Keywords: #phi4, AI Search, Data Extraction, Keiro Research, Llama, OpenDeepSearch, ROMA, SimpleQA, Sonar Pro, benchmark, compute, parameters, retrieval, web scraper API
www.keirolabs.cloud a day ago
|
682.
HN
The Custom ASIC Thesis
The article explores recent advancements in AI technology, emphasizing Taalas's introduction of a high-performance API service for the Llama 3.1 model. This new service achieves an impressive processing rate of 16,960 tokens per second per user while simultaneously reducing costs and power consumption. Despite these successes, challenges related to quantization are acknowledged and will be addressed by HC2.
The narrative then shifts to a strategic pivot toward custom ASICs (Application-Specific Integrated Circuits) for AI models, drawing on insights from Martin Casado. He argues that chips tailored to specific AI applications can cut costs and improve efficiency significantly compared with general-purpose hardware such as Nvidia's. Recent partnerships, such as OpenAI's agreement with Broadcom, support this thesis.
The article highlights the dual benefits of customized ASICs: cost reduction and enhanced model performance. It predicts a rapid closure of the performance gap between custom and generic solutions, fueled by ongoing advancements in integrating model design with chip architecture and standardizing large language models (LLMs). AI engineers are encouraged to explore these innovations, anticipating marked improvements within two years.
Additionally, the article briefly touches on evaluations involving frontier models like Gemini 3.1 Pro using benchmarks such as SWE-bench and MRCR, alongside discussions of real-world performance metrics.
Keywords: #phi4, AI Engineers, Claude C Compiler, Custom ASIC, FP4, Gemini 3.1 Pro, Huggingface, Llama, METR, MRCR, Martin Casado, Nvidia, OpenAI Broadcom deal, Opus, SWE-bench, Sarah Wang, Taalas, accelerators, billion dollar training run, capability market fit, chip tapeout, frontier quality, ggml, inference, integrated model-chip codesign, quantization
www.latent.space 3 days ago
|
1008.
HN
Running Llama Inference on Intel Itanium
The article explores optimizing Llama inference on an Intel Itanium-based HP server, achieving notable speedups through compiler choices. Switching from GCC to the Open64 compiler alone tripled performance. HP's C compiler promised further gains but introduced a complication: it targets big-endian HP-UX. To bridge the difference, the 32-bit values in the model files were byte-swapped with `objcopy`, and llama2.c was modified to handle the remaining endianness differences, so that model files load correctly on HP-UX while character data stays intact.
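The byte-order fix amounts to reversing the byte order of every 32-bit word in the weight file. The article does this with `objcopy`; the Python sketch below performs the same transformation using only the standard library (filenames are hypothetical):

```python
from array import array

def swap32(src: str, dst: str) -> None:
    """Reverse the byte order of every 32-bit word in a binary file,
    e.g. to convert little-endian float32/int32 data for a big-endian host.
    The file size must be a multiple of 4 bytes."""
    with open(src, "rb") as f:
        words = array("I", f.read())  # "I" is a 4-byte unsigned int on common platforms
    words.byteswap()
    with open(dst, "wb") as f:
        words.tofile(f)
```

The same swap works for float32 weights, since only the byte order changes, not the bit patterns. Character data (such as tokenizer vocabulary strings) must be left untouched, which is why the string handling had to be patched separately in llama2.c rather than swapped wholesale.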
These adjustments enabled successful inference on HP-UX with both OpenMP and fast-math optimizations. The gains were substantial: 39.24 tokens per second with OpenMP enabled, rising to 73.84 tokens per second with fast math. Although these figures are modest next to an AMD Ryzen, they are impressive given the hardware's age. The article suggests further gains might come from analyzing the HP C compiler's assembly output or from alternative implementations.
In conclusion, while showcasing sample outputs at varying levels of optimization, the article hints at further avenues for performance improvement in future studies.
Keywords: #phi4, AMD Ryzen 9 5900HX, GCC, HP C compiler, HP server, HP-UX, Intel Itanium, Llama inference, Open64 compiler, OpenMP, TransformerWeights, assembly, big-endian, endianity, fast math, implementation, objcopy, performance, tokens per second
medium.com 4 days ago
|
1744.
HN
Show HN: I built an open-source D&D app using Python and Llama 3.1
DM Co-Pilot is an open-source Dungeons & Dragons (D&D) application that uses Python and Meta Llama 3.1 to cut the administrative load on tabletop Game Masters (GMs), aiming to reduce preparation time by up to 80% through automated scheduling, game balancing, and text summarization. Its features include a Campaign Matchmaker that filters players by schedule using compatibility scores generated by Llama 3.1; an Encounter Architect that automates monster selection from a dataset of over 400 monsters, with tools for Challenge Rating (CR) analysis and estimation; a Session Scribe that turns unstructured session notes into narrative summaries with local saving; and Quick Improv Tools for on-the-fly needs such as NPC generation and loot balancing. The frontend is built with Streamlit, data processing uses Pandas, and AI capabilities are served through the Groq API running Meta Llama 3.1. Overall, DM Co-Pilot streamlines campaign management with intelligent automation and data-driven insights.
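To illustrate the matchmaking idea: in DM Co-Pilot the compatibility score is generated by Llama 3.1, but a deterministic Jaccard overlap over weekly availability slots, sketched below in Python, captures the same intent (slot names and the 0-100 scale are hypothetical, not the app's actual scheme):

```python
def compatibility(gm_slots: set[str], player_slots: set[str]) -> float:
    """Jaccard overlap between a GM's and a player's free weekly slots,
    scaled to a 0-100 compatibility score."""
    if not gm_slots or not player_slots:
        return 0.0
    overlap = gm_slots & player_slots
    union = gm_slots | player_slots
    return round(100 * len(overlap) / len(union), 1)

gm = {"fri-evening", "sat-afternoon", "sun-evening"}
alice = {"sat-afternoon", "sun-evening", "wed-evening"}
print(compatibility(gm, alice))  # 2 shared slots of 4 distinct -> 50.0
```

An LLM-based score can additionally weigh soft preferences (play style, tone, session length) that a pure set overlap cannot express, which is presumably why the app delegates scoring to the model.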
Keywords: #phi4, AI-powered, CR vs HP, Challenge Rating, D&D app, DM Co-Pilot, Encounter Architect, File I/O, Groq API, Kaggle dataset, Llama 3.1, Loot Anxiety Curer, Meta Llama 3.1, NPC Generator, Pandas, Python, Quick Improv Tools, SQL-inspired algorithms, Session Scribe, Streamlit, burnout, campaign management, micro-AI generators, narrative journal, workflow automation
github.com 8 days ago
|
2590.
HN
Show HN: Stintly – Offline-first app for freelancers with on-device AI
Stintly is an offline-first application tailored for freelancers and self-employed professionals, offering comprehensive business management tools while prioritizing data privacy. The app facilitates various tasks including invoicing with signature capture, expense tracking through receipt scanning and voice input, time tracking, project management, client management, tax tracking, and analytics—all without requiring accounts or cloud storage. By utilizing on-device AI for over 20 features, such as OCR and tax optimization, Stintly ensures data is processed locally, thus maintaining privacy by keeping information within the device. Developed using React Native/Expo and SQLite, Stintly is accessible for iOS users with a free tier available alongside paid plans starting at $12.99 per month. The app incorporates feedback from freelancers and indie developers to refine its features and includes an optional iCloud backup feature, allowing data synchronization across Apple devices while still maintaining its privacy-first approach.
Keywords: #phi4, Cloudflare Workers, Llama, Metal GPU, OTA updates, Premium tier, Pro tier, R2, React Native, SQLite, Stintly, cash flow forecasting, client insights, client management, estimates, expense tracking, freelancers, iCloud backup, iOS, invoicing, offline-first, on-device AI, privacy-first, project management, receipt OCR, tax optimization, tax tracking, time tracking, voice input
stintly.app 11 days ago
|
2606.
HN
Benchmarking the best base small model for fine-tuning
The study evaluates 12 small language models (SLMs) across eight tasks, demonstrating that fine-tuned SLMs can match or exceed the performance of larger models like GPT-OSS-120B in most benchmarks. Notably, Qwen3-4B-Instruct-2507 achieves remarkable success, especially on the SQuAD 2.0 task, emerging as the preferred model for maximum accuracy when GPU memory permits. Smaller models exhibit substantial improvements from fine-tuning; Llama-3.2-1B shows the highest tunability and offers significant benefits in resource-constrained settings.
Before fine-tuning, larger models such as Qwen3-8B have the best base performance. After fine-tuning, however, the gap between smaller and larger models shrinks markedly, showing that domain-specific adaptation matters more than initial model size. The study recommends Qwen3-4B for peak accuracy, Llama-3.2-1B or Qwen3-0.6B under tight compute budgets, and Qwen3-8B when fine-tuning is not feasible.
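The "tunability" comparison can be made concrete with a small helper. The sketch below is not from the study; it shows one plausible way to quantify how much fine-tuning closes a small model's gap to a larger reference model (all numbers are illustrative):

```python
def gap_closed(base: float, tuned: float, reference: float) -> float:
    """Fraction of the accuracy gap to a larger reference model that
    fine-tuning recovers: 0.0 = no gain, 1.0 = gap fully closed."""
    gap = reference - base
    if gap <= 0:
        return 1.0  # already at or above the reference model
    return min((tuned - base) / gap, 1.0)

# Illustrative numbers only: a 1B model at 0.42 accuracy zero-shot and
# 0.78 after fine-tuning, versus a large reference model at 0.81.
print(gap_closed(0.42, 0.78, 0.81))  # -> ~0.92 of the gap closed
```

A metric like this makes the study's recommendation legible: when the gap-closed fraction approaches 1.0, the smaller model's lower inference cost dominates the decision.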
Ultimately, the research suggests that the benefits of fine-tuning can surpass those derived from selecting a larger base model, enabling smaller models to achieve comparable performance at reduced computational costs. This finding underscores the potential for deploying small models efficiently in on-premises setups and resource-limited environments like mobile or IoT devices. The study plans further expansion of benchmarks to enhance the robustness and reliability of these insights.
Keywords: #phi4, Fine-tuning, GPT-OSS, Gemma, Granite, Llama, LoRA, Qwen3-4B, SQuAD 2.0, SmolLM2, base performance, benchmarks, classification, consumer GPU, distil labs, distillation pipeline, edge deployment, evaluation, examples, expert agent, few-shot, hyperparameters, inference costs, learning rate, production-ready model, question answering, small models, synthetic data, task description, tunability, zero-shot
www.distillabs.ai 11 days ago
|
3060.
HN
Show HN: PureBee – A software-defined GPU running Llama 3.2 1B at 3.6 tok/SEC
PureBee is a software-defined GPU: it performs machine-learning inference without conventional GPU hardware. The project began as a theoretical question of whether silicon is actually required for GPU-like operations, and demonstrates that core GPU functionality can be expressed as parallelized mathematical rules in software. The system runs on a single CPU core using pure software constructs and achieves 3.6 tokens per second on Llama 3.2 1B.
The architecture comprises four distinct, auditable layers (Runtime, Instruction Set, Engine, and Memory) that together handle execution, matrix operations, and data management. This structure, combined with WebAssembly (WASM), quantization, SIMD, and worker threads, yields a 45× speedup over the original JavaScript baseline.
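Of the optimizations listed, quantization is the easiest to illustrate. PureBee itself is JavaScript/WASM; the Python sketch below shows the general int8 weight-quantization technique (one scale factor per weight row), not PureBee's actual code:

```python
def quantize_row(weights: list[float]) -> tuple[list[int], float]:
    """Quantize one weight row to int8 with a single scale factor.
    Storing int8 instead of float32 cuts weight memory 4x and speeds
    up the inner dot-product loop, the hot path of matmul."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dot_q8(q: list[int], scale: float, x: list[float]) -> float:
    """Dot product of a quantized weight row with a float activation
    vector; the scale is applied once, after the integer accumulation."""
    return scale * sum(qi * xi for qi, xi in zip(q, x))
```

The trade-off is a small rounding error per weight (bounded by half the scale), which is why quantized inference gives slightly different, but usually acceptable, outputs.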
PureBee's defining feature is the elimination of hardware dependencies such as GPUs or CUDA: it runs on any device with Node.js 20 or newer. This broadens access to AI inference across a wide range of devices while sustaining full inference workloads without GPU acceleration. The project is open source under the FSL-1.1 license, with a planned move to Apache 2.0, supports a range of model sizes, and welcomes contributions that follow its principles of clarity and transparency.
To deploy PureBee, users clone the GitHub repository, download models via Node.js, and run the inference scripts, allocating sufficient memory through Node's heap-size settings. The initiative challenges traditional assumptions about AI hardware requirements and promotes a flexible, transparent approach to AI computing, with plans for browser deployment and expanded instruction sets.
Keywords: #phi4, CPU core, JavaScript, Llama 3.2, PureBee, SIMD, WASM, accessibility, architecture, inference, performance, portability, quantization, software-defined GPU, tok/sec, transparency
github.com 13 days ago
https://github.com/PureBee/purebee/pull/1 12 days ago
|