# Grey Newell

Title: ML Infrastructure Engineer
Description: Building evaluation, inference, and observability systems for AI. Creator of the MIST stack. CTO at Supermodel. MS CS (ML) at Georgia Tech. Ex-AWS.
URL: https://greynewell.com

## Social Links

- https://github.com/greynewell
- https://www.linkedin.com/in/greynewell/
- https://www.youtube.com/@greynewell
- https://x.com/greynewell
- https://www.crunchbase.com/person/grey-newell
- https://www.wikidata.org/wiki/Q136955785
- https://scholar.google.com/citations?hl=en&user=RoTkOCIAAAAJ
- https://www.npmjs.com/~greynewell
- https://pypi.org/user/greynewell/

---

# Blog Posts

## SWE-bench Verified: How fail_to_pass Tests and Task Instances Work (And Why It's Broken)

Date: 2026-03-06
URL: https://greynewell.com/blog/swe-bench-verified-broken-5-things-source-code/
Description: How SWE-bench Verified's fail_to_pass and pass_to_pass tests and task instances actually work — and why every frontier model score is contaminated. Source code analysis.

I've built 1,798 custom SWE-bench containers that run natively on ARM processors. I've also run SWE-bench Lite, Verified, and Pro more than 100 times evaluating prototype products at Supermodel. This post covers some of the confusing, broken, or just plain odd things I've learned by working with SWE-bench and reading the source code directly.

### 1. Every problem predates October 2023

While checking logs from an agent run, I noticed something very odd: the problem SWE-bench gave the agent to evaluate was a GitHub issue from 2017. That's really old! Most frontier models' training data cuts off between 2023 and 2024. If most of the problems are older than that, then the repository, the GitHub issue, and the solution have almost certainly leaked into and contaminated the models. Each SWE-bench instance is taken from a popular open source repository, exactly the type of data ALL LLMs are trained on. I decided to keep digging: are all of the problems this old?
The SWE-bench paper (Appendix Table 21) reports the temporal distribution of all task instances:

| Year | Task instances | % of total |
|---|---|---|
| < 2018 | 89 | 4.2% |
| 2018 | 165 | 7.7% |
| 2019 | 437 | 20.4% |
| 2020 | 427 | 20.0% |
| 2021 | 383 | 17.9% |
| 2022 | 395 | 18.5% |
| 2023 | 244 | 11.4% |
| **Total** | **2,140** | |

The collection pipeline scraped the top 100 PyPI repos as of August 2023 (paper Appendix A.1). The paper was published October 10, 2023. SWE-bench Verified (500 curated problems) was released in August 2024. Frozen data, no new problems.

The pipeline itself (`get_tasks_pipeline.py`) has no default cutoff:

```python
parser.add_argument(
    "--cutoff_date",
    type=str,
    help="Cutoff date for PRs to consider in format YYYYMMDD",
    default=None,
)
```

Because the test set is frozen in time, any model trained after October 2023 has likely seen most or all of the problems and solutions. This confounds accuracy measurements and produces unreliable results.

### 2. The harness is x86-first

SWE-bench was designed to run on x86 hardware, and the prebuilt images from Epoch AI only support x86. This design decision excludes native execution on any recent-generation Apple hardware as well as cost-effective cloud runners like AWS Graviton. Those architectures are instead forced to emulate x86 with QEMU or Rosetta, and the result runs very slowly. By compiling SWE-bench containers specifically for ARM, I measured a 6.3x speedup on my M3 MacBook Pro, although 496 containers still require x86 emulation due to missing ARM binaries. A newer set of test instances could support ARM by default, and a few small changes would improve ARM support throughout the existing benchmark.

`make_test_spec()` defaults to x86:

```python
def make_test_spec(
    ...
    arch: str = "x86_64",
```

No caller in the codebase passes a different value.
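A minimal sketch of the host-architecture auto-detection the harness could default to instead (this is my suggestion, not upstream code; the function name is mine):

```python
import platform

def detect_arch() -> str:
    """Map the host machine to the harness's two arch values (a sketch)."""
    machine = platform.machine().lower()
    # Apple Silicon reports "arm64"; Linux on Graviton reports "aarch64".
    if machine in ("arm64", "aarch64"):
        return "arm64"
    return "x86_64"

print(detect_arch())
```

With something like `make_test_spec(..., arch=detect_arch())` as the default, ARM hosts would get native images automatically while callers could still force x86 for the instances that need it.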
The platform mapping supports ARM64, but nobody invokes it:

```python
@property
def platform(self):
    if self.arch == "x86_64":
        return "linux/x86_64"
    elif self.arch == "arm64":
        return "linux/arm64/v8"
    else:
        raise ValueError(f"Invalid architecture: {self.arch}")
```

Several language Dockerfiles hardcode x86 binaries (others are architecture-aware):

| Language | File | What's hardcoded |
|---|---|---|
| JavaScript | `dockerfiles/javascript.py` line 27 | `deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main` |
| JavaScript | `dockerfiles/javascript.py` line 108 | `pnpm-linux-x64` binary download |
| Java | `dockerfiles/java.py` lines 15-19 | `maven-mvnd-1.0.2-linux-amd64.zip` |
| Go | `dockerfiles/go.py` lines 16-46 | Architecture-aware (uses `dpkg --print-architecture`) |
| Python | `dockerfiles/python.py` line 24 | Architecture-aware (uses `conda_arch` variable) |

`USE_X86` defines the 496 instance IDs that require x86. It's exported in `__init__.py` but never referenced in build or evaluation logic. An unmerged `force_x86` branch suggests it was intended to be used but never was. The README recommends an x86_64 machine and calls ARM64 support "experimental."

While not strictly "broken," the under-implemented ARM support prevents users from running the benchmark efficiently on popular local machines or on cost-effective modern cloud hardware. On top of that, the benchmark problems don't measure what you might assume.

### 3. Problems test the last mile, not exploration

Counter to popular intuition, SWE-bench problems are mostly well-scoped. This is by design. If you look at logs of agents working on the problems, you don't see the agent navigating an unfamiliar codebase, finding key files, and reasoning about the architecture. The agent is being tested on writing a small, targeted fix once the general solution is known. I'd argue this is a feature of the benchmark (a controlled measurement), but we should all calibrate our expectations about what an SWE-bench score means.
SWE-bench Lite explicitly filters for small, single-file patches (`make_lite.py`):

```python
def filter_patch(instance):
    patch_text = instance["patch"]
    if (
        contains_non_modified_files(patch_text)
        or not leq_n_files(patch_text, 1)
        or not leq_n_hunks(patch_text, 3)
    ):
        return False
    return True
```

The scope constraints from `criteria.py`:

| Constraint | Function | Threshold |
|---|---|---|
| Max files in gold patch | `leq_n_files()` | 1 |
| Max hunks | `leq_n_hunks()` | 3 |
| Max lines changed | `leq_n_code_lines()` | 25 |
| No added/removed files | `contains_non_modified_files()` | 0 |

Even in full SWE-bench, each problem maps to a single PR. Test vs. fix is split by path matching (`utils.py`):

```python
def extract_patches(pull: dict, repo: Repo) -> tuple[str, str]:
    patch = requests.get(pull["diff_url"]).text
    patch_test = ""
    patch_fix = ""
    for hunk in PatchSet(patch):
        if any(
            test_word in hunk.path
            for test_word in ["test", "tests", "e2e", "testing"]
        ):
            patch_test += str(hunk)
        else:
            patch_fix += str(hunk)
    return patch_fix, patch_test
```

The model receives the issue text and the full repo state at the commit before the fix. There's no ambiguity about which project, which branch, or which codebase. The job is to produce a diff.

Much like the cultural debate among technologists about the diverging roles of "coders" vs. "software engineers," the benchmark is an efficient measure of a model's ability to generate a narrowly targeted fix. In its current form, it doesn't test codebase navigation or architectural reasoning.

### 4. Tests reject correct solutions

In February 2026, OpenAI published an audit of 138 SWE-bench Verified problems (27.6% of the 500-problem set) that o3 did not consistently solve over 64 independent runs. They found that 59.4% had test design flaws that reject functionally correct submissions. I've seen the same pattern replicated over hundreds of SWE-bench instances: test suites sometimes reject working code that solves the original issue.
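One source of test/fix misalignment is visible already in the `extract_patches` heuristic above: test vs. fix is decided by plain substring matching on file paths. A simplified reimplementation (not the upstream code, which iterates real patch hunks) shows how loose that is:

```python
TEST_WORDS = ["test", "tests", "e2e", "testing"]

def is_test_path(path: str) -> bool:
    # Mirrors the split in utils.py: substring match on the whole path,
    # not a match against path segments.
    return any(word in path for word in TEST_WORDS)

print(is_test_path("tests/test_auth.py"))    # True, as intended
print(is_test_path("django/db/models.py"))   # False
print(is_test_path("src/latest_quotes.py"))  # True: "test" is inside "latest"
```

Any production file whose path happens to contain "test" (or "e2e") gets routed into the test patch, which silently changes what the benchmark considers part of the fix.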
The evaluation works by providing "fail to pass" tests that must fail before the patch and pass after it, and "pass to pass" tests that must keep succeeding, for a solution to be marked correct. The tests are brittle to the point that correct fixes can still break the suite.

| Issue type | % of audited problems | Description |
|---|---|---|
| Narrow tests | 35.5% | Enforce specific implementation details, rejecting correct alternatives |
| Wide tests | 18.8% | Check functionality not specified in the problem description |
| Miscellaneous | 5.1% | Other test design issues |
| No issue found | 40.6% | Tests are fine |

#### Narrow tests

Some tests are too "narrow": they check for specific implementation details that are not hard requirements for solving the problem at hand. For example, in pylint-dev__pylint-4551, the problem description asks for Python type hints in UML generation. The PR introduces a function called `get_annotation`. The test file imports it by name:

```python
from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node
```

The problem description never mentions `get_annotation`. A correct solution using any other function name fails with:

```
ImportError: cannot import name 'get_annotation' from 'pylint.pyreverse.utils'
```

The result: a correct solution is erroneously marked as incorrect.

#### Wide tests

Some tests are too "wide" by contrast: they cover issues not mentioned in the evaluation scenario. Models almost always fail to fix issues that were not described in the problem statement.

In sympy__sympy-18199, the PR fixed three distinct issues: #17373, #17377, and #18212. The SWE-bench task description only describes #18212 (the `nthroot_mod` function misses one root of x = 0 mod p). The tests cover all three. Models that correctly fix #18212 fail tests for the other two issues they were never told about.
#### The codebase acknowledges this

The Lite filter explicitly removes tests that check exact error messages (`criteria.py`):

```python
def contains_pytest_match_arg(patch_test_text: str) -> bool:
    if any(
        [
            x in patch_test_text
            for x in [
                "pytest.raises",
                "pytest.warns",
                "pytest.deprecated_call",
            ]
        ]
    ):
        return "match" in patch_test_text
    if any(
        [
            x in patch_test_text
            for x in [
                "assertOutput",
                "assertRaises",
                "checks.Error",
            ]
        ]
    ):
        return True
    return False
```

These patterns are excluded from Lite because a correct fix with different error message wording fails them.

The grading logic treats any test missing from the log parser output as a failure, not as unknown (`grading.py`):

```python
def test_passed(case: str, sm: dict[str, str]) -> bool:
    return case in sm and sm[case] in [TestStatus.PASSED.value, TestStatus.XFAIL.value]

def test_failed(case: str, sm: dict[str, str]) -> bool:
    return case not in sm or sm[case] in [
        TestStatus.FAILED.value,
        TestStatus.ERROR.value,
    ]
```

Resolution requires 100% on both fail-to-pass and pass-to-pass:

```python
if f2p == 1 and p2p == 1:
    return ResolvedStatus.FULL.value
elif f2p < 1 and f2p > 0 and p2p == 1:
    return ResolvedStatus.PARTIAL.value
else:
    return ResolvedStatus.NO.value
```

The log parsers themselves are fragile. From the Django parser:

```python
# TODO: This is very brittle, we should do better
# There's a bug in the django logger, such that sometimes a test output near the end gets
# interrupted by a particular long multiline print statement.
```

And a one-off workaround for a single instance:

```python
# TODO: Temporary, exclusive fix for django__django-7188
if line.strip().startswith(
    "Applying sites.0002_alter_domain_unique...test_no_migrations"
):
    line = line.split("...", 1)[-1].strip()
```

The JavaScript Karma parser carries a similar warning:

```python
def parse_log_karma(log: str, test_spec: TestSpec) -> dict[str, str]:
    """
    Different immutable.js instances use different test runners and log formats.
    Logic is brittle.
    """
```
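Treating a missing test as "unknown" rather than "failed" would take only a third status in the grading logic. Here's a sketch of that alternative (my suggestion, not upstream code; the status strings mirror the harness's `TestStatus` values):

```python
from enum import Enum

class Outcome(Enum):
    PASSED = "PASSED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"  # test never appeared in the parsed log output

def classify(case: str, status_map: dict[str, str]) -> Outcome:
    # Upstream grading.py collapses this case into "failed".
    if case not in status_map:
        return Outcome.UNKNOWN
    if status_map[case] in ("PASSED", "XFAIL"):
        return Outcome.PASSED
    return Outcome.FAILED

sm = {"test_a": "PASSED", "test_b": "FAILED"}
print(classify("test_a", sm), classify("test_c", sm))
```

Reporting unknowns separately would at least let benchmark consumers see how often a verdict rests on a parser miss rather than on an actual test failure.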
""" In summary, the combination of: Narrow tests Wide tests All-or-nothing grading Brittle parsing ...causes SWE-bench to reject an unknown number of correct solutions, biasing the scores. 5. Models have memorized the answers Touching back on issue #1: the coding issues provided by SWE-bench are old, public, and there's proof that large models have stored these specific problems and solutions in their weights. Connecting with issue #4, models pass narrow tests specifically because they memorized the implementation details the test is checking for. Uncontaminated models trying correct-but-different solutions get rejected entirely. What to make of all of this? OpenAI's conclusion (February 2026): "improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time." GPT-5.2 -- django__django-11451 Problem statement: ModelBackend.authenticate() shouldn't make a database query when username is None When prompted with just the task ID and a hint, GPT-5.2 reproduced the exact gold patch: class ModelBackend(BaseBackend): def authenticate(self, request, username=None, password=None, **kwargs): + if username is None or password is None: + return UserModel = get_user_model() if username is None: username = kwargs.get(UserModel.USERNAME_FIELD) It also referenced Django release history in its chain of thought: "There is also edit_only parameter maybe added around 4.1 or 4.2. Since this is 4.1 dev 2022, the code might be before introduction. We will implement now." 
#### Claude Opus 4.5 -- astropy__astropy-13236

When asked to name the exact file path, function, and inline comment, Opus responded:

File: `astropy/table/table.py`, in the `_convert_data_to_col` method.

Inline comment (word-for-word):

```python
# Structured ndarray gets viewed as a mixin unless already a valid
# mixin class
```

Changed code:

```python
if (not isinstance(data, Column) and not data_is_mixin
        and isinstance(data, np.ndarray) and len(data.dtype) > 1):
    data = data.view(NdarrayMixin)
    data_is_mixin = True
```

The gold patch removes exactly those lines.

#### Gemini 3 Flash -- django__django-11099

Given only the task ID and a one-line problem statement (UsernameValidator allows trailing newline in usernames), Gemini reproduced the complete gold patch, including the exact regex, file paths, and surrounding context:

```diff
 class ASCIIUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'

 class UnicodeUsernameValidator(validators.RegexValidator):
-    regex = r'^[\w.@+-]+$'
+    regex = r'^[\w.@+-]+\Z'
```

In essence, higher scores on this benchmark correlate with increased model contamination rather than increased general software engineering ability. OpenAI recommends practitioners migrate to SWE-bench Pro.

### Conclusion

If there are three things I want you to take away from this post, here they are:

1. SWE-bench is a well-engineered and useful tool, but it measures a narrower set of capabilities than "can AI do software engineering."
2. OpenAI stopped reporting Verified scores in February 2026 and recommends SWE-bench Pro.
3. When you see a SWE-bench score on a model card, now you know what questions to ask.

My work in this area will continue with a GitHub Actions harness for generating and evaluating SWE-bench Pro scores.

*Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The eval harness source is at github.com/greynewell/swe-bench-fast.*
---

## SWE-bench Tests Run 6x Faster on ARM64 with Native Containers

Date: 2026-03-05
URL: https://greynewell.com/blog/swe-bench-arm64-native-containers-6x-faster/
Description: SWE-bench's pre-built x86 containers run through QEMU emulation on ARM64 hosts like Apple Silicon and AWS Graviton. I built native ARM64 images and measured a 6.3x speedup on the test runner.

If you're running SWE-bench evaluations on ARM64 hardware, your test suites are running under x86 emulation. Apple Silicon Macs, AWS Graviton instances, it doesn't matter: the pre-built images are x86_64, and QEMU translates every instruction at runtime.

SWE-bench's FAQ lists ARM support as "experimental" and recommends an x86_64 machine. In practice, that means conda installs, pip builds, and pytest runs all go through QEMU's user-space translation layer. It works. It's just slow.

I wrote swe-bench-fast, a Go reimplementation of the SWE-bench eval harness that builds native ARM64 container images. On the test runner, I measured a 6.3x speedup over the emulated x86 images. I benchmarked on an M3 Pro, but the images run natively on Graviton3 and Graviton4 too.

### The 6.3x speedup

I selected 11 SWE-bench instances (one per repository) and ran the same gold patches and test suites through both harnesses on the same machine. All images were pre-built and cached locally, and the patches were pre-computed. No agent inference time is included; this is purely test-runner wall-clock time: container start, patch apply, pytest, grade.

Machine: MacBook Pro M3 Pro (12 cores, 36 GB RAM). Docker: Colima VM with 10 CPUs, 28 GB RAM, linux/arm64.
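As a sanity check on the headline number, the aggregate speedup is simply the ratio of the total wall-clock times across the 11 instances:

```python
arm64_total = 87.3   # seconds for all 11 instances, native ARM64
x86_total = 551.7    # seconds for the same instances under QEMU

speedup = x86_total / arm64_total
print(round(speedup, 1))  # 6.3
```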
| Instance | ARM64 native (s) | x86 emulated (s) | Speedup | Result match |
|---|---|---|---|---|
| astropy__astropy-12907 | 2.7 | 9.7 | 3.7x | yes |
| django__django-13346 | 2.7 | 18.9 | 7.0x | yes |
| matplotlib__matplotlib-14623 | 38.0 | 265.7 | 7.0x | yes |
| mwaskom__seaborn-3069 | 15.4 | 101.0 | 6.6x | yes |
| pallets__flask-5014 | 1.0 | 3.9 | 3.9x | yes |
| psf__requests-1142 | 1.1 | 4.8 | 4.3x | yes |
| pylint-dev__pylint-7277 | 14.0 | 76.0 | 5.4x | yes |
| pytest-dev__pytest-6197 | 4.7 | 28.2 | 6.1x | yes |
| scikit-learn__scikit-learn-25102 | 2.7 | 18.2 | 6.6x | yes |
| sphinx-doc__sphinx-10323 | 3.1 | 17.2 | 5.6x | yes |
| sympy__sympy-11618 | 1.9 | 8.0 | 4.2x | yes |
| **Total** | **87.3** | **551.7** | **6.3x** | **11/11** |

The repos with heavier test suites (matplotlib at 265s emulated, seaborn at 101s) showed the largest absolute gains. All 11 instances produce identical results on both harnesses. The full benchmark data and raw notes are in this gist.

### 78% of SWE-bench runs natively on ARM64

Out of 2,294 instances in the full SWE-bench dataset, 1,798 build and run natively on ARM64. The remaining 496 require x86 because they depend on binary conda packages (scikit-learn, matplotlib, xarray) that aren't published for ARM. Those 496 instances still run under QEMU, so there's no coverage gap; the 78% that go native just stop paying the emulation tax.

| Repository | ARM64 native | x86 required |
|---|---|---|
| django/django | 811 | 39 |
| sympy/sympy | 382 | 4 |
| scikit-learn/scikit-learn | 37 | 192 |
| matplotlib/matplotlib | 37 | 147 |
| pydata/xarray | 0 | 110 |
| sphinx-doc/sphinx | 185 | 2 |
| pytest-dev/pytest | 118 | 1 |
| astropy/astropy | 94 | 1 |
| Others | 134 | 0 |

The list of x86-only instances is defined in `USE_X86` in the SWE-bench source.

### Comparable image sizes

I built all 11 benchmarked instances as native ARM64 images and compared on-disk sizes against the Epoch x86_64 images.
| Instance | ARM64 native | x86 Epoch | Difference |
|---|---|---|---|
| astropy__astropy-12907 | 3.41 GB | 3.20 GB | +6.6% |
| django__django-13346 | 3.34 GB | 3.44 GB | -2.9% |
| matplotlib__matplotlib-14623 | 5.95 GB | 6.03 GB | -1.3% |
| mwaskom__seaborn-3069 | 3.98 GB | 3.30 GB | +20.6% |
| pallets__flask-5014 | 3.30 GB | 2.97 GB | +11.1% |
| psf__requests-1142 | 3.11 GB | 2.67 GB | +16.5% |
| pylint-dev__pylint-7277 | 3.28 GB | 2.89 GB | +13.5% |
| pytest-dev__pytest-6197 | 3.11 GB | 2.71 GB | +14.8% |
| scikit-learn__scikit-learn-25102 | 4.20 GB | 5.96 GB | -29.5% |
| sphinx-doc__sphinx-10323 | 3.36 GB | 3.00 GB | +12.0% |
| sympy__sympy-11618 | 3.20 GB | 3.10 GB | +3.2% |

On-disk sizes are mixed: scikit-learn is 29.5% smaller on ARM64, django 2.9% smaller, and most others are 3-20% larger due to differences in base image layers. By compressed content size (what actually gets pulled), ARM64 images average about 4% smaller. The Dockerfiles and package lists are identical to upstream. swe-bench-fast builds images through BuildKit with in-memory tar build contexts, which avoids the stray build artifacts that the upstream Python harness leaks into image layers. Net effect: native ARM64 images are roughly the same size.

### What I had to fix

Four issues anyone hitting this path will encounter:

**Conda channel config changed.** Miniconda py311_23.11.0-2 now defaults to conda-forge only with `channel_priority: strict`. Older packages like setuptools==38.2.4 live on the defaults channel and won't resolve. The fix: explicitly configure both channels before building env images.

**make_test_spec defaults to x86_64.** Every call to `make_test_spec` hardcodes `arch="x86_64"`. On ARM hosts, this means images are built for the wrong architecture unless you explicitly override it. I opened a PR (issue) to auto-detect via `platform.machine()`.

**x86-only instances need enforcement.** Some instances must be x86 regardless of host arch. Without checking `USE_X86` in the build pipeline, these instances silently get ARM images that fail at runtime.
The broader ARM64 support PR by @SailorJoe6 addresses this along with JS and Java language support.

**Unpinned transitive dependencies break tests.** The upstream specs pin direct dependencies but not all transitives. When `pip install -e .[test]` resolves on ARM64, it can pull newer package versions than what the Epoch x86 images were built with. For sphinx instances, Pygments==2.19 changed the HTML output for line number spans, causing pass-to-pass test failures. Pinning Pygments==2.18.0 to match the Epoch images fixed it. Any repo with HTML/rendering assertions is vulnerable to this kind of drift.

### Try it yourself

swe-bench-fast is a standalone Go binary. It pulls pre-built ARM64 images from Docker Hub for the 78% of instances that support them, and Epoch x86 images for the rest. No Python, no image builds.

```bash
swe-bench-fast run --dataset swe-bench-full.jsonl --predictions preds.jsonl
```

That works on both ARM64 and x86. On ARM64, 1,798 instances run natively and 496 run under QEMU. On x86, everything runs natively via the Epoch images.

On an M-series Mac, allocate at least 120 GB of disk and 8+ CPU cores to Docker Desktop or Colima. On AWS Graviton (c7g, m7g, r7g, r8g), Docker runs natively with no VM layer; install qemu-user-static for the x86-only instances. Graviton instances typically cost 20-40% less than comparable x86 EC2, and that cost difference plus the 6x speedup makes a real difference in iteration time.

The benchmark gist has the full methodology, raw data, and detailed notes.

### What's next

I'm building and pushing the 1,798 ARM64-native SWE-bench instance images to Docker Hub. The next post covers what that full build taught me about how SWE-bench actually works under the hood.

*Grey Newell is a computer science researcher and graduate student at Georgia Institute of Technology. The raw benchmark data is available at gist.github.com. The eval harness source is at github.com/greynewell/swe-bench-fast.*
---

## Why Code Graphs Matter for AI Agents

Date: 2026-03-02
URL: https://greynewell.com/blog/why-code-graphs-matter/
Description: AI coding agents lose critical structural understanding of codebases when context compaction occurs. Code graphs provide persistent external memory—representing functions, classes, and dependencies as queryable relationships—so agents can recover context without re-reading files from scratch.

AI coding agents face a significant challenge: context loss during conversation compaction. As sessions progress and conversation history grows, agents must compress older messages to stay within finite context windows. This process often discards critical structural information about codebases—function signatures, dependency chains, and architectural decisions disappear.

### The Compaction Problem

Every AI agent grapples with the tension between finite context windows and infinite codebases. When compaction occurs without a persistent structural model, the agent loses track of previously analyzed code relationships. This leads to inefficient behavior: agents re-read files, repeat analysis, and lose architectural understanding they've already developed.

### What Goes Wrong in Practice

A concrete example illustrates the issue. During a 45-minute refactoring session, an agent traces a complete call chain from the API layer through service classes to the database. It understands entry points, internal utilities, and shared features. Then compaction hits. The agent discards this architectural work and must re-read files from scratch on the next request, asking questions it already answered and potentially making conflicting changes.

### Code Graphs as a Solution

Code graphs provide persistent external memory by representing codebases as structured relationships between functions, classes, modules, and their connections.
Through tools like Supermodel's MCP server, agents can query for:

- Functions within modules
- File dependencies
- Call chains for features
- Type definitions and usage patterns

As the saying goes, "Graph queries give you structure and relationships, not just text matches."

### Beyond Compaction: Broader Applications

Code graphs enable several advanced capabilities:

- **Dead Code Detection:** Identify unused functions and classes without reading entire codebases.
- **Impact Analysis:** Determine which modules depend on utilities before modifications to prevent unintended ripple effects.
- **Test Coverage Analysis:** Trace which functions each test exercises directly from call graphs.
- **Codebase Evaluation:** Assess domain structure, dependency health, and module coupling quickly.
- **Documentation Generation:** Ground documentation in actual code structure rather than potentially outdated comments.
- **Developer Onboarding:** Provide new team members and agents with structural maps for faster orientation.

### Why This Matters Now

As agents tackle increasingly complex multi-file tasks, the compaction problem intensifies. While simple bug fixes may survive context compression, large refactors across many files expose the limitations of purely conversation-based context. Code graphs are essential infrastructure for serious AI-assisted development.

---

## Building Uncompact: Lessons from Production

Date: 2026-02-28
URL: https://greynewell.com/blog/building-uncompact-lessons-from-production/
Description: How Supermodel built Uncompact—a tool that maintains a persistent code graph across Claude Code's context compaction events—and the key lessons learned shipping it to production: simplicity over detail, invisibility enables adoption, and layered verification over blind trust.

The fundamental issue isn't compaction itself—it's necessary given finite context windows. Rather, agents lacked mechanisms to store structural understanding outside conversations.
Unlike human developers, who lean on IDEs and documentation, AI agents had no external reference system. Uncompact was built to fill that gap.

### The Solution

Uncompact maintains a persistent code graph that survives compaction events. When agents need codebase structure, they query this graph rather than re-reading files. The critical design principle: the graph must stay current. Stale information undermines agent confidence, so incremental updates trigger on every file save rather than complete rebuilds.

### Installation

Setup requires running `npm install -g uncompact --foreground-scripts` followed by `uncompact auth login` with a Supermodel API key. The tool auto-registers as a Claude Code hook during initialization, requiring no additional configuration.

### Technical Architecture

Instead of rebuilding entire graphs on changes, Uncompact processes only modified files and their immediate graph neighbors. Editing PaymentService.ts triggers re-analysis of that file and its connected dependencies; the rest of the graph stays unchanged. This approach mirrors incremental compilation.

### User Experience Impact

Post-compaction, agents can query the graph for structural information ("What calls processPayment?") rather than searching retained context. The graph provides accurate, current answers independent of compaction frequency, enabling seamless context recovery.

### Key Lessons

**Simplicity matters.** Early versions captured excessive detail. Effective versions focus on the crucial relationships: the graph should answer structural questions, not replicate the source.

**Invisibility enables adoption.** Background processes requiring no maintenance drive usage. If developers have to think about the tool, they'll stop using it.

**Layered verification works.** Graphs indicate where to look; agents still examine the actual code for specifics. The graph is a map, not a replacement for reading the territory.
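The incremental "modified file plus immediate neighbors" update described above can be sketched with a toy dependency graph (file names and graph shape are illustrative; Uncompact's real graph tracks much finer-grained relationships than file imports):

```python
# Toy dependency graph: file -> files it imports.
deps = {
    "PaymentService.ts": ["StripeClient.ts", "Invoice.ts"],
    "Checkout.ts": ["PaymentService.ts"],
    "Invoice.ts": [],
    "StripeClient.ts": [],
    "Unrelated.ts": [],
}

def files_to_reanalyze(changed: str) -> set[str]:
    """Return the changed file plus its immediate neighbors:
    everything it imports, and everything that imports it."""
    neighbors = set(deps.get(changed, []))
    neighbors |= {f for f, imports in deps.items() if changed in imports}
    return {changed} | neighbors

print(sorted(files_to_reanalyze("PaymentService.ts")))
# ['Checkout.ts', 'Invoice.ts', 'PaymentService.ts', 'StripeClient.ts']
```

Note that `Unrelated.ts` never gets touched, which is the whole point: work on a save is proportional to the neighborhood of the edit, not the size of the repository.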
---

## The Architecture of Supermodel's Code Graph API

Date: 2026-02-25
URL: https://greynewell.com/blog/supermodel-code-graph-api-architecture/
Description: A look inside Supermodel's real-time code analysis API: the five-stage processing pipeline, multi-language abstraction via a unified node schema, incremental graph updates, and the sub-100ms response time requirement that shaped every design decision.

Supermodel's engineering team built a real-time code analysis API designed to handle millions of lines of code across multiple programming languages. The core requirement was speed: the system needed to respond quickly enough that AI agents could query it mid-conversation without noticeable delays.

### The Processing Pipeline

The system operates through five sequential stages:

1. **File ingestion** — Monitoring and processing only new or modified files
2. **Language-specific parsing** — AST parsers extract structural elements from supported languages
3. **Graph construction** — Parsed elements become nodes and edges in a directed graph
4. **Storage and indexing** — Graph storage enables fast traversal queries
5. **API serving** — RESTful endpoints deliver sub-100ms response times

### Technical Approach

**Multi-Language Abstraction:** Rather than building separate systems per language, the team created a unified node schema capturing essential code properties—name, kind (function, class, module), location, and relationships—regardless of syntax differences. This lets the rest of the pipeline treat all languages identically once parsing is complete.

**Incremental Updates:** When files change, the system invalidates only the affected nodes, re-parses the modified files, and merges updates back into the graph while preserving cross-file relationships. This keeps the graph current without the latency of full rebuilds.
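A unified node schema of this kind can be as small as a single record type. Here's a sketch of the idea (the field names are my assumptions for illustration, not Supermodel's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Language-agnostic code graph node (illustrative sketch)."""
    name: str
    kind: str  # e.g. "function", "class", "module"
    path: str  # file location, regardless of source language
    line: int
    # Outgoing edges as (edge_kind, target_name) pairs, e.g. ("calls", ...).
    relationships: list[tuple[str, str]] = field(default_factory=list)

n = Node(name="processPayment", kind="function",
         path="src/PaymentService.ts", line=42)
n.relationships.append(("calls", "StripeClient.charge"))
print(n.kind, len(n.relationships))  # function 1
```

Once a TypeScript parser and a Python parser both emit `Node` values like this, graph construction, storage, and the query API downstream never need to know which language a node came from.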
### Future Direction

The roadmap includes semantic analysis extending beyond structural relationships to understand data flow, shared invariants, and code patterns between elements—moving from where code lives to what it does.

---

## Implement Event-Driven Invoice Processing for Resilient Financial Monitoring at Scale

Date: 2025-05-12
URL: https://greynewell.com/blog/event-driven-invoice-processing-resilient-financial-monitoring/
Description: How to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events.

Processing high volumes of invoices efficiently while maintaining low latency, high availability, and business visibility is a challenge for many organizations. A customer recently consulted us on how to implement a monitoring system to process and visualize large volumes of invoice status events. This post demonstrates how to build a Business Event Monitoring System (BEMS) on AWS that handles over 86 million daily events with near real-time visibility, cross-Region controls, and automated alerts for stuck events. You might deploy this system for business-level insights into how events flow through your organization, or to visualize the flow of transactions in real time. Downstream services also have the option to process and respond to events originating within the system.

### Business challenge

For our use case, a global enterprise wants to deploy a monitoring system for their invoice event pipeline. The pipeline processes millions of events per period, projected to surge 40% within 18 months. Each invoice must navigate a four-stage journey while making sure every event is visible within 2 minutes. End-of-month invoice surges reach 60,000 events per minute, or up to 86 million per day.
With payment terms spanning from standard 30-day windows to year-long arrangements, the architecture demands zero tolerance for missing events. Finance executives require near real-time visibility through dashboards, and auditors demand comprehensive historical retrieval.

### Solution overview

The architecture implements a serverless event-driven system broken into independently deployable Regional cells, as illustrated in the following diagram.

The solution uses the following key services:

**Amazon API Gateway** – Clients send events into the solution using HTTPS calls to a REST API. API Gateway was selected for its REST support, its event-based integrations with other AWS services, and its throttling support, which prevents individual callers from overloading the system.

**Amazon EventBridge** – Events created by API Gateway need to be routed to downstream consumers and archived so they can be replayed later. EventBridge provides a custom event bus with rules that intelligently route events based on their contents.

**Amazon Simple Notification Service (Amazon SNS)** – To keep EventBridge rules simple, events are routed by type to one or more destinations for fanout. SNS topics serve as routing targets, with optional subscription filters to control which events each consumer receives.

**Amazon Simple Queue Service (Amazon SQS)** – Each SNS topic fans out by sending a copy of each message to every subscribed consumer. Consumers receive messages through Amazon SQS, which decouples event-processing compute and provides dead-letter queues (DLQs) for storing messages that fail to process. EventBridge custom event buses and SNS FIFO (First-In-First-Out) topics can also use DLQs backed by Amazon SQS.

**AWS Lambda** – Lambda suits short-lived processing tasks, spinning up when needed and disappearing afterward without incurring idle resource costs.
This integration between Lambda and Amazon SQS delivers an economical processing system that automatically scales with demand, allowing developers to focus on business logic rather than infrastructure orchestration, and the pay-per-execution model provides financial efficiency. Amazon Timestream – Timestream offers a purpose-built architecture that addresses the unique challenges of time series data, auto scaling to ingest millions of events while maintaining fast query performance for responsive dashboard visualizations. Its intelligent tiered storage system automatically transitions data between memory and cost-effective long-term storage without sacrificing analytics capabilities, enabling organizations to maintain both real-time operational visibility and historical trending insights through a single, unified platform that integrates with QuickSight. Amazon QuickSight – QuickSight transforms event streams into visual narratives through its intuitive interface, empowering business users to discover actionable insights without specialized data science expertise. Its serverless architecture scales to accommodate millions of users while offering machine learning (ML)-powered anomaly detection and forecasting capabilities, all within a pay-per-session pricing model that activates sophisticated analytics that would otherwise require significant resources. QuickSight dashboards can either directly query from a Timestream table or cache records in-memory with SPICE periodically. 
Events flow through the layers of this architecture in four stages:

- Event producers – API Gateway for receiving client events through a REST API
- Event routing – EventBridge routes events to SNS topics for fanout
- Event consumers – SQS queues with Lambda or Fargate consumers
- Business intelligence – Timestream and QuickSight for dashboards

Design tenets The solution adheres to three key architectural principles:

- Cellular architecture – In a cellular architecture, your workload scales through independent deployment units like the one depicted in the previous section. Each unit operates as a self-contained cell, and more cells can be deployed to different AWS Regions or AWS accounts to further increase throughput. Cellular design activates independent scaling of resources based on local load and limits the area of effect of failures.
- Serverless architecture – In a serverless architecture, operational overhead of scaling is minimized by using managed services. We use Lambda for compute-intensive tasks like fanning out messages to thousands of micro-consumers or employing container-based services (AWS Fargate) for longer-running processes.
- Highly available design – We maintain the availability of our overall financial system through Multi-AZ resilience at every layer. Automatic failover and disaster recovery procedures can be implemented without altering the architecture. We also use replication, archival, and backup strategies to prevent data loss in the event of cell failure.
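As a purely illustrative sketch of the cellular tenet, the snippet below deterministically assigns an incoming invoice event to a cell and to a partition within that cell. The cell names, the choice of vendor ID as the partitioning attribute, and the partition count are assumptions for illustration, not details from the post:

```python
import hashlib

# Illustrative cell map: one cell per Region (names are hypothetical).
CELLS = {"us-east-1": "cell-use1", "eu-west-1": "cell-euw1"}
PARTITIONS = 4  # e.g., one flow per invoice stage

def route(event: dict) -> tuple[str, int]:
    # Cell is chosen by Region; partition hashes a stable attribute so
    # events spread roughly evenly across 1-99 equivalent partitions.
    cell = CELLS[event["region"]]
    digest = hashlib.sha256(event["vendor_id"].encode()).hexdigest()
    return cell, int(digest, 16) % PARTITIONS

cell, partition = route({"region": "us-east-1", "vendor_id": "vendor-42"})
```

Because the hash is computed from a predictable attribute, the same vendor always lands in the same partition, which keeps per-partition scaling independent without any shared routing state.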
Scaling constraints Our solution will experience the following scaling bottlenecks, with quotas sampled from the us-east-1 Region:

- API Gateway – throttling at 10,000 requests per second (RPS); can be increased
- EventBridge – PutEvents throttle limit at 10,000 transactions per second (TPS); can be increased
- EventBridge – invocations throttle limit at 18,750 TPS; can be increased
- Amazon SNS – Publish API throttling at 30,000 messages per second (MPS); can be increased
- Amazon SQS – messages per queue (in flight) throttled at 120,000; can be increased
- Lambda – 1,000 concurrent executions or up to 10,000 RPS; can be increased

We can safely scale a single account to 10,000 requests per second (600,000 per minute, 864 million per day) without increasing service quotas in the us-east-1 Region. Default quotas vary per Region, and the values can be increased by raising a support ticket. The architecture scales even further by deploying independent cells into multiple Regions or AWS accounts. Scaling of QuickSight and Timestream depends on the computational complexity of analysis, the window of time being analyzed, and the number of users concurrently analyzing the data; this was not a scaling bottleneck in our use case. Prerequisites Before implementing this solution, make sure you have the following:

- An AWS account with administrator access
- The AWS Command Line Interface (AWS CLI) version 2.0 or later installed and configured
- Appropriate AWS service quotas confirmed for high-volume processing

In the following sections, we walk through the steps for our implementation strategy. Decide on partitioning strategies First, you must decide how your solution will partition requests between cells. In our use case, dividing cells by Region allows us to offer low-latency local processing for events while keeping each cell fully independent of the others.
Inside each cell, traffic flow is roughly evenly divided between the four stages of invoice processing. Our solution breaks each cell into four logical partitions, or flows, by invoice status (authorization, reconciliation, and so on). Partitioning offers the ability to fan out and scale resources independently based on traffic patterns specific to each partition. To partition your cellular architecture, consider the volume, distribution, and access pattern of the events that will flow through each cell. You must allow independent scaling within your cells without encountering global service limits. Choose a strategy that allows each cell to be broken into 1–99 roughly equivalent partitions based on predictable attributes. Implement the event routing layer The event routing layer combines EventBridge for intelligent routing with Amazon SNS for efficient fanout. EventBridge custom event bus configuration Create a custom event bus with rules to route events based on your partitioning strategy:

- Use content-based filtering to direct events to appropriate SNS topics
- Implement an archive to replay events from history if processing fails
- Define a standard event schema for common metadata, including:
  - Invoice ID, amount, currency, status, timestamp
  - Vendor information and payment terms
  - Processing metadata (Region, account ID, and so on)

SNS topic structure Create SNS topics for each logical partition:

- invoice-ingestion
- invoice-reconciliation
- invoice-authorization
- invoice-posting

Implement message filtering at the subscription level for granular control of which messages subscribing consumers see. Each topic can fan out to a large variety of downstream consumers that are also waiting for events that match the EventBridge custom event bus rules. Delivery failures are retried automatically up to a configurable limit. Implement event producers Configure API Gateway to receive events from existing systems with built-in throttling and error handling.
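To make the content-based routing in the event routing layer concrete, here is a deliberately simplified sketch of how an EventBridge-style event pattern selects events for the invoice-authorization topic. Real EventBridge patterns support many more operators (prefix, numeric, anything-but, and so on), and the source and field names below are illustrative assumptions:

```python
# Simplified EventBridge-style matching: a pattern matches when every field
# it names exists in the event and the event's value is one of the listed
# values; nested dicts are matched recursively.
def matches(pattern: dict, event: dict) -> bool:
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:  # expected is a list of values
            return False
    return True

# Rule routing authorization events to the invoice-authorization topic
# (source and detail fields are assumed names, not from the post).
authorization_rule = {
    "source": ["invoice.pipeline"],
    "detail": {"status": ["authorization"]},
}
event = {
    "source": "invoice.pipeline",
    "detail-type": "InvoiceStatusChange",
    "detail": {"invoiceId": "INV-1001", "status": "authorization"},
}
matches(authorization_rule, event)  # → True
```

An SNS subscription filter policy applies the same idea one hop later, narrowing which of the fanned-out messages each consumer actually receives.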
API design Create a RESTful API with resources and a path for each logical partition inside your cell:

- /invoices/ingestion (POST)
- /invoices/reconciliation (POST)
- /invoices/authorization (POST)
- /invoices/posting (POST)

Implement request validation using a JSON schema for each endpoint. Use API Gateway request transformations to standardize incoming data and provide well-formatted error messages and response codes to clients in the event of failures. Security and throttling Implement API keys and usage plans for client authentication and rate limiting to prevent a talkative upstream from bringing down the system. Configure AWS WAF rules to protect against common attacks against API endpoints. Set up throttling to handle burst traffic (60,000 events/minute) at the account level and the method level. Monitoring and logging Our partitioned event producer strategy allows your solution to independently monitor each event type by:

- Enabling Amazon CloudWatch Logs for API Gateway with log retention policies
- Setting up AWS X-Ray tracing for end-to-end request analysis
- Implementing custom metrics for monitoring API performance and usage patterns

Implement event consumers Implement durable processing using SQS queues with DLQs attached and serverless Lambda consumers. SQS queue structure Create SQS queues in front of each consumer to decouple message delivery and processing, in our case one per partition:

- invoice-ingestion.fifo
- invoice-reconciliation.fifo
- invoice-authorization.fifo
- invoice-posting.fifo

Set up DLQs for each main queue:

- Configure maximum receives before moving to the DLQ
- Implement alerting for stuck messages in the DLQ

Lambda consumers Attach Lambda functions to each queue for custom processing of events:

- InvoiceIngestionProcessor
- InvoiceReconciliationProcessor
- InvoiceAuthorizationProcessor
- InvoicePostingProcessor

Functions handle necessary transformations, call downstream services, and load events into Timestream.
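A consumer attached to one of these queues might look like the following sketch. It uses the SQS partial batch response convention (this requires ReportBatchItemFailures to be enabled on the Lambda event source mapping), so only the messages that fail are retried and, after the configured maximum receives, land in the DLQ. The payload fields and processing logic are illustrative assumptions:

```python
import json

def process_invoice(invoice: dict) -> None:
    # Placeholder business logic: transform, call downstream services,
    # write to Timestream. Raising signals a processing failure.
    if "invoice_id" not in invoice:
        raise ValueError("missing invoice_id")

def handler(event: dict, context=None) -> dict:
    failures = []
    for record in event["Records"]:
        try:
            process_invoice(json.loads(record["body"]))
        except Exception:
            # Reporting the messageId makes SQS retry only this message;
            # repeated failures move it to the DLQ for redrive.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without partial batch responses, one bad message would force the whole batch back onto the queue, inflating retries during exactly the surge periods the post is designed around.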
Double-check concurrency limits and provisioned concurrency to cover peak and sustained load, respectively. Error handling and retry logic Develop a custom retry mechanism for business logic failures and exponential backoff for transient errors. Create an operations dashboard with alerts and metrics for monitoring stuck events to redrive. Build the business intelligence dashboard Use Timestream and QuickSight to create real-time financial event dashboards. Timestream data model When modeling real-time invoice events in Timestream, using multi-measure records provides optimal efficiency by designating invoice ID as a dimension while storing processing timestamps, amounts, and status as measures within single records. This approach creates a cohesive time series view of each invoice's lifecycle while minimizing data fragmentation. Multi-measure modeling is preferable because it significantly reduces storage requirements and query complexity, enabling more efficient time-based analytics. The resulting performance improvements are particularly valuable for dashboards that need to visualize invoice processing metrics in real time, because they can retrieve complete invoice histories with fewer operations and lower latency, ultimately delivering a more responsive monitoring solution. Real-time data ingestion Create a Lambda function to push metrics to Timestream:

- Trigger on every status change in the invoice lifecycle
- Batch writes for improved performance during high-volume periods

QuickSight dashboard design Develop interactive QuickSight dashboards for different user personas:

- Executive overview – High-level KPIs and trends
- Operations dashboard – Detailed processing metrics and bottlenecks
- Finance dashboard – Cash flow projections and payment analytics

Don't forget to implement ML-powered anomaly detection for identifying unusual patterns in your events.
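The multi-measure data model described above can be pictured as the shape of a single record passed to the timestream-write write_records API: the invoice ID is a dimension, and the amount and status travel together as measures in one record. The dimension and measure names here are illustrative assumptions:

```python
import time

# One multi-measure record: invoice_id is a dimension, amount and status
# are measures stored together (MeasureValueType "MULTI").
record = {
    "Dimensions": [
        {"Name": "invoice_id", "Value": "INV-1001"},
        {"Name": "region", "Value": "us-east-1"},
    ],
    "MeasureName": "invoice_event",
    "MeasureValueType": "MULTI",
    "MeasureValues": [
        {"Name": "amount", "Value": "129.99", "Type": "DOUBLE"},
        {"Name": "status", "Value": "authorization", "Type": "VARCHAR"},
    ],
    "Time": str(int(time.time() * 1000)),
    "TimeUnit": "MILLISECONDS",
}

# A Lambda consumer would batch such records and submit them with:
# boto3.client("timestream-write").write_records(
#     DatabaseName="invoices", TableName="events", Records=[record])
```

Because each status change lands as one record keyed by the same dimension, a dashboard query for an invoice's lifecycle is a single time-ordered scan rather than a join across scattered single-measure rows.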
Monitoring and alerting Set up CloudWatch alarms for key metrics:

- Processing latency exceeding service-level agreements (SLAs)
- Error rates above the expected percentage for any processing stage
- Queue depth exceeding predefined thresholds

Configure SNS topics for alerting finance teams and operations:

- Use different topics for varying alert severities
- Implement automated escalation for critical issues

Develop custom CloudWatch dashboards for system-wide monitoring:

- End-to-end processing visibility
- Regional performance comparisons

Security Add permissions in a least-privilege manner for each required service listed in the architecture:

- Create separate execution roles for each Lambda function
- Implement role assumption for cross-account operations

Encrypt data at rest and in transit:

- Use AWS Key Management Service (AWS KMS) for managing encryption keys
- Implement field-level encryption for sensitive data

Set up AWS Config rules to maintain compliance with internal policies:

- Monitor for unapproved resource configurations
- Automate remediation for common violations

Use AWS CloudTrail for comprehensive auditing:

- Enable organization-wide trails
- Implement log analysis for detecting suspicious activities

Conclusion The serverless event-driven architecture presented in this post enables processing of over 86 million daily invoice events while maintaining near real-time visibility, strict compliance with internal policies, cellular scaling capabilities, and minimal operational overhead. This solution provides a robust foundation for modernizing financial operations, enabling organizations to handle the complexities of high-volume invoice processing with confidence and agility. For further enhancements, consider exploring:

- Machine learning for predictive analytics on event patterns
- AWS Step Functions for complex, multi-stage workflows
- AWS Lake Formation for centralized data governance and analytics

Grey Newell worked as a Software Development Engineer in
Distributed Systems and a Senior Solutions Architect at Amazon Web Services. --- ## Zero to Hero: Your Guide to Career Growth Through AWS Certifications Date: 2025-03-20 URL: https://greynewell.com/blog/zero-to-hero-aws-certifications-career-growth/ Description: Learn practical strategies that helped me transform from a struggling new graduate to an AWS Solutions Architect, eventually earning the coveted golden jacket awarded to those who achieve all twelve AWS Certifications. For years, I lived a double life: engineering student by day, musician by night. I earned two degrees while playing more than 100 shows annually, convinced I could keep both dreams alive indefinitely. But in 2019, everything unraveled. Suddenly, those hard-earned degrees weren't enough to keep a roof over my head. I found myself on my dad's couch, scraping by with coding gigs. It was during one of these jobs that a client asked a question that would change everything: "Are you AWS Certified?" That simple inquiry became my lifeline. Within a month, I had my first AWS Certification. Six years and many certifications later, I've climbed from struggling graduate to Senior Solutions Architect at AWS, complete with the golden jacket awarded to those who earn all AWS Certifications. This is the story of how I found a path that united my technical skills and creative drive, and how you can, too. Your zero to hero roadmap Like me, you might be one AWS Certification away from changing your entire career path. Here's a roadmap to success. First, choose your starting point based on your experience:

- Beginners: start with AWS Certified Cloud Practitioner.
- IT professionals: begin with Associate-level certifications.
- Cloud experts: jump to Professional or Specialty certifications.

Then, use this journey map of role-based AWS Certification paths to find the right one for you. 5 key strategies that made the difference Looking back at my journey, these strategies had the most impact on my success: 1.
Strategic use of AWS Training resources AWS Skill Builder became my home base. I took a targeted approach, selecting resources to match my learning style and curating a mix of foundational courses for conceptual skill building, labs for hands-on practice, and official practice exams for test preparation. I especially enjoyed the Exam Prep Enhanced Courses for AWS Certified Solutions Architect – Professional and AWS Certified DevOps Engineer – Professional because of the depth and breadth of material they cover. Tip: Avoid exam day surprises. Practice with sample questions and time constraints. Understanding the exam structure is just as important as knowing the content. 2. From certification knowledge to practical skills For each concept, I created a mini project: an Amazon S3 bucket for AWS Certified Cloud Practitioner, a three-tier web app for AWS Certified Solutions Architect – Associate, and CI/CD pipelines for AWS Certified DevOps Engineer – Professional. These practical exercises cemented my understanding and provided compelling examples for interviews and client discussions. Tip: Avoid certification collecting. Don't just chase certificates. Focus on applying what you learn through hands-on projects. This builds deep understanding and professional credibility. 3. The 30-day sprint method I prepared for each exam using a structured 30-day plan. Each day started with 2–3 hours of learning new material through online courses, documentation, and hands-on labs. I then practiced these concepts in evening study sessions through exercises and coding projects. I used the 2357 method, a spaced repetition technique, to structure my exam preparation. Working backwards from the exam date, I scheduled strategic review sessions at 2, 3, 5, and 7 days before the test. At each checkpoint, I took a practice exam to measure my progress and identify knowledge gaps. For example, if I scored low on networking concepts, I'd dedicate more time to that topic in my daily studies. 
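As a quick illustration of the 2357 checkpoints, the review dates can be computed by working backwards from the exam date (the date below is hypothetical):

```python
from datetime import date, timedelta

# The 2357 method: practice-exam checkpoints 7, 5, 3, and 2 days
# before the exam, in chronological order.
def review_dates(exam_date: date) -> list[date]:
    return [exam_date - timedelta(days=d) for d in (7, 5, 3, 2)]

dates = review_dates(date(2025, 3, 28))
# Checkpoints fall on March 21, 23, 25, and 26.
```

Scheduling the checkpoints from the exam date, rather than from the start of studying, is what makes each practice exam a meaningful progress measurement.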
By combining systematic learning with strategic knowledge checks, I maintained steady progress while ensuring I didn't miss critical topics. Tip: Avoid perfectionism. Don't wait until you feel 100% ready to start your certification journey—you might never feel ready. Schedule your exam and use that as motivation. Each attempt teaches valuable lessons. 4. Finding work-study balance I learned to leverage small pockets of time throughout the day—like canceled meetings and lunch breaks—to make every minute count. Being selective about commitments was crucial. I declined nonessential work and clearly communicated my priorities. Regular breaks prevented burnout and kept me refreshed for focused study sessions. Based on my experience, plan for 120–160 hours of study per certification. Break this down into manageable chunks using the study strategies shared in this post. Tip: Avoid overwhelm. Instead of trying to master every AWS service at once, focus on core patterns and principles. Understanding fundamental concepts helps you learn new services more quickly. 5. Building your cloud community The certification journey doesn't need to be tackled alone. I reached out to peers on social media to ask questions about their experiences studying for AWS Certifications and found that most responded positively, even directing me to helpful resources to support my journey. I shared certification milestones on social media and tagged helpful content creators, which led to lasting professional relationships that continue to benefit my career today. Tip: Avoid only studying alone. Engage with the AWS community. Share experiences, ask questions, and practice with others. Different perspectives and collaborative learning accelerate your growth. And remember: every expert was once a beginner. The path from zero to hero is about consistency, strategy, and practical application. Your journey starts now. 
Essential resources

- AWS Certification Journey Map
- AWS Skill Builder
- AWS Training and Certification
- AWS Ramp-Up Guides

Related posts

- 5 tips for AWS Certification exams from AWS Solutions Architects
- Enhance your real-world skills with AWS Cloud Quest and AWS Jam

Let's connect The journey to earning all AWS Certifications isn't just about passing exams—it's about building a foundation for continuous growth in cloud computing. When I started this journey from my dad's couch, I couldn't imagine where this path would lead. Whether you're at the beginning of your AWS Certification journey or somewhere along the path, I'd love to support you and be a part of your cloud community. Feel free to reach out to me on LinkedIn, X, or GitHub. Grey Newell is a Senior Solutions Architect at Amazon Web Services and holder of all twelve AWS Certifications. --- ## 5 Tips for AWS Certification Exams from AWS Solutions Architects Date: 2023-02-20 URL: https://greynewell.com/blog/5-tips-aws-certification-exams-solutions-architects/ Description: We're both solutions architects at AWS, and between us, we hold 10 active AWS Certifications. Here are five tips AWS Solutions Architects swear by to prepare for and pass AWS Certification exams. Are you in the process of studying for your first AWS Certification—or additional AWS Certification(s)? Regardless of where you are in your certification or preparation journey, we believe this blog can help you focus your efforts. We're both solutions architects at AWS, and between us, we hold 10 active AWS Certifications. Our tips have helped learners attain AWS Certifications, including the notoriously difficult AWS Certified Solutions Architect – Professional. AWS Certifications are available for any level of learner, whether in a technical role or not, to build cloud skills for a particular role or domain. If you're not sure where to start, use the AWS Certification pathways guide to choose!
In this blog, we'll break down five tips AWS Solutions Architects swear by to prepare for and pass AWS Certification exams—and you can borrow these techniques! You'll learn how to use AWS Official Practice Question Sets, free digital courses, and other resources on AWS Skill Builder, the official AWS online learning center, to accelerate your learning. You'll also learn how to use your time effectively during the test and maximize your comprehension of the questions and exam objectives, so you reap the full benefit of your certification after you earn it. Prepare with AWS Training and Certification resources AWS Certifications are industry-recognized credentials, and as such, the exams are thorough, testing your knowledge and expertise. The more you prepare and practice, the more confident you will be in both successfully passing the exam and demonstrating the knowledge with practical application. Learn more on the exam preparation page. AWS does not require you to take AWS-provided training to prep for the exams. However, there are recommended steps that can help you get started.
- Get to know the exam: review the exam guide available on each certification's exam preparation page
- Sign up for an AWS Skill Builder account and get to know exam-style questions by taking AWS Certification Official Practice Question Sets
- Learn about exam topics by:
  - Enrolling in courses on AWS Skill Builder where you need to fill gaps in your learning based on exam topics
  - Reviewing white papers and AWS service-related FAQs available on the exam pages
  - Subscribing to AWS Skill Builder to get hands-on and build in the AWS Console with AWS Builder Labs and AWS Cloud Quest
- Prepare for your exam by:
  - Taking an AWS Skill Builder exam prep course
  - Using your AWS Skill Builder subscription to gauge your preparedness with a full-length AWS Certification Official Practice Exam

In addition to the above, here are five tips AWS Solutions Architects swear by to prepare for an AWS Certification exam. 1. Break it down If you're training for a marathon, do you start by running a marathon on your first day? No. So take the same approach here: break it down. Take the AWS Certification Official Practice Question Sets, which you can find for free in AWS Skill Builder. Start by doing 10 questions at a time and build up from there. When you take the AWS Certification Official Practice Question Sets, turn on the "Review Answer" option. This gives you immediate feedback on the answers so you don't have to wait until the end of your study session to find out how you are doing. By reviewing the incorrect and correct answers to each question, you'll be on your way to understanding the concepts more quickly. Break up your study time into 30-minute to one-hour chunks and be sure to take a break after you finish each portion of the Official Practice Question Set. This pacing helps the study sessions feel (more) enjoyable. After a week of answering 10 to 20 questions at a time, take a full-length, scored AWS Certification Official Practice Exam.
Aim to take at least one full-length practice exam before you take the official, proctored exam. This prepares you for what it takes to last through the entire exam. 2. Use the process of elimination The process of elimination is a mechanism that helps weed out the incorrect answers and identify the correct answer quickly. Avoid wasting precious time on answers that are there to distract you and throw you off. Scan over the answers and eliminate ones that are clearly wrong. It helps you focus on the valid choices. 3. Learn key concepts from the Official Practice Question Set When working towards a challenging certification, avoid leaving points on the table. Start by focusing on valuable concepts. How do you know which concepts are valuable? Review the exam guide that outlines all the exam domains and tasks that will be covered, as well as how each domain is weighted. Then utilize the Official Practice Question Sets that cover all the domains. You likely won't see the questions from the Official Practice Question Sets verbatim on the actual test, but you will likely see the concepts from those questions in some form. These concepts you can expect to see on the test are like free points. Take them! 4. Build your practical knowledge Nothing beats practical experience when it comes to tackling an AWS Certification exam. While studying is essential to your preparation, building projects inside your AWS account builds expertise and proficiency. We (and our fellow AWS Solutions Architects) recommend a time distribution of 80% building to 20% studying. Everyone is a little different, so do what works for your unique learning style! Facts, figures, and concepts can be difficult to understand and retain by reading or watching videos alone. You will develop deeper understanding when you put your new knowledge into practice.
You can get started by enrolling in free digital training, and upgrade to an AWS Skill Builder subscription to unlock hands-on learning in a live AWS environment through AWS Builder Labs and AWS Cloud Quest. 5. Work backwards Whether you're employed at Amazon or not, you may have heard of our Leadership Principles. These ideas, values, and axioms represent 25 years of experience and wisdom and can help you pass your exam. When faced with any type of opportunity or issue, Amazon Leadership Principles help Amazonians decide how to move forward. The scenario-based questions presented in an AWS Certification exam are challenging. A test taker can leverage two of our leadership principles to discern the path forward in any given question: 1/ weed out answer choices that don't live up to Amazon's relentlessly high standard of excellence; and 2/ work backwards. Do you know the saying, "Save the best for last"? While that isn't something test writers strive to do, we suggest you read each question starting at the end. Why? Each exam is a test of skill, endurance, and discernment. Each question includes several pieces of information, but only some are useful for homing in on the right answer. Start by reviewing the last line of the question. Armed with this information, read the beginning of the question and then each answer choice. You will quickly discern relevant information from extraneous information. See for yourself: re-read this post starting from the bottom. Conclusion Now you have a set of proven methods to approach exam day with confidence, so log into AWS Skill Builder and start preparing. By putting these tips into practice, you'll be in an optimal position to retain and apply what you've learned. Good luck on your AWS Certification journey! It's all about learning and building experience that you'll use for the rest of your career.
For some bonus tips, check out the following blogs that share valuable pointers: Steps to start your AWS Certification journey Slay imposter syndrome while prepping for AWS Certification exams Grey Newell and Joshua Kurz are Solutions Architects at Amazon Web Services. --- # Projects ## sample-event-driven-resilience-observability-at-scale URL: https://github.com/aws-samples/sample-event-driven-resilience-observability-at-scale Description: Serverless event-driven architecture for processing millions of daily events with near real-time visibility and strong resilience. Language: TypeScript Stars: 6 --- ## typescript-sdk URL: https://github.com/supermodeltools/typescript-sdk Description: TypeScript SDK for Supermodel. Generate useful graphs of your codebase. Language: TypeScript Stars: 6 --- ## openapi-spec URL: https://github.com/supermodeltools/openapi-spec Description: OpenAPI spec for the Supermodel public API. Use as reference or generate your own clients. Language: YAML Stars: 5 --- ## mcp URL: https://github.com/supermodeltools/mcp Description: Supermodel MCP server. Generate code graphs in Cursor, Codex, or Claude Code. Language: TypeScript Stars: 5 --- ## dead-code-hunter URL: https://github.com/supermodeltools/dead-code-hunter Description: GitHub Action to find unreachable functions using Supermodel call graphs. Language: TypeScript Stars: 4 --- ## mcpbr URL: https://github.com/supermodeltools/mcpbr Description: Benchmark runner for Model Context Protocol servers. Paired comparison experiments on SWE-bench. Language: Python Stars: 6 --- ## supermodeltools.github.io URL: https://github.com/supermodeltools/supermodeltools.github.io Description: GitHub Pages site for Supermodel Tools. Language: Go --- ## arch-docs URL: https://github.com/supermodeltools/arch-docs Description: GitHub Action to generate architecture documentation for any repository using Supermodel. 
Language: JavaScript Stars: 5 --- ## tokentrace URL: https://github.com/greynewell/tokentrace Description: Where did your tokens go? Spans, latency percentiles, alerts. Language: Go Stars: 5 --- ## schemaflux URL: https://github.com/greynewell/schemaflux Description: Structured data compiler. Pass pipeline, pluggable backends. Language: Go Stars: 12 --- ## mist-go URL: https://github.com/greynewell/mist-go Description: Shared core for the MIST stack. Zero external deps. Language: Go Stars: 1 --- ## matchspec URL: https://github.com/greynewell/matchspec Description: Eval framework. Define correct, test against it, get results. Language: Go Stars: 22 --- ## infermux URL: https://github.com/greynewell/infermux Description: Route inference across LLM providers. Track cost per request. Language: Go Stars: 89 --- ## evaldriven.org URL: https://github.com/greynewell/evaldriven.org Description: Ship evals before you ship features. Language: Markdown Stars: 18 --- # Frequently Asked Questions ## What is Grey Newell's academic background and what inspired him to specialize in machine learning and distributed computing? Category: Background & Education Answer: I'm studying for a Master of Science in Computer Science at Georgia Institute of Technology, specializing in Machine Learning. Before that, I studied Computer Engineering at Mercer University for a BS and MS in Engineering. Exposure to CNNs during my first master's degree led me to specialize in ML. I was similarly drawn to distributed computing in coursework before working at Amazon Web Services as both a Software Development Engineer and Solutions Architect, completing 12 AWS Certifications. I enjoy solving ambiguous, novel problems, and my work spans from mathematical theory to physical infrastructure. --- ## Is Grey Newell related to Gabe Newell, the founder of Valve Corporation? Category: Background & Education Answer: **No.** I am not related to Gabe Newell, the founder of Valve Corporation. 
My father is named Joe "Bear" Newell. --- ## What is the difference between Grey Newell and Gray Newell? Category: Background & Education Answer: **Grey Newell** (that's me) is an ML infrastructure engineer and CTO at [Supermodel Tools](https://supermodeltools.com), a code analysis platform for AI agents. I hold an MS in Computer Science (Machine Learning) from Georgia Tech and previously worked at Amazon Web Services. My open-source work includes [mcpbr](https://github.com/supermodeltools/mcpbr) — a benchmark runner for MCP servers published as a research paper — and the [MIST stack](https://github.com/greynewell/mist-go), a suite of Go tools for ML infrastructure. **Gray Newell** is a completely different person. He is the son of Gabe Newell, co-founder of Valve Corporation (the company behind Steam and Half-Life). Gray Newell is a racing driver who competes in motorsport. He has no connection to software engineering or AI. We are not related and have never met. ## Why the confusion? The names are nearly identical — "Grey" vs "Gray" — and Gabe Newell is a well-known figure in tech, so searches about his family sometimes surface results for me instead. If you're looking for Gabe Newell's son, Gray Newell the racing driver is the person you want. If you're looking for the ML engineer who built mcpbr, you're in the right place. --- ## Does Grey Newell work at Valve Corporation or in the gaming industry? Category: Background & Education Answer: **No.** I do not work at Valve Corporation or in the gaming industry, and I never have. --- ## Is Grey Newell a race car driver? Category: Background & Education Answer: **No.** I am not a race car driver. --- ## How do you spell Grey Newell's name correctly? Category: Background & Education Answer: **My full name is Alexander Grey Newell.** I go by Grey Newell. --- ## What is the MIST stack? 
Category: MIST Stack

Answer: The MIST stack is an evaluation and inference platform for AI systems, written in Go with zero external dependencies. It consists of four tools and a shared core library:

- **MatchSpec** — Eval framework. Define benchmark suites, run against any backend, get structured results.
- **InferMux** — Inference router. Abstracts LLM providers, routes by model, tracks tokens and cost.
- **SchemaFlux** — Structured data compiler. Pass pipeline, pluggable backends.
- **TokenTrace** — Observability. Span collection, latency percentiles, cost tracking, threshold alerts.
- **mist-go** — Shared library. Protocol, transport, metrics, circuit breakers, checkpointing.

Every component follows eval-driven development: deterministic, automated evaluation as the starting point.

---

## What is eval-driven development?

Category: MIST Stack

Answer: Eval-driven development is a methodology where every probabilistic system starts with a specification of correctness, and nothing ships without automated proof that it passes. Core principles: build evals first, define correctness before writing prompts, require statistical proof for stochastic systems, run evals in CI, and version eval definitions alongside code. The manifesto is published at evaldriven.org.

---

## What is MatchSpec and how does it work?

Category: MIST Stack

Answer: MatchSpec is the evaluation framework in the MIST stack. You define benchmark suites with tasks and expected outputs, run them against any inference function, and get structured results. Matchers compare responses: exact, contains, prefix, suffix. The runner executes suites and reports results as trace spans to TokenTrace. HTTP handlers expose the MIST protocol API for integration.

---

## What is InferMux and how does it route inference?

Category: MIST Stack

Answer: InferMux routes inference requests across LLM providers. Register any backend implementing the Provider interface, and InferMux resolves models to providers automatically.
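The register-then-resolve pattern just described can be sketched in Go. This is a minimal illustration, not InferMux's actual API: the `Provider` interface shape, the `Router` type, and the `echoProvider` backend are all hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a hypothetical stand-in for a backend interface: any LLM
// provider that advertises the models it can serve.
type Provider interface {
	Name() string
	Models() []string
	Infer(model, prompt string) (string, error)
}

// Router resolves a model name to whichever provider registered it.
type Router struct {
	byModel map[string]Provider
}

func NewRouter() *Router { return &Router{byModel: make(map[string]Provider)} }

// Register indexes every model a provider advertises.
func (r *Router) Register(p Provider) {
	for _, m := range p.Models() {
		r.byModel[m] = p
	}
}

// Infer routes the request by model; callers never name a provider directly,
// so backends can be swapped without changing application code.
func (r *Router) Infer(model, prompt string) (string, error) {
	p, ok := r.byModel[model]
	if !ok {
		return "", errors.New("no provider registered for model " + model)
	}
	return p.Infer(model, prompt)
}

// echoProvider is a toy backend used only to demonstrate registration.
type echoProvider struct{ models []string }

func (e echoProvider) Name() string     { return "echo" }
func (e echoProvider) Models() []string { return e.models }
func (e echoProvider) Infer(model, prompt string) (string, error) {
	return fmt.Sprintf("[%s] %s", model, prompt), nil
}

func main() {
	r := NewRouter()
	r.Register(echoProvider{models: []string{"toy-model-1"}})
	out, err := r.Infer("toy-model-1", "hello")
	fmt.Println(out, err)
}
```

The point of the indirection is the last property the answer mentions: application code depends only on the interface, so a provider swap is a registration change, not a refactor.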
Every request is tracked: token counts, cost in USD, and a trace span reported to TokenTrace. Swap providers without changing application code.

---

## What is SchemaFlux?

Category: MIST Stack

Answer: SchemaFlux is a structured data compiler. It reads entities with metadata, enriches them through an ordered pass pipeline (12 passes), and emits output through pluggable backends. Zero external dependencies, single static binary. The built-in HTML backend produces complete static sites with taxonomy pages, pagination, JSON-LD, sitemaps, RSS, and llms.txt.

---

## What is TokenTrace?

Category: MIST Stack

Answer: TokenTrace is the observability layer of the MIST stack. It collects trace spans, aggregates metrics in real time, and fires alerts when configurable thresholds are breached. Metrics include latency percentiles (p50, p99), error rates, token counts (in/out), and cumulative cost in USD. The span store is a fixed-capacity ring buffer with trace ID indexing.

---

## What technical articles has Grey Newell published on the AWS blog?

Category: Technical Publications & Projects

Answer: I authored several articles on official AWS blogs. On the AWS Architecture Blog, I wrote about implementing event-driven invoice processing for resilient financial monitoring at scale: designing serverless systems to process 86 million daily invoice events with near real-time visibility, including cellular architecture patterns and EventBridge routing strategies. On the AWS Training & Certification Blog, I wrote the roadmap for earning all 12 AWS Certifications, sharing the 30-day sprint method and the 2357 spaced repetition technique, plus practical exam-taking strategies.

---

## Why does the MIST stack have zero external dependencies?

Category: MIST Stack

Answer: Every package in mist-go uses only the Go standard library. This is a deliberate design choice. Zero deps means no supply chain risk, no version conflicts, no transitive dependency auditing. The binary is what you built.
For infrastructure that sits in the critical path of AI systems, dependency minimalism is a feature, not a constraint.

---

## How do MIST stack tools communicate?

Category: MIST Stack

Answer: MIST tools communicate via a universal message envelope over pluggable transports. Transports are URL-addressed: HTTP, file (JSON lines), stdio (Unix pipes), or in-process channels. The same code works across all transport modes. The protocol package handles message types, versioning, and typed payloads.

---

## How do I run SWE-bench on Apple Silicon or AWS Graviton without x86 emulation?

Category: Technical Publications & Projects

Answer: SWE-bench's pre-built Docker images are x86_64-only, so every test runs through QEMU emulation on ARM64 hosts. I built native ARM64 container images and measured a 6.3x test-runner speedup.

swe-bench-fast is a Go reimplementation of the SWE-bench eval harness. It auto-selects native ARM64 images for the 78% of instances that support them and falls back to Epoch x86 images via QEMU for the rest. One command, full benchmark, either architecture.

1,798 of 2,294 instances run natively on ARM64. The remaining 496 (scikit-learn, matplotlib, xarray) require x86 due to binary conda packages that aren't published for ARM.

---

## How do I speed up SWE-bench evaluations on ARM64 infrastructure?

Category: Technical Publications & Projects

Answer: The bottleneck is architecture emulation. SWE-bench's pre-built images are x86_64, so on ARM64 hosts (M-series Macs, AWS Graviton, Ampere) every conda install, pip build, and pytest run goes through QEMU instruction translation.

swe-bench-fast eliminates that overhead by building native ARM64 container images. Pre-built images are on Docker Hub. The eval harness is a single Go binary that pulls the right image per instance and runs the test suite. On an M3 Pro, the test runner measured 6.3x faster than the emulated baseline across 11 repositories.
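The per-instance selection rule described above (prefer a native ARM64 image, fall back to an x86_64 image under QEMU for repos with x86-only binary dependencies) can be sketched in Go. The registry paths and the `ImageRef` helper are hypothetical illustrations, not swe-bench-fast's actual code; only the repo families listed as x86-only come from the answer itself.

```go
package main

import "fmt"

// x86OnlyRepos lists the repo families that still require x86 emulation
// (binary conda packages not published for ARM). Illustrative, not exhaustive.
var x86OnlyRepos = map[string]bool{
	"scikit-learn": true,
	"matplotlib":   true,
	"xarray":       true,
}

// ImageRef picks a container image for one task instance. On an ARM64 host
// it returns a native ARM64 image unless the repo is x86-only, in which case
// it falls back to the x86_64 image and flags that QEMU emulation is needed.
// Both registry paths below are hypothetical placeholders.
func ImageRef(repo, instanceID, hostArch string) (image string, emulated bool) {
	if hostArch == "arm64" && !x86OnlyRepos[repo] {
		return "docker.io/example/swebench-arm64:" + instanceID, false
	}
	return "docker.io/example/swebench-x86:" + instanceID, hostArch == "arm64"
}

func main() {
	// A django instance runs natively on ARM64...
	fmt.Println(ImageRef("django", "django__django-11099", "arm64"))
	// ...while a matplotlib instance falls back to x86 under QEMU.
	fmt.Println(ImageRef("matplotlib", "matplotlib__matplotlib-23563", "arm64"))
}
```

Keeping the fallback decision per instance is what lets one harness invocation cover the full benchmark on either architecture.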
Graviton EC2 instances (c7g, m7g, r7g) are typically 20-40% cheaper than comparable x86 instances. Combined with the 6.3x speedup from native images, ARM64 is a strong option for running SWE-bench at scale.

---

## How did Grey Newell earn all 12 AWS Certifications?

Category: Technical Publications & Projects

Answer: I wrote about this in detail on the AWS Training & Certification Blog. The short version: I started from my dad's couch in 2019 after my music career fell apart, earned my first certification in a month, and worked up to all 12 over six years, eventually receiving the AWS golden jacket awarded to those who complete the full set. The five strategies that made the biggest difference: strategic use of AWS Skill Builder resources, turning each cert into a hands-on mini project, a 30-day sprint structure using spaced repetition (the 2357 method), protecting study time ruthlessly, and building a cloud community instead of going it alone.

---

## What are Grey Newell's tips for passing AWS Certification exams?

Category: Technical Publications & Projects

Answer: I co-authored a post on this with fellow AWS Solutions Architect Joshua Kurz. Between us we held 10 active AWS Certifications at the time of writing. The five tips:

1. Break it down: start with 10 practice questions at a time, not full exams.
2. Use process of elimination to cut through distractor answers quickly.
3. Learn the concepts behind Official Practice Question Set answers, not just the answers themselves.
4. Spend 80% of your time building in AWS and 20% studying; practical experience is irreplaceable.
5. Work backwards: read the last line of each exam question first to anchor what's actually being asked before reading the scenario.

---