NASA’s projects are huge, complex, and full of insights, but those insights are often buried in mountains of words. Sifting through it all to spot the patterns is slow work, and it’s easy to miss connections between projects. AI, on the other hand, can chew through massive amounts of text in minutes and start connecting the dots.
So, I went digging. I searched the NASA Technical Reports Server (NTRS) for “lessons learned” published since 2000, pulled the first 100 results, and randomly picked 30 — a mix of conference papers, journal articles, technical reports, and presentations.
I stitched them together into one mega PDF, ran it through optical character recognition in US English, and made sure that ChatGPT could read it (it promised me it could, but it would say that).
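That assembly step is easy enough to script if you ever want to repeat it. I actually did it with ordinary desktop tools, so treat the following as an illustrative sketch only: it assumes the 100 NTRS downloads are already sitting in a local folder (folder and file names are placeholders), uses pypdf for the merge, and uses ocrmypdf (a wrapper around Tesseract) to add the US English text layer.

```python
# Minimal sketch: randomly sample 30 of the downloaded NTRS PDFs,
# merge them into one binder, and add an OCR text layer.
# Assumes the 100 search results were already saved to ./ntrs_downloads/.
import random
from pathlib import Path

from pypdf import PdfWriter   # pip install pypdf
import ocrmypdf               # pip install ocrmypdf (requires Tesseract)

SOURCE_DIR = Path("ntrs_downloads")
SAMPLE_SIZE = 30

# Reproducible random pick of 30 documents from the first 100 results.
random.seed(42)
sample = random.sample(sorted(SOURCE_DIR.glob("*.pdf")), SAMPLE_SIZE)

# Stitch the sample into a single "mega PDF".
writer = PdfWriter()
for pdf in sample:
    writer.append(str(pdf))
with open("nasa_lessons_binder_raw.pdf", "wb") as f:
    writer.write(f)

# Run OCR in US English so every page has machine-readable text.
ocrmypdf.ocr(
    "nasa_lessons_binder_raw.pdf",
    "nasa_lessons_binder.pdf",
    language="eng",
    skip_text=True,  # leave pages that already have a text layer untouched
)
```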
From there, I set the AI loose on a structured meta-analysis, tackling one project management area at a time: stakeholder engagement, business cases, scope, schedule, costs, procurement, risk, teams, governance, delivery, documentation — plus any other gems it uncovered along the way.
The mission: see if AI could make sense of such a mixed bag of NASA project experiences and boil it all down to the key takeaways — what worked, and what could work better.
The original mega PDF can be found here: https://project.info/media/NASA-LESSONS-LEARNED-BINDER.pdf (Huge file - 247 MB)
The first draft of the AI analysis can be found here: https://project.info/media/NASA-Lessons-Learned-per-ChatGPT.pdf
More importantly, here are the lessons I learned about using AI for this type of research.
My Lessons Learned
The category-by-category analyses ChatGPT provided drew heavily on the following projects:
- ATLAS Beam Steering Mechanism (ICESat-2)
- Magnetospheric Multiscale (MMS) commissioning
- Challenger Accident Investigation
- Columbia Accident Investigation
- Selected Apollo sample handling / contamination control content
That means there were over 20 other projects and case studies in the PDF that weren’t referenced in the analysis.
When challenged about this, here’s what ChatGPT had to say:
From what I’ve seen in your NASA Lessons Learned binder, my analysis skewed toward the ATLAS BSM, MMS commissioning, and Challenger/Columbia cases for a few reasons:
- Information density – Those sections were unusually rich in clearly structured “what happened / why it mattered / what changed” narratives. They contained explicit details on processes, decisions, and outcomes that mapped neatly onto your project management categories (stakeholder engagement, scope, risk, governance, etc.).
- Clarity of cause-and-effect – The Challenger and Columbia accident investigations, in particular, spelled out causal links between decisions, organisational culture, and technical failures. That makes it easier to pull out “what worked well” vs. “what could be done better” without having to infer too much.
- Direct alignment with PMBOK-style areas – MMS commissioning and BSM development covered full project lifecycles: requirements, procurement, integration, verification, delivery, and lessons learned. Many other projects in the binder focus on narrower technical or scientific points that don’t map as cleanly to categories like business case development or team management.
- Document structure bias – The binder appears to be a compilation of varied source documents (conference papers, technical memos, accident reports). Some sections are brief abstracts or very specialised technical notes that offer limited context for broader project management insights. Naturally, I gravitated toward the more complete case studies that gave enough context to be useful.
- Risk of over-inference – For smaller or more technical project write-ups (like certain Mars Science Laboratory or Hubble components), the available detail might have forced me to make broader assumptions. I deliberately avoided doing that so the analysis stayed tightly evidence-based from the text.
That seemed fair enough, so I turned down its offer to include them more fully in another run.
Next, I asked ChatGPT what I could do to make my meta-analysis better in the future:
If you want your future “lessons learned” meta-analyses to be more rigorous — and avoid over-weighting the most detailed or well-structured case studies — the key is to control both the selection and the format of source material before it goes into the binder.
Here are some steps that would help:
1. Balance the dataset at the source
- Use a selection framework – Before downloading, define how many documents you want per project type, mission class, or timeframe. This prevents one well-documented project (like Challenger) from dominating simply because there’s more written about it.
- Stratified sampling – If the NASA Technical Reports Server gives 100 results, sort or categorise them (by mission type, date, mission size) and pull evenly from each category.
- Exclude pure technical notes unless you also capture contextual/project management data from the same mission. Otherwise, they’re hard to use for PM insights.
2. Pre-process for comparability
- Tag each document with metadata before combining — mission name, date, mission phase, document type (technical report, investigation, operations note, etc.), and relevant PM categories it likely addresses.
- Summarise each source to a consistent template before merging — e.g.,
- Project overview
- Phase covered
- Key decisions made
- Lessons learned
- Direct quotes
This makes the downstream AI or human review less likely to be swayed by length or narrative richness.
3. Address “length bias” head-on
- Longer, more detailed accident reports will naturally dominate unless you actively limit their contribution. You can:
- Cap the extract length per project so each has roughly the same footprint in the binder.
- Create a two-layer binder: one with full detail for deep dives, one with equal-weight summaries for comparison.
4. Ensure category coverage before analysis
- Create a coverage matrix with your project management categories (stakeholder engagement, risk, governance, etc.) along one axis and your projects along the other.
- As you gather documents, make sure each cell has something — even if it’s a small note — so analysis won’t default to the few cases that tick all the boxes.
5. Document sourcing decisions
- Keep a short sourcing log explaining why each document was chosen.
- Note any gaps — for example, “No procurement lessons found for Hubble SM4 in available reports” — so later you can decide whether to go back and fill them.
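To make that advice a little more concrete, here is a rough Python sketch of what points 1, 2 and 4 might look like in practice: tag each document with some basic metadata, sample evenly across document types, and check a mission-by-category coverage matrix for gaps before the analysis runs. The fields, thresholds, and example entries are all made up for illustration.

```python
# Sketch of suggestions 1, 2 and 4: tag documents with metadata,
# sample evenly across strata, and check category coverage up front.
from collections import defaultdict
from dataclasses import dataclass, field
import random

PM_CATEGORIES = ["stakeholder engagement", "scope", "schedule", "cost",
                 "procurement", "risk", "governance", "delivery"]

@dataclass
class SourceDoc:
    mission: str
    doc_type: str             # e.g. "conference paper", "accident report"
    year: int
    pm_categories: list = field(default_factory=list)  # categories it likely addresses

def stratified_sample(docs, per_type=5, seed=42):
    """Pull at most `per_type` documents from each document type."""
    random.seed(seed)
    by_type = defaultdict(list)
    for d in docs:
        by_type[d.doc_type].append(d)
    sample = []
    for group in by_type.values():
        sample.extend(random.sample(group, min(per_type, len(group))))
    return sample

def coverage_matrix(docs):
    """Missions x PM categories: flag cells with no supporting document."""
    missions = sorted({d.mission for d in docs})
    matrix = {m: {c: 0 for c in PM_CATEGORIES} for m in missions}
    for d in docs:
        for c in d.pm_categories:
            if c in matrix[d.mission]:
                matrix[d.mission][c] += 1
    gaps = [(m, c) for m in missions for c in PM_CATEGORIES if matrix[m][c] == 0]
    return matrix, gaps

# Example usage with made-up entries:
docs = [
    SourceDoc("MMS", "technical report", 2016, ["schedule", "risk"]),
    SourceDoc("ICESat-2 ATLAS", "conference paper", 2019, ["procurement", "scope"]),
]
balanced = stratified_sample(docs)
_, gaps = coverage_matrix(balanced)
print(f"{len(gaps)} mission/category cells still have no source material")
```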
My tip? Have a standardized way of reviewing projects that categorizes lessons learned for you. Here's one I prepared earlier...
Beyond that, here is the big takeaway for how to do this sort of work in the future.
The Need for a Human in the Loop
As we already know, AI can read thousands of pages faster than any human and surface patterns, themes, and recurring issues across projects. It’s especially good at:
- Sifting large volumes – rapidly processing long technical reports, presentations, and investigation findings.
- Spotting repeated language and concepts – even if they’re buried deep in different document types.
- Structuring outputs – neatly mapping findings into categories like stakeholder engagement, risk, or governance.
But AI also has structural weaknesses that make a human-in-the-loop essential for credible, balanced analysis:
- Selection bias sensitivity – AI’s output quality is directly tied to the balance and representativeness of the source material. If the dataset over-represents one project or mission type, the AI will naturally over-weight those lessons unless a human deliberately adjusts.
- Context gaps – AI is only as good as the context available in the text. If a document offers limited background, AI can’t reliably infer the broader mission circumstances without risking overinterpretation — something a human can catch and mitigate by adding missing context or flagging gaps.
- Nuance in causal relationships – Accident reports and technical memos often present complex cause-and-effect chains with cultural, technical, and organisational elements intertwined. AI can summarise these, but humans are better at judging whether those causal links are strong, weak, or contested.
- Source weighting and cross-checking – AI won’t intuitively weight one lesson as more significant than another based on mission-criticality, unless trained or prompted to do so. Humans can apply domain expertise to decide which findings have the greatest operational or strategic impact.
- Ethical and reputational sensitivity – Some lessons touch on human error, safety culture, or political factors. AI can summarise these plainly, but a human can ensure the language is accurate, fair, and fit for a professional audience.
In practice, the AI should be seen as a force multiplier for human analysts, not a replacement. The most effective workflow is:
- Human curates and balances the dataset.
- AI processes and structures the material.
- Human reviews, contextualises, and rebalances outputs for accuracy, fairness, and completeness.
This keeps the speed and breadth benefits of AI while preserving the judgment, context awareness, and ethical oversight that only a human can provide.
Lessons Learned for NASA
For what it’s worth, here are the top five lessons learned that ChatGPT highlighted from its meta-analysis of our random assortment of NASA reviews.
1. Protect Core Requirements Against Compromise
Relaxing mission-critical requirements to fit vendor limitations, cost pressures, or schedule targets often led to rework, delays, and additional cost (e.g., ATLAS BSM). Maintaining the integrity of core performance parameters is more efficient in the long run than accepting early compromises that undermine mission objectives.
2. Independent Safety and Oversight Must Have Real Authority
Both Challenger and Columbia show the danger when safety, quality, and technical authority functions are embedded within the same chain of command as program management. Oversight needs structural independence, resources, and escalation pathways that can’t be bypassed by schedule or budget pressure.
3. Vendor Capability Must Be Rigorously Verified Upfront
Assuming heritage hardware or prior experience guarantees readiness proved costly. Insufficient early technical due diligence on suppliers led to failed hardware, scope changes, and unplanned in-house development. Contract award should follow evidence of true flight qualification capability, not just past involvement in similar missions.
4. Anomalies and Informal Observations Should Be Captured and Acted On
Critical issues — from the BSM flexure tolerance problem to foam shedding on the Shuttle — were either first spotted informally or seen repeatedly without immediate corrective action. Formal processes for capturing, escalating, and tracking both anomalies and “soft” observations are essential to avoid normalising risk.
5. Schedule and Resource Planning Must Include Realistic Contingency
Compressed schedules at the start of a project often left no room for rework when problems emerged. In some cases, concurrent operations and staffing shortfalls added risk late in the delivery phase. Building schedule buffers and protecting analytical, safety, and documentation capacity under pressure is key to sustaining mission success.
The AI Research Paradox
Those conclusions seem reasonable enough, although I didn’t take the time to verify them independently from the source document.
And therein lies the paradox. Allow ChatGPT to explain…
The paradox in your reaction is a classic one in research and analysis — especially when AI is in the mix:
You’re saying the five lessons seem reasonable (they align with common sense, known project management principles, and your own experience), yet you haven’t independently verified them against the source document.
That’s the paradox:
- Familiarity bias – Because the lessons “fit” with what you already believe to be true, they feel valid, even without verification. This makes it easy to accept them at face value.
- Source credibility assumption – The fact they came from a supposedly thorough, AI-assisted review of a NASA lessons learned binder gives them an extra aura of authority.
- Verification gap – At the same time, without re-checking the binder yourself, there’s a possibility that I (or the AI process) over-weighted certain cases, omitted contradictory examples, or introduced subtle interpretation errors.
The risk is that “seems right” and “is correct” can diverge — and the more authoritative and well-structured the output looks, the easier it is to miss that gap.
This is why in human-in-the-loop AI analysis (what we discussed earlier), the human’s role isn’t just to make the findings sound good — it’s to actively test them against the raw material to confirm they’re not only plausible, but evidenced.
In short: The paradox is that the perceived credibility of the findings comes from how well they align with existing knowledge and expectations, not from independent validation — yet in research, credibility should be grounded in evidence, not familiarity.
So what?
In the end, this experiment wasn’t really about finding out what makes a NASA project succeed or stumble — it was about testing what happens when you hand an AI a mountain of raw, messy, and uneven data and ask it to make sense of it.
ChatGPT proved it can process at a speed and scale no human could match, structure findings neatly, and surface patterns worth exploring.
But it also showed that without a human steering the process, checking the balance of sources, and validating the output, even the most polished analysis risks reflecting the quirks of the dataset more than the full truth.
For me, that’s the real takeaway.
AI can be an extraordinary partner in complex reviews like this, but only if we treat it as a collaborator, not a substitute. It’s there to do the heavy lifting, crunch through the detail, and give us a starting point.
It’s our job to bring the judgment, context, and healthy scepticism that keeps the results grounded in evidence. When we combine the strengths of both, we not only learn more — we learn better.