Task 19

Thorn Thaler

2025-07-25

1 Setup

1.1 Libraries

library(httr)
library(xml2)
library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(kableExtra)

1.2 Retrieve Data from AoC

session_cookie <- set_cookies(session = keyring::key_get("AoC-GitHub-Cookie"))
base_url <- paste0("https://adventofcode.com/", params$year, "/day/", params$task_nr)
puzzle <- GET(base_url,
              session_cookie) %>% 
  content(encoding = "UTF-8") %>% 
  xml_find_all("//article") %>% 
  lapply(as.character)

parse_puzzle_data <- function(text_block = readClipboard()) {
  if (length(text_block) == 1L) {
    # a single string: split it into lines and drop empty ones
    text_block <- text_block %>% 
      str_split("\n") %>% 
      extract2(1L) %>% 
      keep(nzchar)
  }
  # lines containing "=>" are rules, the remaining line is the molecule
  rules <- str_subset(text_block, "=>") %>% 
    str_split(" => ") %>% 
    do.call(rbind, .) %>% 
    set_colnames(c("from", "to")) %>% 
    as_tibble()
  molecule <- str_subset(text_block, "=>", negate = TRUE)
  list(rules = rules, molecule = molecule)
}
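
To sanity-check the parser, we can feed it the example replacements from Part 1 (a minimal, hypothetical stand-in for the real input):

# returns a list holding a 3 x 2 rules tibble and the molecule "HOH"
parse_puzzle_data(c("H => HO", "H => OH", "O => HH", "HOH"))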

puzzle_data <- local({
  GET(paste0(base_url, "/input"),
      session_cookie) %>% 
    content(encoding = "UTF-8") %>% 
    parse_puzzle_data()
})

2 Puzzle Day 19

2.1 Part 1

2.1.1 Description

— Day 19: Medicine for Rudolph —

Rudolph the Red-Nosed Reindeer is sick! His nose isn’t shining very brightly, and he needs medicine.

Red-Nosed Reindeer biology isn’t similar to regular reindeer biology; Rudolph is going to need custom-made medicine. Unfortunately, Red-Nosed Reindeer chemistry isn’t similar to regular reindeer chemistry, either.

The North Pole is equipped with a Red-Nosed Reindeer nuclear fusion/fission plant, capable of constructing any Red-Nosed Reindeer molecule you need. It works by starting with some input molecule and then doing a series of replacements, one per step, until it has the right molecule.

However, the machine has to be calibrated before it can be used. Calibration involves determining the number of molecules that can be generated in one step from a given starting point.

For example, imagine a simpler machine that supports only the following replacements:

H => HO
H => OH
O => HH

Given the replacements above and starting with HOH, the following molecules could be generated:

  • HOOH (via H => HO on the first H).
  • HOHO (via H => HO on the second H).
  • OHOH (via H => OH on the first H).
  • HOOH (via H => OH on the second H).
  • HHHH (via O => HH).

So, in the example above, there are 4 distinct molecules (not five, because HOOH appears twice) after one replacement from HOH. Santa’s favorite molecule, HOHOHO, can become 7 distinct molecules (over nine replacements: six from H, and three from O).

The machine replaces without regard for the surrounding characters. For example, given the string H2O, the transition H => OO would result in OO2O.

Your puzzle input describes all of the possible replacements and, at the bottom, the medicine molecule for which you need to calibrate the machine. How many distinct molecules can be created after all the different ways you can do one replacement on the medicine molecule?

2.1.2 Solution

We use regular expressions to solve this task: for each distinct left-hand side we locate all matches in the molecule, generate one candidate molecule per combination of match and replacement, and count the distinct results.

count_new_molecules <- function(molecule, rules) {
  unique_patterns <- rules %>% 
    pull(from) %>% 
    unique()
  # replace the match at [start, end] by the replacement string
  replace_molecule <- function(molecule, start, end, to) {
    str_sub(molecule, start, end) <- to
    molecule
  }
  # one row per combination of match position and replacement
  all_matches <- map(unique_patterns, function(reg) {
    pos <- str_locate_all(molecule, reg) %>%
      extract2(1L) %>% 
      as_tibble()
    to <- rules %>% 
      filter(from == reg) %>% 
      pull(to)
    pos %>% 
      mutate(to = list(to)) %>% 
      mutate(from = reg, .before = 1L) %>% 
      unnest(cols = c(to))
  }) %>% 
    list_rbind() %>% 
    rowwise()
  # apply each replacement and count the distinct resulting molecules
  all_matches %>% 
    mutate(new_molecule = replace_molecule(molecule, start, end, to)) %>% 
    pull(new_molecule) %>% 
    unique() %>% 
    length()
}

count_new_molecules(puzzle_data$molecule, puzzle_data$rules)
## [1] 576
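
As a cross-check, the worked example from the description should yield the 4 distinct molecules stated there (a minimal sketch, not part of the original pipeline):

count_new_molecules("HOH", tibble(from = c("H", "H", "O"),
                                  to   = c("HO", "OH", "HH")))
## [1] 4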

2.2 Part 2

2.2.1 Description

— Part Two —

Now that the machine is calibrated, you’re ready to begin molecule fabrication.

Molecule fabrication always begins with just a single electron, e, and applying replacements one at a time, just like the ones during calibration.

For example, suppose you have the following replacements:

e => H
e => O
H => HO
H => OH
O => HH

If you’d like to make HOH, you start with e, and then make the following replacements:

  • e => O to get O
  • O => HH to get HH
  • H => OH (on the second H) to get HOH

So, you could make HOH after 3 steps. Santa’s favorite molecule, HOHOHO, can be made in 6 steps.

How long will it take to make the medicine? Given the available replacements and the medicine molecule in your puzzle input, what is the fewest number of steps to go from e to the medicine molecule?

2.2.2 Solution

This was a tough nut to crack. First attempts at (guided) brute force led nowhere, and neither did the idea of parsing the string into a nested list that could eventually be transformed into a tree. There, we walked from the bottom to the top, reducing Rn Ar brackets one after another. However, the remaining top-level string was still too large to brute-force all possible rule combinations.

Some observations we made along the way were:

  1. Rules that do not contain Y, Rn or Ar.
  2. Rules that are of the form x Rn z Ar where x and z are single atoms.
  3. Rules that are of the form x Rn (z Y w)+ Ar where x, z and w are single atoms.

We can regard Rn as "(", Ar as ")" and Y as ",". This translates the relevant rules to the following:

puzzle_data$rules %>% 
  filter(str_detect(to, "Ar")) %>% 
  mutate(to = str_replace_all(to, c(Ar = ")", Y = ",", Rn = "("))) %>%
  mutate(" => " = " => ", .after = 1L) %>% 
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE) %>% 
  column_spec(1:3, monospace = TRUE)
from => to
Al => Th(F)
B => Ti(F)
Ca => P(F)
Ca => Si(F,F)
Ca => Si(Mg)
H => C(Al)
H => C(F,F,F)
H => C(F,Mg)
H => C(Mg,F)
H => N(F,F)
H => N(Mg)
H => O(F)
N => C(F)
O => C(F,F)
O => C(Mg)
O => N(F)
P => Si(F)

Furthermore, we observe that there are only 4 types of production rules (X standing for any atom, i.e. anything except "(", ")" or ","):

  1. XX => X or XX => e
  2. X(X) => X
  3. X(X,X) => X
  4. X(X,X,X) => X

That is, each rule (when applied in reverse to the full string) reduces the string length (measured in tokens, not characters) by 1 (rule (1)), 3 (rule (2)), 5 (rule (3)) or 7 (rule (4)). To reduce a string of n tokens to a single token we would need n - 1 steps if we applied only rules of type (1). A bracket rule (2)-(4) also takes exactly one step per application, but removes the two brackets for free, and each comma in addition removes 2 symbols for free (the "," and one X). Thus the number of steps is

steps = #tokens - #brackets - 2 * #commas - 1

and we merely need to count tokens, brackets (Rn, Ar) and commas (Y) to get an analytical solution.

This gives us the correct answer; our attempt to implement an explicit reduction algorithm, however, failed.

N.B. This problem kept bothering me, and after reading more on the underlying theory I figured that it can be solved via an adapted version of the Cocke–Younger–Kasami algorithm. I include my working C++ code, which solves the puzzle by means of the CYK algorithm, in the appendix.

count_reduction_steps <- function(molecule) {
  str_count(molecule, "[A-Z][a-z]?") -  # total number of tokens (atoms)
    str_count(molecule, "Rn|Ar") -      # brackets are removed for free
    2 * str_count(molecule, "Y") -      # each comma saves itself plus one atom
    1                                   # we stop at a single token
}

count_reduction_steps(puzzle_data$molecule)
## [1] 207
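
As a quick plausibility check, take the right-hand side of the rule O => C(F,F) from the table above: the string CRnFYFAr consists of 6 tokens and collapses to a single atom in one step.

count_reduction_steps("CRnFYFAr") # 6 tokens - 2 brackets - 2 * 1 comma - 1
## [1] 1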

3 CYK Algorithm

The Cocke–Younger–Kasami (CYK) algorithm determines whether a given string belongs to the language generated by a given grammar. However, this grammar must be a context-free grammar given in Chomsky normal form (CNF).

The CNF dictates that every non-terminal must resolve either to a single terminal or to exactly two non-terminals. Thus, we first needed to translate the given rules into CNF before we could apply CYK; the comment block inside add_rule below shows how a long rule is broken into a chain of binary rules.

The CYK algorithm in its original form uses dynamic programming to determine whether the substring from position i to position j can be built from the rules. It starts with single tokens, which can be built if there is a production rule yielding that terminal. Subsequent iterations then use the previously calculated values to decide whether longer substrings can be formed. For example, to determine whether we can build the substring from position 1 to 2, we simply check whether some non-terminal NT1 produces the substring from 1 to 1, some non-terminal NT2 produces the substring from 2 to 2, and there is a rule producing NT1 NT2; if so, the substring from position 1 to 2 can be produced as well.

The deeper we go into the iteration, the more combinations must be checked, but eventually we fill the last cell. If in the end this cell is not empty (or, more generally, contains the start symbol), we can conclude that the string can be built.

In its original form the algorithm tests merely for membership, but in our adapted form we also count the number of replacements along the way (because we had to transform the problem into CNF upfront, we must take care to count only the original rules, not the synthetic ones that were introduced merely to satisfy the CNF requirement).

A great visual tool for understanding the CYK algorithm can be found here.
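
To make the table filling concrete, here is a minimal CYK membership test in R (an illustrative sketch on the HOH grammar from Part 2, hand-converted to CNF; it only answers yes/no, whereas the C++ implementation below also counts the steps):

cyk_member <- function(tokens, binary_rules, terminal_rules) {
  n <- length(tokens)
  # tab[[i]][[j]] holds the non-terminals which derive tokens i..j
  tab <- replicate(n, replicate(n, character(0), simplify = FALSE),
                   simplify = FALSE)
  for (i in seq_len(n)) {
    tab[[i]][[i]] <- names(terminal_rules)[terminal_rules == tokens[i]]
  }
  for (len in seq_len(n)[-1L]) {            # substring lengths 2..n
    for (i in seq_len(n - len + 1L)) {
      j <- i + len - 1L
      for (k in i:(j - 1L)) {               # split position
        for (lhs in names(binary_rules)) {
          for (rhs in binary_rules[[lhs]]) {
            if (rhs[1L] %in% tab[[i]][[k]] &&
                rhs[2L] %in% tab[[k + 1L]][[j]]) {
              tab[[i]][[j]] <- union(tab[[i]][[j]], lhs)
            }
          }
        }
      }
    }
  }
  length(tab[[1L]][[n]]) > 0L
}

# H => HO | OH and O => HH, plus the trivial terminal rules H => "H", O => "O"
cyk_member(c("H", "O", "H"),
           binary_rules   = list(H = list(c("H", "O"), c("O", "H")),
                                 O = list(c("H", "H"))),
           terminal_rules = c(H = "H", O = "O"))
## [1] TRUE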

#ifndef STANDALONE
#include <Rcpp.h>
using namespace Rcpp;
#endif

#include <cctype>
#include <iostream>
#include <map>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Token = std::string;
using DoubleToken = std::pair<Token, std::optional<Token>>;

class Grammar {
  private:
    std::set<Token> terminals;
    std::set<Token> non_terminals;
    std::set<Token> original_rules;
    std::map<Token, size_t> bin_cnts;
    std::map<Token, std::set<DoubleToken>> rules;
    std::map<DoubleToken, std::set<Token>> inverse_rules;

    Token make_nonterminal(const Token&, bool);
    std::vector<Token> tokenize_input(const Token&) const;
    void add_rule(const Token&, const std::vector<Token>&);

    bool is_original_rule(const Token&, const DoubleToken&) const;

    std::map<Token, unsigned int> run_cyk(const std::vector<Token>&) const;

  public:
    Grammar(const std::map<Token, std::vector<std::vector<Token>>>&);

    bool is_member(const Token&) const;
    bool is_member(const std::vector<Token>&) const;

    unsigned int get_nr_replacements(const Token&) const;
    unsigned int get_nr_replacements(const std::vector<Token>&) const;

    friend std::ostream& operator<<(std::ostream&, const Grammar&);
};

Grammar::Grammar(const std::map<Token, std::vector<std::vector<Token>>>& rules) {
  for (const auto& [symbol, rule] : rules) {
    for (const auto& r : rule) {
      add_rule(symbol, r);
    }
  }
}

bool Grammar::is_member(const std::vector<Token>& word_list) const {
  return run_cyk(word_list).size() > 0;
}

bool Grammar::is_member(const Token& word) const {
  return is_member(tokenize_input(word));
}

unsigned int Grammar::get_nr_replacements(const std::vector<Token>& word_list) const {
  std::map<Token, unsigned int> cyk_results = run_cyk(word_list);
  unsigned int result = -1; // set to maximum value
  for (const auto& [key, n] : cyk_results) {
    result = (n < result) ? n : result;
  }
  return result;
}

unsigned int Grammar::get_nr_replacements(const Token& word) const {
  return get_nr_replacements(tokenize_input(word));
}

std::ostream& operator<<(std::ostream& stream, const Grammar& grammar) {
  stream << "Terminals:" << std::endl << "\t[";
  for (auto it = grammar.terminals.begin(); it != grammar.terminals.end(); ++it) {
    stream << "\"" << *it << "\"";
    if (next(it) != grammar.terminals.end()) {
      stream << ", ";
    }
  }
  stream << "]" << std::endl;
  stream << "Non-Terminals:" << std::endl << "\t[";
  for (auto it = grammar.non_terminals.begin(); it != grammar.non_terminals.end(); ++it) {
    stream << *it;
    if (next(it) != grammar.non_terminals.end()) {
      stream << ", ";
    }
  }
  stream << "]" << std::endl;
  stream << "Original Rules:" << std::endl << "\t[";
  for (auto it = grammar.original_rules.begin(); it != grammar.original_rules.end(); ++it) {
    stream << *it;
    if (next(it) != grammar.original_rules.end()) {
      stream << ", ";
    }
  }
  stream << "]" << std::endl << "Rules:" << std::endl;
  for (const auto& rule : grammar.rules) {
    stream << "\t" << rule.first << " => ";
    auto it = rule.second.begin();
    while (it != rule.second.end()) {
      if (it->second) {
        stream << it->first << " " << *(it->second);
      } else {
        stream << "\"" << it->first << "\"";
      }
      ++it;
      if (it != rule.second.end()) {
        stream << " | ";
      }
    }
    stream << std::endl;
  }
  return stream;
}

Token Grammar::make_nonterminal(const Token& symbol, bool add_suffix = false) {
  Token nt = "NT_" + symbol;
  if (add_suffix) {
    size_t suffix = bin_cnts[symbol] + 1;
    nt = nt + "_" + std::to_string(suffix);
    bin_cnts[symbol]++;
  }
  return nt;
}

std::vector<Token> Grammar::tokenize_input(const Token& word) const {
  Token token;
  std::vector<Token> word_list;
  for (auto it = word.begin(); it != word.end(); ++it) {
    token = *it;
    if (std::next(it) != word.end() && std::islower(*std::next(it))) {
      token += *std::next(it);
      ++it; // Skip the next character as it is already added
    }
    word_list.push_back({token});
  }
  return word_list;
}

void Grammar::add_rule(const Token& lhs, const std::vector<Token>& rhs) {
  DoubleToken p1;
  Token nt_lhs, nt_rhs_left, nt_rhs_right, nt_prev;
  bool need_bin = rhs.size() > 2;
  nt_prev = make_nonterminal(lhs);
  original_rules.insert(nt_prev); // add lhs to original rules needed to count only them
  for (size_t i = 0; i < rhs.size() - 1; ++i) {
    /*
     * Original string: H => CRnFYFYFAr
     * lhs = "H"
     * rhs = {"C", "Rn", "F", "Y", "F", "Ar"}
     * NT_H => NT_C NT_H_1     [0]
     * NT_H_1 => NT_Rn NT_H_2  [1]
     * NT_H_2 => NT_F NT_H_3   [2]
     * NT_H_3 => NT_Y NT_H_4   [3]
     * NT_H_4 => NT_F NT_Ar    [4]
     * NT_H => "H"
     * NT_C => "C"
     * NT_Rn = "Rn"
     * NT_F => "F"
     * NT_Y => "Y",
     * NT_Ar => "Ar"
     */
    terminals.insert(rhs[i]);
    terminals.insert(rhs[i + 1]);
    nt_rhs_left = make_nonterminal(rhs[i]); // NT_C
    p1 = std::make_pair(rhs[i], std::nullopt);
    rules[nt_rhs_left].insert(p1); // NT_C => "C"
    inverse_rules[p1].insert(nt_rhs_left);
    if (need_bin && i != (rhs.size() - 2)) {
      nt_rhs_right = make_nonterminal(lhs, true);
    } else {
      nt_rhs_right = make_nonterminal(rhs[i + 1]);
      p1 = std::make_pair(rhs[i + 1], std::nullopt);
      rules[nt_rhs_right].insert(p1); // NT_Ar => "Ar"
      inverse_rules[p1].insert(nt_rhs_right);
    }
    p1 = std::make_pair(nt_rhs_left, nt_rhs_right);
    rules[nt_prev].insert(p1);
    inverse_rules[p1].insert(nt_prev);
    non_terminals.insert(nt_rhs_left);
    non_terminals.insert(nt_rhs_right);
    non_terminals.insert(nt_prev);
    nt_prev = nt_rhs_right;
  }
}

bool Grammar::is_original_rule(const Token& rule, const DoubleToken& rhs) const {
  return original_rules.find(rule) != original_rules.end() && rhs.second.has_value();
}

std::map<Token, unsigned int> Grammar::run_cyk(const std::vector<Token>& word_list) const {
  DoubleToken p1;
  std::set<Token> rhs;
  size_t n = word_list.size();

  std::map<int, std::map<int, std::map<Token, unsigned int>>> token_table;
  for (size_t j = 0; j < n; ++j) {
    p1 = std::make_pair(word_list[j], std::nullopt);
    if (inverse_rules.find(p1) != inverse_rules.end()) {
      rhs = inverse_rules.at(p1);
      for (const auto& r : rhs) {
        token_table[j][j][r] = 0U;
      }
    }
    for (int i = j; i >= 0; --i) {
      for (size_t k = i; k <= j; ++k) {
        for (const auto& [rl, il] : token_table[i][k]) {
          for (const auto& [rr, ir] : token_table[k + 1][j]) {
            p1 = std::make_pair(rl, rr);
            if (inverse_rules.find(p1) != inverse_rules.end()) {
              rhs = inverse_rules.at(p1);
              for (const auto& r : rhs) {
                token_table[i][j][r] = il + ir + (is_original_rule(r, p1) ? 1U : 0U);
              }
            }
          }
        }
      }
    }
  }
  return token_table[0][n - 1];
}

#ifndef STANDALONE
// [[Rcpp::export]]
unsigned int count_replacements(List rules, Token molecule) {
  Token symbol;
  CharacterVector symbols = rules.names();
  std::map<Token, std::vector<std::vector<Token>>> rule_map;
  for (R_xlen_t i = 0; i < rules.size(); ++i) {
    symbol = as<Token>(symbols[i]);
    std::vector<Token> rule_vec = as<std::vector<Token>>(rules[i]);
    rule_map[symbol].push_back(rule_vec);
  }
  Grammar grammar(rule_map);
  return grammar.get_nr_replacements(molecule);
}
#else
int main() {
  std::map<Token, std::vector<std::vector<Token>>> rules = {{"H", {{"H", "O"}, {"O", "H"}}},
                                                            {"O", {{"H", "H"}}}};
  Grammar grammar(rules);
  Token molecule = "HOH";
  unsigned int replacements = grammar.get_nr_replacements(molecule);
  std::cout << "Number of replacements for molecule '" << molecule << "': " << replacements
            << std::endl;
  return 0;
}
#endif
All that is left is to bring the rules into the shape expected by count_replacements(): a named list mapping each left-hand side to its tokenized right-hand sides.

# tokenize each right-hand side into atoms (an upper-case letter
# optionally followed by a lower-case one)
rules <- puzzle_data$rules %>% 
  mutate(to = str_split(to, "(?<=\\w)(?=[A-Z])"))

# duplicated names are fine: the C++ wrapper collects all right-hand
# sides belonging to the same symbol
rules <- rules %>% 
  pull(to) %>% 
  set_names(rules %>% 
              pull(from))

count_replacements(rules, puzzle_data$molecule)
## [1] 207