FEATURE: Implemented basic parser rules #18
115
input.md
Normal file
@ -0,0 +1,115 @@
|
||||
# MarkdownToHtmlCompiler
|
||||
|
||||
### Project Overview
|
||||
|
||||
The goal is to create a program that reads a file containing text formatted in a simple version of
|
||||
Markdown and converts it into a valid HTML file. The program will need to identify and translate
|
||||
specific syntax (e.g., `# Heading` to `<h1>Heading</h1>`, `*text*` to `<em>text</em>`).
|
||||
|
||||
|
||||
### Implementation Requirements (Generated by Gemini)
|
||||
|
||||
Class Hierarchy: Design a class hierarchy to represent the components of your Markdown document. An
|
||||
abstract base class, Element, can define common behavior. Derived classes would then represent specific
|
||||
types of elements, such as Heading, Paragraph, BoldText, and ListItem. This is a perfect example of
|
||||
inheritance and polymorphism.
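
A minimal sketch of such a hierarchy is shown here. The names `Element`, `Heading`, `Paragraph`, `ToHtml` and `Render` are illustrative assumptions and are not taken from the project's actual node classes.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: every element knows how to render itself.
class Element {
public:
  virtual ~Element() = default;
  virtual std::string ToHtml() const = 0;
};

// Hypothetical derived class for "# Heading" style lines.
class Heading : public Element {
public:
  Heading(int level, std::string text)
      : level_(level), text_(std::move(text)) {}

  std::string ToHtml() const override {
    const std::string tag = "h" + std::to_string(level_);
    return "<" + tag + ">" + text_ + "</" + tag + ">";
  }

private:
  int level_;
  std::string text_;
};

// Hypothetical derived class for plain paragraphs.
class Paragraph : public Element {
public:
  explicit Paragraph(std::string text) : text_(std::move(text)) {}

  std::string ToHtml() const override { return "<p>" + text_ + "</p>"; }

private:
  std::string text_;
};

// Renders any mix of elements through the base-class interface.
std::string Render(const std::vector<std::unique_ptr<Element>> &elements) {
  std::string html;
  for (const auto &element : elements) {
    html += element->ToHtml() + "\n";
  }
  return html;
}
```

`Render` never needs to know which concrete type it holds; the virtual call picks the right `ToHtml`, which is the polymorphism this requirement asks for.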
|
||||
|
||||
Object Composition: A Document class can be composed of multiple Element objects, representing the
|
||||
entire file. A Parser class would be composed of helper methods to break down the input string and
|
||||
build the Document object. This shows how you can build a complex system from smaller, self-contained
|
||||
objects.
|
||||
|
||||
File I/O and Exceptions: You will need to use ifstream to read the Markdown file and ofstream to write
|
||||
the generated HTML file. Your code should use exceptions to gracefully handle potential errors, such
|
||||
as a file not being found.
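
One possible shape for the file handling is sketched below. `ReadFile` and `WriteFile` are hypothetical helper names, and throwing `std::runtime_error` is one reasonable choice of exception rather than a requirement.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads the whole file into a string, throwing if it cannot be opened.
std::string ReadFile(const std::string &path) {
  std::ifstream input(path);
  if (!input.is_open()) {
    throw std::runtime_error("Failed to open input file: " + path);
  }

  std::ostringstream buffer;
  buffer << input.rdbuf();
  return buffer.str();
}

// Writes the generated HTML, throwing if the file cannot be created.
void WriteFile(const std::string &path, const std::string &html) {
  std::ofstream output(path);
  if (!output.is_open()) {
    throw std::runtime_error("Failed to open output file: " + path);
  }
  output << html;
}
```

A caller would typically wrap both calls in a single `try` block and report `e.what()` to the user when a `std::runtime_error` is caught.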
|
||||
|
||||
Operator Overloading: Overload the << stream insertion operator for your Element and Document classes.
|
||||
This would allow you to easily print the generated HTML to the console or write it to a file, making
|
||||
your code cleaner and more readable.
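
As a rough illustration, a single overload on the base class is enough for every derived element. The `Element` and `Paragraph` classes and the `ToString` helper below are assumptions made for this sketch only.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical base class with a virtual render method.
class Element {
public:
  virtual ~Element() = default;
  virtual std::string ToHtml() const = 0;
};

class Paragraph : public Element {
public:
  explicit Paragraph(std::string text) : text_(std::move(text)) {}

  std::string ToHtml() const override { return "<p>" + text_ + "</p>"; }

private:
  std::string text_;
};

// One overload on the base class covers every derived element,
// because ToHtml() is dispatched virtually.
std::ostream &operator<<(std::ostream &os, const Element &element) {
  return os << element.ToHtml();
}

// Helper showing the operator writing to an in-memory stream.
std::string ToString(const Element &element) {
  std::ostringstream out;
  out << element;
  return out.str();
}
```

The same operator works unchanged with `std::cout` or a `std::ofstream`, since both derive from `std::ostream`.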
|
||||
|
||||
UML Diagram: The complexity of the class relationships makes a UML diagram an essential part of the
|
||||
project. It will help you plan your design and will be a key component of your submission.
|
||||
|
||||
Recursive Descent Parser: This is the primary algorithm you'll use. It's a top-down parsing technique
|
||||
where a set of recursive functions "descend" through the grammar of your simple Markdown language. For
|
||||
example, a parse_document() function would call parse_line(), which in turn might call parse_bold_text()
|
||||
or parse_italic_text(). This method is intuitive and easy to implement for a simple grammar.
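
The call structure described above can be sketched with one function per grammar rule. `MiniParser` is a hypothetical, deliberately tiny parser that only understands paragraphs and `*italic*` spans; because this toy grammar has no self-nesting rules, no function actually calls itself here.

```cpp
#include <cstddef>
#include <string>
#include <utility>

class MiniParser {
public:
  explicit MiniParser(std::string input) : input_(std::move(input)) {}

  // document := line*
  std::string ParseDocument() {
    std::string html;
    while (pos_ < input_.size()) {
      html += ParseLine();
    }
    return html;
  }

private:
  // line := (italic | char)* '\n'?
  std::string ParseLine() {
    std::string html = "<p>";
    while (pos_ < input_.size() && input_[pos_] != '\n') {
      if (input_[pos_] == '*') {
        html += ParseItalic();
      } else {
        html += input_[pos_++];
      }
    }
    if (pos_ < input_.size()) {
      ++pos_; // skip the newline that ended this line
    }
    return html + "</p>\n";
  }

  // italic := '*' char* '*'
  std::string ParseItalic() {
    ++pos_; // opening '*'
    std::string text;
    while (pos_ < input_.size() && input_[pos_] != '*' &&
           input_[pos_] != '\n') {
      text += input_[pos_++];
    }
    if (pos_ < input_.size() && input_[pos_] == '*') {
      ++pos_; // closing '*'
    }
    return "<em>" + text + "</em>";
  }

  std::string input_;
  std::size_t pos_ = 0;
};
```

Each function consumes exactly the input its rule covers and leaves the position at the next unread character, which is what lets the rules compose.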
|
||||
|
||||
Stack: A stack is essential for handling nested elements. For instance, if you allow bold text inside
|
||||
italic text (_This is *bold and italic* text_), you can push the _ token onto the stack and then push
|
||||
the * token. When you encounter the closing *, you check if the top of the stack matches. This ensures
|
||||
that all tags are correctly opened and closed. Your presentation can visually demonstrate this process
|
||||
with a stack diagram.
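
The push/pop check described above might look like the following sketch. `DelimitersBalanced` is a hypothetical helper that assumes only single-character `*` and `_` delimiters, and treats a delimiter equal to the top of the stack as a closing one.

```cpp
#include <stack>
#include <string>

// Returns true when every '*' and '_' delimiter is closed in the
// reverse order it was opened.
bool DelimitersBalanced(const std::string &text) {
  std::stack<char> open;

  for (char c : text) {
    if (c != '*' && c != '_') {
      continue; // ordinary text
    }

    if (!open.empty() && open.top() == c) {
      open.pop(); // closes the most recently opened delimiter
    } else {
      open.push(c); // opens a new, possibly nested, delimiter
    }
  }

  // Anything left on the stack was opened but never closed.
  return open.empty();
}
```

For `_This is *bold and italic* text_` the stack grows to `_`, `*` and then unwinds to empty, while an interleaved input such as `_a *b_ c*` leaves items on the stack and is rejected.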
|
||||
|
||||
Hash Map or Map: A hash map (std::unordered_map) or a map (std::map) can be used to efficiently store
|
||||
and retrieve the HTML equivalent for each Markdown tag. For example, you could map `#` to `<h1>` or `*`
|
||||
to `<em>`. With std::unordered_map this provides O(1) average-case lookup time; std::map lookups are O(log n).
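
A small lookup table along these lines is sketched below. `TagTable` and `HtmlTagFor` are hypothetical names, and the table stores bare tag names (`h1`, `em`) rather than `<h1>` so that both the opening and closing tag can be built from one entry; that choice is an assumption of this sketch.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical marker-to-tag table, built once on first use.
const std::unordered_map<std::string, std::string> &TagTable() {
  static const std::unordered_map<std::string, std::string> table = {
      {"#", "h1"}, {"##", "h2"}, {"*", "em"}, {"**", "strong"}};
  return table;
}

// Returns the HTML tag name for a Markdown marker, or an empty
// string when the marker is unknown.
std::string HtmlTagFor(const std::string &marker) {
  const auto it = TagTable().find(marker);
  return it == TagTable().end() ? std::string() : it->second;
}
```

Using `find` instead of `operator[]` avoids inserting an empty entry for unknown markers and works on a `const` map.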
|
||||
|
||||
|
||||
### Contribution Policy
|
||||
|
||||
###### Branching
|
||||
When working on this project, please use a feature branch (e.g. `feature/parser`) with a descriptive name.
|
||||
`feature/a` is not a descriptive name. These branches should be branched off the most recent `main` branch.
We will not use a `dev` or `staging` branch, since the project is small in both scale and duration.
|
||||
**However, if the project becomes larger or out-of-control, a dev/staging branch will be implemented.**
|
||||
|
||||
###### Commits
|
||||
|
||||
When working, it is best practice to commit code as often as possible, without being overzealous. For
|
||||
example, when a feature or bug fix is complete, it's time to commit. But when you have to make a new function,
that does not mean it's time. Each team member should use their best judgment.
|
||||
|
||||
Commit messages are a little more important. When working in a team, it is important to provide strong,
|
||||
clear and concise commit messages. In this project, the team will use a simple formula:
|
||||
|
||||
**(SUBJECT) Title: textual description**
|
||||
|
||||
e.g. (FIX) Rendering completed: a short explanation of what changed.
|
||||
|
||||
###### Pushing
|
||||
|
||||
When working in a feature branch, pushing and pulling has no restrictions. Feel free to do as much
|
||||
(or as little) as you like. However, you **CANNOT** push directly to `main`; the VCS will not allow you
|
||||
to do so, but do not make that mistake. When you are ready to merge a feature, you will create a PR
|
||||
and once it has been reviewed and approved it will be automatically merged in.
|
||||
|
||||
###### Pull Requests (PR)
|
||||
|
||||
Once a feature is complete, you will create a pull request. Before a request can be merged into `main`,
|
||||
one approval is required (from someone other than the author). This practice promotes teamwork and encourages
|
||||
code reviews. Each team member is expected to check in frequently and review as often as they are able to;
|
||||
however, there is no defined time requirement. Personal communication is totally acceptable as a means to
|
||||
request approval, since I am unsure if this platform will notify members.
|
||||
|
||||
###### Issues
|
||||
|
||||
If a bug, issue, or other concern is noticed, the first thing the team member should do is create an
|
||||
issue. An issue should be descriptive and contain everything another team member needs to understand the
|
||||
issue and its context. This way, a new team member can tackle the issue without contextual gaps.
|
||||
|
||||
If a member would like to work on the issue themselves, the `assignee` field is where this should be defined.
|
||||
If a member would like help from another member, they should assign the other team member to the issue, and
|
||||
leave a comment in the issue itself describing what help is needed.
|
||||
|
||||
**Labels** are important for understanding what type of issues/bugs exist in the application. When a bug is
|
||||
created, make sure the proper labels are applied. These labels will be abstract, such as: `bug`, `fix` or `feature`
|
||||
and they will also be specific, such as: `parser`, `i/o` or `processer`. A combination of both styles of labels
|
||||
allows other team members to understand what is going on. If a member feels a label is missing, they are free
to create new ones, but there is such a thing as **too many labels**; a few per issue is plenty. They are
|
||||
not meant to replace the description.
|
||||
|
||||
**Priority** is the final important factor to consider. In this project, priority will be defined using labels
|
||||
as well. The policy defined above will apply here to priority labels as well. However, these labels are
|
||||
**mutually exclusive**.
|
||||
|
||||
###### Projects (Sprints)
|
||||
|
||||
The use of the `projects` tab in the VCS will allow the team to remain organized and create notes and action
|
||||
items that should be completed before one another. These resemble `sprints` from the `AGILE` development life cycle.
|
||||
A new "project" should be created when a large piece of functionality needs to be created. Issues can **and should**
|
||||
be attached to the projects they are related to. This will continue to encourage teamwork and organization.
|
||||
|
||||
Projects should have defined criteria, such as input and outputs, expectations and a semi-defined timeline.
|
||||
Once a description is defined, tasks can be added and moved around as needed. The team will use **Kanban**
|
||||
project types, as they are simple and easy to understand for new team members.
|
||||
256
lib/parser.cpp
@ -1,9 +1,17 @@
|
||||
#include "parser.h"
|
||||
#include "inlineNode.h"
|
||||
#include "structureNode.h"
|
||||
#include "util.h"
|
||||
#include <algorithm>
|
||||
#include <cctype>
|
||||
#include <fstream>
|
||||
#include <memory>
|
||||
#include <sstream>
|
||||
#include <stdexcept>
|
||||
#include <string>
|
||||
|
||||
using std::string;
|
||||
using std::vector;
|
||||
|
||||
Parser::Parser(string input_file_path, string output_file_path) {
|
||||
// NOTE: Remove any white space AROUND the inputs
|
||||
@ -34,3 +42,251 @@ void Parser::Inspect() {
|
||||
std::cout << "std::string output_file_path: " << this->output_file_path
|
||||
<< std::endl;
|
||||
}
|
||||
|
||||
// replace '\r\n' with '\n'
|
||||
void Parser::NormalizeInputStream() {
|
||||
if (this->content.empty())
|
||||
return;
|
||||
|
||||
size_t pos = 0;
|
||||
while ((pos = content.find("\r\n", pos)) != string::npos) {
|
||||
this->content.replace(pos, 2, "\n");
|
||||
pos++;
|
||||
}
|
||||
|
||||
// NOTE: Remove all occurrences of '\r'
|
||||
this->content.erase(
|
||||
std::remove(this->content.begin(), this->content.end(), '\r'),
|
||||
this->content.end());
|
||||
}
|
||||
|
||||
void Parser::ParseDocument() {
|
||||
// Open the input file
|
||||
std::ifstream input_file(this->input_file_path);
|
||||
|
||||
if (!input_file.is_open()) {
|
||||
throw std::runtime_error("Failed to open input file.");
|
||||
|
||||
}
|
||||
|
||||
// Read the file into a single string
|
||||
std::stringstream buffer;
|
||||
buffer << input_file.rdbuf();
|
||||
this->content = buffer.str();
|
||||
|
||||
input_file.close();
|
||||
|
||||
// Normalize Windows line endings ("\r\n" -> "\n")
|
||||
NormalizeInputStream();
|
||||
|
||||
// We need document parent
|
||||
this->DOM = std::make_unique<DocumentNode>();
|
||||
|
||||
while (!IsEOF()) {
|
||||
// std::cout << Peek(); Consume();
|
||||
auto block = ParseBlock();
|
||||
if (block != nullptr)
|
||||
this->DOM->AddChild(std::move(block));
|
||||
}
|
||||
|
||||
std::cout << this->DOM->ToHtml();
|
||||
}
|
||||
|
||||
// All this does is pick which subparser to call
|
||||
// Identify which block to parse
|
||||
std::unique_ptr<Node> Parser::ParseBlock() {
|
||||
// Remove whitespace using peek and consume (' ', '\t', '\n')
|
||||
ConsumeWhiteSpace();
|
||||
|
||||
// NOTE: Simple example
|
||||
// std::string ch(1, Peek());
|
||||
// std::unique_ptr<Node> block = std::make_unique<TextNode>(ch);
|
||||
// Consume();
|
||||
|
||||
if (Peek() == '#') {
|
||||
return ParseHeading();
|
||||
}
|
||||
|
||||
// this is the default case
|
||||
return ParseParagraph();
|
||||
}
|
||||
|
||||
std::unique_ptr<Node> Parser::ParseParagraph() {
|
||||
auto node = std::make_unique<ParagraphNode>();
|
||||
|
||||
// This should call parse inline
|
||||
auto text_nodes = ParseInline();
|
||||
for (auto &text_node : text_nodes) {
|
||||
node->AddChild(std::move(text_node));
|
||||
}
|
||||
|
||||
if (node->GetChilren().size() < 1)
|
||||
return nullptr;
|
||||
|
||||
return node;
|
||||
}
|
||||
|
||||
std::unique_ptr<Node> Parser::ParseHeading() {
|
||||
// Count the leading '#' characters to determine the heading level
int level = 0;
while (Peek(level) == '#') {
  level++;
}

// Skip past the '#' markers
Consume(level);
auto node = std::make_unique<HeadingNode>(level);
|
||||
|
||||
ConsumeWhiteSpace();
|
||||
|
||||
std::string str;
|
||||
while (!IsEOF()) {
|
||||
c = Peek();
|
||||
// We can stop as soon as we see a new line. Headings are single line blocks
|
||||
if (c == '\n')
|
||||
break;
|
||||
|
||||
// Accumulate the heading text
|
||||
str += c;
|
||||
Consume();
|
||||
}
|
||||
|
||||
// BUG: Why do we need to check this?
|
||||
if (str == "")
|
||||
return nullptr;
|
||||
|
||||
auto text_node = std::make_unique<TextNode>(str);
|
||||
node->AddChild(std::move(text_node));
|
||||
|
||||
return node;
|
||||
}
|
||||
|
||||
vector<std::unique_ptr<Node>> Parser::ParseInline() {
|
||||
vector<std::unique_ptr<Node>> nodes;
|
||||
string str;
|
||||
|
||||
while (!IsEOF()) {
|
||||
char c = Peek();
|
||||
// If this char and next char are both newlines: then we have an empty line,
|
||||
// we should stop.
|
||||
if (c == '\n' && Peek(1) == '\n')
|
||||
break;
|
||||
|
||||
if (c == '*' && Peek(1) == '*' && Peek(2) == '*') {
|
||||
PushTextNode(nodes, str);
|
||||
nodes.push_back(ParseBoldItalic());
|
||||
continue;
|
||||
} else if (c == '*' && Peek(1) == '*') {
|
||||
PushTextNode(nodes, str);
|
||||
nodes.push_back(ParseBold());
|
||||
continue;
|
||||
} else if (c == '*') {
|
||||
PushTextNode(nodes, str);
|
||||
nodes.push_back(ParseItalic());
|
||||
continue;
|
||||
}
|
||||
|
||||
// If a newline, use a space instead
|
||||
str += (c == '\n' ? ' ' : c);
|
||||
Consume();
|
||||
}
|
||||
|
||||
// Push the last node, if the string is not empty
|
||||
PushTextNode(nodes, str);
|
||||
return nodes;
|
||||
}
|
||||
|
||||
std::unique_ptr<Node> Parser::ParseItalic() {
|
||||
string str;
|
||||
Consume(1);
|
||||
|
||||
while (!IsEOF()) {
|
||||
char c = Peek();
|
||||
|
||||
if (c == '\n' && Peek(1) == '\n')
|
||||
break;
|
||||
|
||||
if (c == '*') {
|
||||
Consume(1);
|
||||
break;
|
||||
}
|
||||
|
||||
str += c;
|
||||
Consume();
|
||||
}
|
||||
|
||||
return std::make_unique<ItalicNode>(str);
|
||||
}
|
||||
|
||||
std::unique_ptr<Node> Parser::ParseBold() {
|
||||
string str;
|
||||
Consume(2);
|
||||
|
||||
while (!IsEOF()) {
|
||||
char c = Peek();
|
||||
|
||||
if (c == '\n' && Peek(1) == '\n')
|
||||
break;
|
||||
|
||||
if (c == '*' && Peek(1) == '*') {
|
||||
Consume(2);
|
||||
break;
|
||||
}
|
||||
|
||||
str += c;
|
||||
Consume();
|
||||
}
|
||||
|
||||
return std::make_unique<BoldNode>(str);
|
||||
}
|
||||
|
||||
std::unique_ptr<Node> Parser::ParseBoldItalic() {
|
||||
string str;
|
||||
Consume(3);
|
||||
|
||||
while (!IsEOF()) {
|
||||
char c = Peek();
|
||||
|
||||
if (c == '\n' && Peek(1) == '\n')
|
||||
break;
|
||||
|
||||
if (c == '*' && Peek(1) == '*' && Peek(2) == '*') {
|
||||
Consume(3);
|
||||
break;
|
||||
}
|
||||
|
||||
str += c;
|
||||
Consume();
|
||||
}
|
||||
|
||||
return std::make_unique<BoldItalicNode>(str);
|
||||
}
|
||||
|
||||
void Parser::PushTextNode(vector<std::unique_ptr<Node>> &nodes, string &str) {
|
||||
if (!str.empty())
|
||||
nodes.push_back(std::make_unique<TextNode>(str));
|
||||
str.clear();
|
||||
}
|
||||
|
||||
char Parser::Peek(size_t offset) {
|
||||
size_t look_ahead_pos = this->position + offset;
|
||||
|
||||
if (look_ahead_pos < this->content.length()) {
|
||||
return this->content[look_ahead_pos];
|
||||
}
|
||||
|
||||
return '\0'; // null if past end
|
||||
}
|
||||
|
||||
void Parser::Consume(size_t count) { this->position += count; }
|
||||
|
||||
bool Parser::IsEOF() { return this->position >= this->content.length(); }
|
||||
|
||||
void Parser::ConsumeWhiteSpace() {
|
||||
// TODO: This can be optimized using an accumulator and then consuming
|
||||
char c = Peek();
|
||||
while (c == ' ' || c == '\t' || c == '\n') {
|
||||
Consume();
|
||||
c = Peek();
|
||||
}
|
||||
}
|
||||
|
||||
55
lib/parser.h
@ -1,11 +1,14 @@
|
||||
#ifndef PARSER_H
|
||||
#define PARSER_H
|
||||
|
||||
#include "node.h"
|
||||
#include <iostream>
|
||||
#include <memory>
|
||||
#include <stack>
|
||||
#include <string>
#include <vector>
|
||||
|
||||
using std::string;
|
||||
using std::vector;
|
||||
|
||||
/**
|
||||
* @brief Markdown parser class.
|
||||
@ -48,7 +51,7 @@ public:
|
||||
*
|
||||
* @author Hayden Hargreaves (hhargreaves2006@gmail.com)
|
||||
*/
|
||||
void ParseDocument(void);
|
||||
void ParseDocument();
|
||||
|
||||
protected:
|
||||
/**
|
||||
@ -70,35 +73,57 @@ protected:
|
||||
*/
|
||||
string output_file_path;
|
||||
|
||||
/**
|
||||
* @brief Parser generated tree.
|
||||
*
|
||||
* This value will store the root, which is expected to be a DocumentNode.
|
||||
* This node will mark the start of the tree. The parser will populate this
|
||||
* tree during the parsing process.
|
||||
*
|
||||
* @author Hayden Hargreaves (hhargreaves2006@gmail.com)
|
||||
*/
|
||||
std::unique_ptr<Node> DOM;
|
||||
|
||||
// NOTE: We need a stack, just not sure what goes in it yet
|
||||
// std::stack<any> stack;
|
||||
|
||||
private:
|
||||
// windows... >:(
|
||||
void NormalizeInputStream();
|
||||
|
||||
/**
|
||||
* @brief Parse a single line.
|
||||
* @brief Parse a single block of content
|
||||
*
|
||||
* How does this function work...
|
||||
* This is where the magic happens.
|
||||
*
|
||||
* @param line Target line to parse, as string.
|
||||
* @return DOMNode, once exists
|
||||
*
|
||||
* @author Hayden Hargreaves (hhargreaves2006@gmail.com)
|
||||
*/
|
||||
void ParseLine(string line);
|
||||
std::unique_ptr<Node> ParseBlock();
|
||||
|
||||
// NOTE: Parser operations, again, abstract, just for brainstorming now
|
||||
// These should operate on internal state, not lines themselves
|
||||
void ParseHeader();
|
||||
void ParseParagraph();
|
||||
void ParseItalic();
|
||||
void ParseBold();
|
||||
void ParseBoldItalic();
|
||||
// Stores index in the string
|
||||
size_t position = 0;
|
||||
|
||||
// NOTE: Character operations, these are just for brainstorming
|
||||
char Peek();
|
||||
void Consume();
|
||||
bool EndOfLine();
|
||||
// Working input content
|
||||
string content;
|
||||
|
||||
std::unique_ptr<Node> ParseParagraph();
|
||||
std::unique_ptr<Node> ParseHeading();
|
||||
vector<std::unique_ptr<Node>> ParseInline();
|
||||
|
||||
void PushTextNode(vector<std::unique_ptr<Node>> &nodes, string &str);
|
||||
|
||||
std::unique_ptr<Node> ParseItalic();
|
||||
std::unique_ptr<Node> ParseBold();
|
||||
std::unique_ptr<Node> ParseBoldItalic();
|
||||
|
||||
char Peek(size_t offset = 0);
|
||||
void Consume(size_t count = 1);
|
||||
bool IsEOF();
|
||||
|
||||
void ConsumeWhiteSpace();
|
||||
};
|
||||
|
||||
#endif
|
||||
|
||||
@ -126,4 +126,4 @@ std::string WatchDog::timePointToString(const fs::file_time_type& timePoint){
|
||||
std::strftime(buffer, sizeof(buffer), "%Y-%m-%d %H:%M:%S", &localTime);
|
||||
|
||||
return std::string(buffer);
|
||||
}
|
||||
}
|
||||
|
||||
@ -78,4 +78,10 @@ void test_input(int argc, char **argv) {
|
||||
std::cout << std::endl;
|
||||
}
|
||||
|
||||
int main(int argc, char **argv) { test_nodes(); }
|
||||
int main(int argc, char **argv) {
|
||||
Parser p("input.md");
|
||||
p.ParseDocument();
|
||||
|
||||
Parser p2("README.md");
|
||||
p2.ParseDocument();
|
||||
}
|
||||
|
||||