Knowledge Graph Generator

Ruhaan Choudhary
Ritesh Kumar

GitHub Repository
Hackathon Submission (Devpost)

Abstract

The Knowledge Graph Generator (KGG) is a tool designed to extract structured information from unstructured text and visualize it as a knowledge graph. Leveraging large language models and modern visualization libraries, KGG enables users to input any text and receive an interactive graph of entities and their relationships. The project is implemented in Python using HuggingFace Transformers, PyVis, and Google Colab for an accessible and interactive experience.

Introduction

The Knowledge Graph Generator (KGG) is an AI-powered application that transforms unstructured text into a structured knowledge graph. By identifying key entities and their relationships, KGG provides a visual and interactive representation of information, aiding in understanding, exploration, and further analysis. The project is built using Python, HuggingFace Transformers, and PyVis, and is designed to run seamlessly in Google Colab.

Approach & Methodology

The core approach of KGG is distinctive: instead of relying on external APIs to access a large language model (LLM), the model is loaded and run entirely locally. This keeps user data private, removes the dependency on internet connectivity at inference time (once the model weights have been downloaded), and avoids API usage costs. However, running a state-of-the-art LLM such as Open-Orca/Mistral-7B-OpenOrca locally presents significant hardware challenges, as these models typically require more than 15 GB of GPU memory at 16-bit precision.

To overcome this, quantization is employed. Specifically, the model is loaded using BitsAndBytesConfig with 4-bit (nf4) quantization, drastically reducing the memory footprint and enabling efficient inference even on consumer-grade GPUs. This allows the entire pipeline, from prompt construction through entity extraction to graph visualization, to be executed locally, making the solution both powerful and accessible.
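A back-of-the-envelope calculation illustrates the savings (assuming roughly 7 billion parameters and counting weights only, ignoring activations and the KV cache):

params = 7_000_000_000
print(f"fp16 weights (16 bits/param): {params * 2 / 1e9:.1f} GB")    # ~14.0 GB
print(f"nf4 weights   (4 bits/param): {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB

The pipeline itself consists of five steps: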

  1. Model Loading and Quantization: The Open-Orca/Mistral-7B-OpenOrca model is loaded locally with HuggingFace Transformers, using BitsAndBytesConfig to apply 4-bit (nf4) quantization, which makes inference feasible on hardware with limited GPU memory.
  2. Prompt Engineering: A system prompt instructs the model to extract entities and relationships from the context and to output them as a JSON array of objects with the fields node1, node2, and relationship (an illustrative example follows this list).
  3. Text Processing: The user-provided text is formatted into a prompt and passed to the locally running model. The model generates a response, which is parsed to extract the JSON array of relationships.
  4. Knowledge Graph Construction: The extracted entities and relationships are used to build a graph using PyVis, where nodes represent entities and edges represent relationships.
  5. Visualization: The resulting graph is rendered as interactive HTML, allowing users to explore the knowledge graph visually within the notebook or exported as a standalone HTML file.
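For illustration, a sentence such as "Marie Curie discovered radium and was awarded the Nobel Prize" could plausibly yield relations of the following shape (the entities and relation labels here are hypothetical, not actual model output):

relations = [
    {"node1": "Marie Curie", "node2": "radium", "relationship": "discovered"},
    {"node1": "Marie Curie", "node2": "Nobel Prize", "relationship": "was awarded"},
]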

Features

  1. Fully local LLM inference with 4-bit (nf4) quantization; no external API keys or per-request costs.
  2. Automatic extraction of entities and relationships from arbitrary text into structured JSON.
  3. Interactive, directed knowledge graph visualization built with PyVis.
  4. Export of the generated graph as a standalone HTML file.
  5. A simple search-bar interface that runs inside a Google Colab notebook.

Applications

By turning unstructured text into a structured, explorable graph, KGG supports knowledge discovery, understanding, and further analysis of documents in research, education, and industry settings.

Algorithms & Implementation

Model Loading and Quantization


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization keeps the 7B model's
# weight footprint within consumer-GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_name = "Open-Orca/Mistral-7B-OpenOrca"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.config.use_cache = False
model.config.pretraining_tp = 1

Prompt Engineering and Extraction


import json

def getprompt(text):
    SYS_PROMPT = (
        "You are an AI assistant tasked with extracting structured information from the context to create a knowledge graph. "
        "Your goal is to identify key entities and their relationships in the context and present this information in a JSON format "
        "with fields: 'node1', 'node2', and 'relationship'."
    )
    USER_PROMPT = f"context: ```{text}``` \n\n output: "
    PROMPT = f"{SYS_PROMPT}\n\n{USER_PROMPT}"
    return PROMPT

def function(text):
    prompt = getprompt(text)
    inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_length=1024, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Take the first bracketed span in the response as the JSON array of
    # relations (this assumes the array contains no nested brackets).
    json_response = response.split("[")[1].split("]")[0]
    json_response = "[" + json_response + "]"
    json_response = json.loads(json_response)
    return json_response

Knowledge Graph Construction and Visualization


from pyvis.network import Network

def generate_knowledge_graph(text):
    data = function(text)
    net = Network(notebook=True, directed=True, cdn_resources='remote')
    for relation in data:
        # One node per entity (PyVis deduplicates repeated ids) and one
        # labeled, directed edge per extracted relationship.
        net.add_node(relation['node1'], label=relation['node1'], title=relation['node1'])
        net.add_node(relation['node2'], label=relation['node2'], title=relation['node2'])
        net.add_edge(relation['node1'], relation['node2'], title=relation['relationship'], label=relation['relationship'])
    # Repulsion physics spreads the nodes out for readability.
    net.repulsion(node_distance=180, spring_length=100)
    return net.generate_html()
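
A minimal usage sketch (the input sentence is illustrative): the returned HTML string can be rendered inline in the notebook or written out as a standalone file, as noted above.

from IPython.display import HTML

html = generate_knowledge_graph("Marie Curie discovered radium.")
HTML(html)  # render the interactive graph inline

with open("knowledge_graph.html", "w") as f:
    f.write(html)  # or save as a standalone HTML file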

User Interface

The user interface is implemented using HTML and JavaScript within the Colab notebook. Users can input text into a search bar, and upon clicking the search button, the knowledge graph is generated and displayed interactively. The UI is styled for clarity and ease of use, with responsive design and dynamic feedback.
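
The notebook's exact UI code is not reproduced here; the following is a minimal sketch of how such a search bar can be wired to the Python backend in Colab via google.colab.output.register_callback. The callback name 'kgg.generate' and the element ids are illustrative assumptions, not identifiers from the project.

from google.colab import output
from IPython.display import HTML, display

def on_search(text):
    # Render the generated graph into the output area of the invoking cell.
    display(HTML(generate_knowledge_graph(text)))

output.register_callback('kgg.generate', on_search)

display(HTML("""
<input id="kgg-input" type="text" placeholder="Enter text to analyze...">
<button onclick="kggSearch()">Search</button>
<script>
  function kggSearch() {
    const text = document.getElementById('kgg-input').value;
    google.colab.kernel.invokeFunction('kgg.generate', [text], {});
  }
</script>
"""))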

Figure: Knowledge Graph Generator user interface.
Figure: Example output, a generated knowledge graph.

Conclusion

The Knowledge Graph Generator project demonstrates the power of combining large language models with interactive visualization tools to extract and represent structured knowledge from unstructured text. By automating the process of entity and relationship extraction and providing an intuitive interface, KGG makes knowledge discovery accessible and efficient for users in research, education, and industry.
