GitHub - YAGI0423/gpt_modules: A PyTorch-based library of GPT models and modules.

This repository provides a PyTorch-based library of GPT models and modules.

Author: YAGI

Last modified: 2025-03-28

2025.03.27: Code completed
2025.03.28: README completed
2025.03.28: Project closed
2025.03.30: Bug fixes; modules added and removed
- Fixed RoPE not being applied in the Grouped Query Attention module
- Added the RoPEAttention module
- Added the RoPETransformerBlock module
- Removed the AddNorm module

Project period: 2025-03-22 ~ 2025-03-28

Project Description

This project provides a range of PyTorch-based GPT models, along with modules used in GPT models such as RoPE (Rotary Positional Embedding) and MoE (Mixture of Experts). It also reports evaluation metrics for several of the models on the databricks/databricks-dolly-15k dataset from Hugging Face. Table 1 describes the GPT models provided by the gptModules library.

Model	Description	Code
GPT-1	-	`models.GPT(...)`
GPT-2	· Pre-Norm Layer	`models.GPT2(...)`
ALiBi GPT	· Pre-Norm Layer · ALiBi Embedding Layer	`models.ALiBiGPT(...)`
LLaMA	· Pre-Norm Layer · RoPE Embedding Layer · Grouped Query Attention · RMS Normalization	`models.LLaMA(...)`
DeepSeek V2	· Pre-Norm Layer · RoPE Embedding Layer · Multi Head Latent Attention · DeepSeek MoE · RMS Normalization	`models.DeepSeek(...)`

Table 1. gptModules Library Models.
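
Table 1 separates GPT-1 from the other models by the Pre-Norm layer. As a rough illustration in plain PyTorch (this is not the gptModules implementation; the class names `PostNormBlock` and `PreNormBlock` are hypothetical, and a single linear layer stands in for the attention or feed-forward sublayer), the two orderings differ only in where normalization sits relative to the residual connection:

```python
import torch
from torch import nn


class PostNormBlock(nn.Module):
    """Post-Norm ordering (GPT-1 style): norm(x + sublayer(x))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))


class PreNormBlock(nn.Module):
    """Pre-Norm ordering (GPT-2 style): x + sublayer(norm(x))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


x = torch.randn(1, 5, 64)          # Batch(1) x Seq(5) x d_model(64)
post_out = PostNormBlock(64)(x)    # output is normalized
pre_out = PreNormBlock(64)(x)      # residual path is left un-normalized
```

In the Pre-Norm variant the residual stream passes through the block without being normalized, which is the property usually credited with making deeper stacks easier to train.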

Beyond the models, the layers namespace of the gptModules library gives access to the individual modules used in GPT models. Table 2 describes the GPT modules provided by the library.

Layer	Module	Code
Embedding	· Embedding	layers.Embeddings(...)
	· Embedding Without Positional Embedding	layers.EmbeddingsWithoutPosition(...)
	· Rotary Positional Embedding	layers.RotaryPositionalEmbeddings(...)
	· ALiBi(Attention with Linear Biases) Positional Embedding	layers.ALiBiEmbeddings(...)
Normalization	· RMS Normalization	layers.RMSNorm(...)
Multi Head Attention	· Masked Multi Head Attention	layers.MaskedMultiHeadAttention(...)
	· ALiBi Attention	layers.ALiBiAttention(...)
	· RoPE Attention	layers.RoPEAttention(...)
	· GQA (Grouped Query Attention) with RoPE	layers.GroupedQueryAttention(...)
	· GQA Without RoPE	layers.GroupedQueryAttentionWithoutRoPE(...)
	· Multi Head Latent Attention with RoPE	layers.MultiHeadLatentAttention(...)
	· Multi Head Latent Attention Without RoPE	layers.MultiHeadLatentAttentionWithoutRoPE(...)
Feed Forward	· Position Wise Feed Forward	layers.PositionWiseFeedForward(...)
	· DeepSeek V2 Mixture of Experts (MoE)	layers.DeepSeekMoE(...)
Transformer Block	· Transformer Block	layers.TransformerBlock(...)
	· Pre-Norm Transformer Block	layers.PreNormTransformerBlock(...)
	· ALiBi Transformer Block	layers.ALiBiTransformerBlock(...)
	· RoPE Transformer Block	layers.RoPETransformerBlock(...)
	· Grouped Query Transformer Block	layers.GroupedQueryTransformerBlock(...)
	· Deep Seek Transformer Block	layers.DeepseekTransformerBlock(...)
	· Deep Seek Transformer Block Without RoPE	layers.DeepSeekTransformerBlockWithoutRoPE(...)

Table 2. gptModules Library Layers.
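
Several attention modules in Table 2 come in "with RoPE" and "without RoPE" variants. The sketch below shows what a rotary positional embedding does to a query or key tensor, written in plain PyTorch. It is not the code behind `layers.RotaryPositionalEmbeddings`: the function name `apply_rope`, the tensor layout, and the "rotate-half" pairing convention are assumptions made for illustration, and `base` plays the role of a `rope_base`-style hyperparameter.

```python
import torch


def apply_rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate feature pairs of x, shaped (batch, seq, heads, d_head), by position-dependent angles."""
    _, seq_len, _, d_head = x.shape
    half = d_head // 2
    # One frequency per feature pair: base^(-2i / d_head) for i = 0 .. half-1
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq, half)
    cos = angles.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]          # pair feature i with feature i + half
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


q = torch.randn(1, 8, 4, 16)   # Batch(1) x Seq(8) x Heads(4) x d_head(16)
q_rot = apply_rope(q)

# Relative-position check: the same query/key content placed at every position
same_q = torch.randn(16).expand(1, 8, 1, 16)
same_k = torch.randn(16).expand(1, 8, 1, 16)
qr, kr = apply_rope(same_q), apply_rope(same_k)
score_1_0 = (qr[0, 1, 0] * kr[0, 0, 0]).sum()   # positions (1, 0), offset 1
score_5_4 = (qr[0, 5, 0] * kr[0, 4, 0]).sum()   # positions (5, 4), offset 1
```

Because each feature pair is only rotated, vector norms are unchanged, and the attention score between a rotated query and key depends on their position offset rather than on their absolute positions.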

Each model in gptModules was trained on the databricks/databricks-dolly-15k dataset from Hugging Face, with GPT2Tokenizer as the tokenizer. All models share the base architecture hyperparameters n_layer=9, n_heads=10, d_model=560, and d_ff=2304. Among the models that require additional hyperparameters, LLaMA uses n_groups=5 and rope_base=500_000, and DeepSeek uses n_shared=1, d_ff=576, top_k=2, d_kv_comp=12, d_rope=14, and rope_base=10_000. The optimizer is AdamW. The learning rate rises from 0 to a peak of 0.00022 and then gradually decays. Training ran for 8 epochs with a batch size of 8. Fig 1 shows how the loss on the train and validation sets changes during training for each model, and Table 3 lists the final loss on the test set together with the Iter/Sec measured during training.
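
The learning-rate schedule described above (start at 0, rise to a peak of 0.00022, then decay) could be sketched with `AdamW` and `LambdaLR` as follows. Only the starting value and the peak come from the text; the warmup length, the total step count, and the cosine decay shape are assumptions chosen for illustration, and the linear layer is a placeholder model.

```python
import math

import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

PEAK_LR = 0.00022      # peak learning rate stated in the text
WARMUP_STEPS = 100     # assumption: not given in the source
TOTAL_STEPS = 1_000    # assumption: not given in the source


def lr_scale(step: int) -> float:
    """Multiplier on PEAK_LR: linear warmup from 0, then cosine decay toward 0."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


model = nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = LambdaLR(optimizer, lr_lambda=lr_scale)

lrs = []
for _ in range(TOTAL_STEPS):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used for this step
    optimizer.step()                        # a real loop would call loss.backward() first
    scheduler.step()
```

Plotting `lrs` gives a ramp up to `PEAK_LR` at step `WARMUP_STEPS` followed by a smooth decline.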

Fig 1. Training and Validation Loss Curves for Each Model.

Model	Test Loss	Iter/Sec
GPT-1	2.402	8.262
GPT-2	3.210	7.235
ALiBi GPT	1.702	7.132
LLaMA	2.681	4.523
DeepSeek V2	2.189	2.688

Table 3. Test Loss and Training Iter/Sec for Each Model.
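
If the test loss in Table 3 is the mean per-token cross-entropy in nats (an assumption; the source does not define the loss), it can be converted to perplexity with `exp(loss)`, which some readers find easier to compare. A minimal sketch using the values from Table 3:

```python
import math

# Test losses copied from Table 3.
test_loss = {
    "GPT-1": 2.402,
    "GPT-2": 3.210,
    "ALiBi GPT": 1.702,
    "LLaMA": 2.681,
    "DeepSeek V2": 2.189,
}

# Perplexity = exp(mean cross-entropy), assuming the loss is measured in nats per token.
perplexity = {name: math.exp(loss) for name, loss in test_loss.items()}

# Print models from lowest to highest perplexity.
for name, ppl in sorted(perplexity.items(), key=lambda item: item[1]):
    print(f"{name:12s} perplexity = {ppl:.2f}")
```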

Getting Started

Example

# Train and save the model
$ python train.py --model GPT --device cuda # choose a model: GPT, GPT2, ALiBiGPT, LLaMA, DeepSeek

'''
After training completes, the training graph is saved in `./figures/`.
The trained model is saved in `./saved_models`.
'''


# Inference example
$ python inference.py --prompt 'Where is Florida' --model DeepSeek --device cuda # the prompt can be changed


>>> 
============< DeepSeek Inference>===========
'user: Where is Florida'
'user: Where is Florida ai: Florida is a state located in the United States of America.'
============================================
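
The sample output suggests that the prompt is wrapped as `user: <prompt>` and that the generated continuation follows an `ai:` marker. Assuming that template (it is inferred from the output above, not confirmed against `inference.py`), the answer could be separated from the generated string with a small helper. The names `build_prompt` and `extract_answer` are hypothetical.

```python
USER_TAG = "user: "   # assumption: prefix seen in the sample output
AI_TAG = " ai: "      # assumption: marker seen before the generated answer


def build_prompt(question: str) -> str:
    """Wrap a raw question in the 'user: ...' form seen in the sample output."""
    return f"{USER_TAG}{question}"


def extract_answer(generated: str) -> str:
    """Return the text after the first ' ai: ' marker, or '' if the marker is absent."""
    _, sep, answer = generated.partition(AI_TAG)
    return answer.strip() if sep else ""


prompt = build_prompt("Where is Florida")
generated = (
    "user: Where is Florida ai: "
    "Florida is a state located in the United States of America."
)
answer = extract_answer(generated)
```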

Use Models or Modules

import torch
from gptModules import models, layers


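VOCAB_SIZE = 256
DEVICE = 'cuda:0' if torch.cuda.is_available() else 'cpu' # fall back to CPU when CUDA is unavailable

x = torch.randint(0, VOCAB_SIZE, size=(1, 5)).to(DEVICE) # input token IDs: Batch(1) x Seq(5)
att_mask = torch.ones(1, 5, dtype=torch.int).to(DEVICE) # attention mask: 1 for every position

# Example values:
# x = tensor([[ 43, 224,  12, 199, 212]])
# att_mask = tensor([[1, 1, 1, 1, 1]])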


model = models.GPT(
    vocab_size=VOCAB_SIZE,
    n_layers=2,
    n_heads=4,
    d_model=64,
    d_ff=1024,
    max_seq_length=64,
).to(DEVICE)


out = model(x, att_mask) # raw model output; softmax is not applied
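
Because the model output has no softmax applied, the caller turns it into probabilities. The sketch below shows one greedy decoding step under the assumption that the output has shape (batch, seq, vocab), which the source does not state. `TinyLM` is a hypothetical stand-in with the same call shape as the example, `model(x, att_mask)`, so the snippet runs without gptModules installed.

```python
import torch
import torch.nn.functional as F
from torch import nn

VOCAB_SIZE = 256


class TinyLM(nn.Module):
    """Hypothetical stand-in: model(x, att_mask) -> logits of shape (batch, seq, vocab)."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor, att_mask: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(x))  # att_mask is ignored in this stand-in


model = TinyLM(VOCAB_SIZE).eval()
x = torch.randint(0, VOCAB_SIZE, size=(1, 5))   # Batch(1) x Seq(5)
att_mask = torch.ones(1, 5, dtype=torch.int)

with torch.no_grad():
    logits = model(x, att_mask)                        # assumed shape: (batch, seq, vocab)
    probs = F.softmax(logits[:, -1, :], dim=-1)        # distribution over the next token
    next_token = probs.argmax(dim=-1, keepdim=True)    # greedy choice, shape (batch, 1)

# Append the predicted token and extend the mask for the next decoding step.
x = torch.cat([x, next_token], dim=1)
att_mask = torch.cat([att_mask, torch.ones(1, 1, dtype=torch.int)], dim=1)
```

Repeating the last two steps in a loop yields greedy generation; sampling from `probs` with `torch.multinomial` instead of `argmax` gives stochastic output.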

Development Environment

Language

+ Python 3.9.1

Library

+ tqdm 4.67.1
+ pytorch 2.1.2+cu118
+ transformers 4.49.0

License

This project is licensed under the terms of the MIT license.
