From HumanNeRF to PersonNeRF
To reproduce Chung-Yi Weng's research, the best starting point is his HumanNeRF, the core technology underlying much of his later work at Google.
The steps and key components below approach the reproduction from the perspective of the technical paper:
1. Obtain the source code and set up the environment
Most of his research is open-sourced on GitHub.
Core project: HumanNeRF GitHub Repo
Development environment:
Hardware: It is recommended to have at least one GPU with 24GB VRAM (e.g., RTX 3090 or 4090), as neural rendering is highly memory-intensive.
Software: PyTorch is the primary development framework.
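Before starting a long training run, it can help to confirm that the GPU actually meets the 24GB recommendation above. The following is a minimal sketch, not part of the HumanNeRF repository: the function names and the 5% tolerance are assumptions made for illustration (drivers usually report slightly less than a card's nominal capacity). The only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.get_device_properties()`.

```python
def meets_vram_requirement(total_bytes, required_gib=24.0, tolerance=0.05):
    """Return True if `total_bytes` is within `tolerance` of the required size.

    The tolerance is an assumption: a nominal 24GB card typically reports a
    little less than 24 GiB of usable memory.
    """
    required_bytes = required_gib * (1024 ** 3) * (1.0 - tolerance)
    return total_bytes >= required_bytes


def detect_gpu_memory_bytes(device_index=0):
    """Return the total memory of one GPU in bytes, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for the live check
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(device_index).total_memory


if __name__ == "__main__":
    total = detect_gpu_memory_bytes()
    if total is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    else:
        ok = meets_vram_requirement(total)
        print(f"GPU memory: {total / 1024 ** 3:.1f} GiB, sufficient: {ok}")
```

If the check fails, reducing the number of rays per batch is the usual first thing to try, at the cost of longer training.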
2. Three key focus areas for core algorithm replication
If you want to implement or modify the model yourself, pay attention to the following three key modules:
· Canonical Space modeling: Do not fit a NeRF directly to the posed video frames; instead, learn a single static NeRF of the subject in a canonical "T-pose."
This lets the network learn stable appearance features (e.g., skin texture) without interference from motion.
· Motion Field and SMPL model: Use SMPL (a parametric human body model) as a prior. Reproduction requires a body-pose estimate for every frame, from which you compute the backward warp that maps each frame's points into the canonical space. This part typically uses an MLP to predict offsets.
· Non-rigid refinement: This is the most ingenious part of the paper. Skeletal motion alone cannot account for moving clothing, so an additional refinement module is needed to capture these details.
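The three modules above fit together as a single mapping from a point in an observed frame to a point in the canonical space. The sketch below is a deliberately simplified illustration in plain Python, not the paper's implementation: `canonical_weight` and `nonrigid_offset` are hypothetical callables standing in for the learned components (in a real model they would be neural networks), and each bone is assumed to come with a rigid transform that already maps observation space to canonical space.

```python
from typing import Callable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# A bone is (R, t): a 3x3 rotation and a translation mapping observed -> canonical.
Bone = Tuple[Sequence[Sequence[float]], Sequence[float]]


def rigid_apply(R, t, x) -> Vec3:
    """Return R @ x + t for a 3x3 matrix R and 3-vectors t, x."""
    return tuple(sum(R[r][c] * x[c] for c in range(3)) + t[r] for r in range(3))


def backward_warp(
    x_obs: Vec3,
    bones: Sequence[Bone],
    canonical_weight: Callable[[int, Vec3], float],
    nonrigid_offset: Callable[[Vec3], Vec3],
) -> Optional[Vec3]:
    """Map an observed-space point to canonical space.

    1. Skeletal motion: blend the per-bone rigid transforms, weighting each
       bone by how strongly it claims the transformed point.
    2. Non-rigid refinement: add a small offset on top of the skeletal result.
    Returns None when no bone claims the point (it lies off the body).
    """
    candidates = [rigid_apply(R, t, x_obs) for R, t in bones]
    raw = [canonical_weight(i, p) for i, p in enumerate(candidates)]
    total = sum(raw)
    if total <= 1e-8:
        return None
    weights = [w / total for w in raw]
    x_skel = tuple(
        sum(w * p[k] for w, p in zip(weights, candidates)) for k in range(3)
    )
    dx = nonrigid_offset(x_skel)
    return tuple(a + b for a, b in zip(x_skel, dx))


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    bones = [(identity, (0.0, 0.0, 0.0)), (identity, (2.0, 0.0, 0.0))]
    # Toy stand-ins: equal bone weights and a constant upward offset.
    print(backward_warp((0.0, 0.0, 0.0), bones,
                        lambda i, p: 1.0, lambda p: (0.0, 0.1, 0.0)))
```

The canonical NeRF is then queried at the returned point, so appearance is only ever learned in the static T-pose space.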
3. Dataset preparation
The hardest part of replicating such models is often data processing.
· ZJU-MoCap: This is the most commonly used multi-view human motion dataset for replicating HumanNeRF.
· Custom data: If you want to reproduce the single-view-video-to-3D setting, you need a video in which the camera circles the person (or the person turns in place); then use COLMAP to extract camera poses, or VIBE/ROMP to estimate the human body pose.
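Whichever tools produce the estimates, each frame ends up needing the same bundle of information: an image, a camera, and SMPL parameters. The record below is a hypothetical layout for illustration only; the field names are assumptions and do not reflect the file format of the HumanNeRF repository or the output format of COLMAP, VIBE, or ROMP. The sizes checked are the standard SMPL ones: 72 pose values (24 joints × 3 axis-angle components) and 10 shape coefficients.

```python
from dataclasses import dataclass
from typing import List


def _is_matrix(m, rows, cols):
    """Return True if m is a rows x cols nested list."""
    return len(m) == rows and all(len(row) == cols for row in m)


@dataclass
class FrameRecord:
    image_path: str
    K: List[List[float]]      # 3x3 camera intrinsics
    R: List[List[float]]      # 3x3 world-to-camera rotation
    T: List[float]            # 3-vector world-to-camera translation
    smpl_pose: List[float]    # 72 values: 24 joints x 3 (axis-angle)
    smpl_betas: List[float]   # 10 shape coefficients

    def validate(self) -> None:
        """Raise ValueError if any field has an unexpected shape."""
        if not _is_matrix(self.K, 3, 3):
            raise ValueError("K must be 3x3")
        if not _is_matrix(self.R, 3, 3):
            raise ValueError("R must be 3x3")
        if len(self.T) != 3:
            raise ValueError("T must have 3 entries")
        if len(self.smpl_pose) != 72:
            raise ValueError("smpl_pose must have 72 entries")
        if len(self.smpl_betas) != 10:
            raise ValueError("smpl_betas must have 10 entries")
```

Validating every frame before training tends to catch mismatched or dropped frames early, which is where much of the data-processing difficulty mentioned above comes from.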
4. Key technical metrics and tuning
During experimentation, monitor the following metrics to ensure you are on the right track:
· PSNR / SSIM: Measure image quality (higher is better).
· LPIPS: Measures perceptual realism (lower is better), which is especially important for human rendering.
· Training time: A typical HumanNeRF training run on a single RTX 3090 may take 48–72 hours to reach the level of detail shown in the paper.
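Of the metrics above, PSNR is simple enough to compute by hand, which makes it a useful sanity check on any evaluation script. It is defined as 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below assumes images are given as flat lists of floats in [0, 1]; the function names are illustrative. SSIM and LPIPS are not shown, since LPIPS in particular depends on a pretrained network.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10((max_value ** 2) / err)


if __name__ == "__main__":
    rendered = [0.20, 0.40, 0.60, 0.80]
    ground_truth = [0.25, 0.35, 0.60, 0.90]
    print(f"PSNR: {psnr(rendered, ground_truth):.2f} dB")
```

For 8-bit images stored as integers in 0–255, pass `max_value=255.0` instead of rescaling.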
5. Advanced: From HumanNeRF to PersonNeRF
If you want to take on a Google-level PersonNeRF (handling photos taken at different times), you need to additionally implement:
· Appearance Embedding: Assign a separate learned vector to each photo so the model can capture per-photo differences such as "this photo has a jacket, that one does not."
· Hash Grid Encoding: Follow the Instant-NGP technique to speed up training; it is widely adopted in recent Google research.
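Both additions can be illustrated without a deep-learning framework. The sketch below is a plain-Python illustration under stated assumptions, not PersonNeRF's code: the class and function names are hypothetical, and the embedding table is filled with fixed random values where a real model would use learnable parameters (for example `torch.nn.Embedding`). The spatial hash follows the scheme described for Instant-NGP: XOR the integer grid coordinates multiplied by the primes 1, 2654435761, and 805459861, then reduce modulo the table size.

```python
import math
import random

PRIMES = (1, 2654435761, 805459861)  # per-axis primes of the spatial hash


def level_resolutions(n_levels, n_min, n_max):
    """Grid resolution per level, growing geometrically from n_min to about n_max."""
    if n_levels == 1:
        return [n_min]
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return [int(math.floor(n_min * growth ** level)) for level in range(n_levels)]


def hash_index(ix, iy, iz, table_size):
    """Hash non-negative integer grid coordinates into [0, table_size)."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return (h & 0xFFFFFFFF) % table_size  # mask to mimic 32-bit arithmetic


class AppearanceEmbedding:
    """One vector per photo; stands in for a learnable embedding table."""

    def __init__(self, num_images, dim, seed=0):
        rng = random.Random(seed)
        self.table = [
            [rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_images)
        ]

    def __call__(self, image_id):
        return self.table[image_id]


def color_network_input(point_features, embedding):
    """Concatenate per-point features with the photo's appearance vector."""
    return list(point_features) + list(embedding)


if __name__ == "__main__":
    print(level_resolutions(4, 16, 128))
    print(hash_index(3, 5, 7, 2 ** 19))
    emb = AppearanceEmbedding(num_images=3, dim=4)
    print(color_network_input([0.1, 0.2], emb(1)))
```

The design point is that geometry stays shared across all photos while only the color branch sees the per-photo vector, so changes in clothing or lighting do not have to be explained by changes in shape.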
Suggested learning path
If you are new to this field, it is recommended to read and implement in the following order: