SYSTEM_ONLINE :: PORTFOLIO_v2.0.0

Swarnavo
Mukherjee

_
CUDARustLLM InferenceGPU HypervisorRing-0 Kernel
SCROLL
// SECTION_01

PROFILE.

Swarnavo Mukherjee
IDENTITY_CONFIRMED
Swarnavo Mukherjee
Kolkata, India · techmonk000
SYS_AUTH · v2.0
● ACCESS_GRANTED
HANDLE
techmonk000
CLEARANCE
RING-0
ARCH
x86_64
PRIMARY
Rust / CUDA
THREAT_SCAN
MALWARECLEAN
INTRUSIONBLOCKED
ANOMALY0 FOUND
KERNEL_STATUS
NEBIONRUNNING
CYBER-OSARMED
CUDAsm_89
UID · 0x4E4542494F4E
SESSION · ENCRYPTED
swarnavo@nebion:~
$
$
$
$
$
$
$
$
$
$
$
PERFORMANCE_METRICS
0%
False Positive Reduction (Cyber-OS)
0.0×
Concurrent Users (Nebion H100)
0%
Report Time Reduced (Alliance)
0+
Lines of no-std kernel code


BIO

I build software that operates at the intersection of hardware and intelligence. My work spans hand-written CUDA kernels for flash attention, a multi-tenant GPU hypervisor that breaks the HBM concurrent-user ceiling, and a ring-0 OS kernel with in-kernel malware classification.

With my previous experience of building core systems for my company, I write software that is scalable and sits at the hardware layer — schedulers, memory managers, interrupt handlers, and tensor-core GEMM kernels.

RustC++/CUDAPythonTypeScriptPTX inline ASMTokio asyncNext.jsDocker
SCROLL
// SECTION_02

PROJECTS.

PROJECT_01

Nebion

GPU Hypervisor Engine
GITHUB
RustCUDAC++PTX

Multi-tenant LLM inference server. Runs Qwen-70B on a single H100.

  • Hand-written CUDA kernels: GEMM, flash attention, paged KV decode
  • OpenAI-compatible HTTP surface via Axum + Tokio
  • 4-tier adaptive quantization (FP16/FP8 E4M3/INT8/INT4) with attention-mass EMA solver
  • 3-tier KV cache fabric: HBM → Pinned Host RAM → NVMe
  • SLA-aware admission with mid-flight P0–P3 preemption
  • 2-3× more concurrent users on Qwen-7B / H100
Concurrent users
70B
Params on H100
FP8
Min quantization
PROJECT_02

Cyber-OS

AI-Kernel Security Subsystem
GITHUB
RustWHPXx86_64 ASMC++

Ring-0 OS kernel with in-kernel ML malware classification.

  • Rust WHPX VMM + ~5K-line no-std x86_64 kernel
  • Own GDT/IDT/TSS/4-level paging with NX, preemptive scheduler
  • SYSCALL/SYSRET — full ring-0 stack, sub-µs interrupt handling at 100 Hz
  • In-kernel 9-stump gradient-boosted malware classifier (sub-µs inference)
  • MalConv2 CNN + LightGBM-ABI with Authenticode + reputation filtering
  • Cut false positives 96%: 1,198 → 50 on a 22,000-file scan
96%
FP reduction
22K
Files scanned
<1µs
Interrupt latency
SCROLL
// SECTION_03

SKILLS.

// gemm_fp16.cu — scroll to explore the kernel

gemm_fp16.cunebion / crates / nebion-cuda
1#include <cuda_runtime.h>
2#include <cuda_fp16.h>
3#include <cublas_v2.h>
4#include <cstddef>
5 
6namespace nebion {
7 
8constexpr int BM = 64;
9constexpr int BN = 64;
10constexpr int BK = 16;
11constexpr int TM = 4;
12constexpr int TN = 4;
13 
14static_assert(BM % TM == 0, "BM must be divisible by TM");
15static_assert(BN % TN == 0, "BN must be divisible by TN");
16static_assert((BM / TM) * (BN / TN) == 256, "block must have 256 threads");
17 
18__global__ void gemm_f16_kernel(
19 const int M, const int N, const int K,
20 const float alpha,
21 const __half* __restrict__ A,
22 const __half* __restrict__ B,
23 const float beta,
24 __half* __restrict__ C)
25{
26 const int block_row = blockIdx.y * BM;
27 const int block_col = blockIdx.x * BN;
28 
29 const int tx = threadIdx.x;
30 const int ty = threadIdx.y;
31 
32 const int row = block_row + ty * TM;
33 const int col = block_col + tx * TN;
34 
35 float c_reg[TM][TN] = {{0.0f}};
36 
37 __shared__ __half A_tile[BM][BK];
38 __shared__ __half B_tile[BK][BN];
39 
40 const int tid = ty * (BN / TN) + tx;
41 
42 for (int k0 = 0; k0 < K; k0 += BK) {
43 #pragma unroll
44 for (int i = 0; i < 4; i++) {
45 const int linear = tid * 4 + i;
46 const int tile_r = linear / BK;
47 const int tile_c = linear % BK;
48 const int g_r = block_row + tile_r;
49 const int g_c = k0 + tile_c;
50 A_tile[tile_r][tile_c] =
51 (g_r < M && g_c < K) ? A[g_r * K + g_c] : __float2half(0.0f);
52 }
53 
54 #pragma unroll
55 for (int i = 0; i < 4; i++) {
56 const int linear = tid * 4 + i;
57 const int tile_r = linear / BN;
58 const int tile_c = linear % BN;
59 const int g_r = k0 + tile_r;
60 const int g_c = block_col + tile_c;
61 B_tile[tile_r][tile_c] =
62 (g_r < K && g_c < N) ? B[g_r * N + g_c] : __float2half(0.0f);
63 }
64 
65 __syncthreads();
66 
67 #pragma unroll
68 for (int kk = 0; kk < BK; kk++) {
69 __half a_col[TM];
70 __half b_row[TN];
71 
72 #pragma unroll
73 for (int i = 0; i < TM; i++) {
74 a_col[i] = A_tile[ty * TM + i][kk];
75 }
76 #pragma unroll
77 for (int j = 0; j < TN; j++) {
78 b_row[j] = B_tile[kk][tx * TN + j];
79 }
80 #pragma unroll
81 for (int i = 0; i < TM; i++) {
82 const float a_f = __half2float(a_col[i]);
83 #pragma unroll
84 for (int j = 0; j < TN; j++) {
85 c_reg[i][j] += a_f * __half2float(b_row[j]);
86 }
87 }
88 }
89 
90 __syncthreads();
91 }
92 
93 #pragma unroll
94 for (int i = 0; i < TM; i++) {
95 const int g_r = row + i;
96 if (g_r >= M) continue;
97 #pragma unroll
98 for (int j = 0; j < TN; j++) {
99 const int g_c = col + j;
100 if (g_c >= N) continue;
101 const float prev = (beta == 0.0f) ? 0.0f : __half2float(C[g_r * N + g_c]) * beta;
102 const float out = alpha * c_reg[i][j] + prev;
103 C[g_r * N + g_c] = __float2half(out);
104 }
105 }
106}
107 
108} // namespace nebion
109 
110extern "C" {
111 
112int nebion_gemm_f16(
113 int M, int N, int K,
114 float alpha,
115 const void* A,
116 const void* B,
117 float beta,
118 void* C)
119{
120 if (M <= 0 || N <= 0 || K <= 0) return 0;
121 if (A == nullptr || B == nullptr || C == nullptr) {
122 return static_cast<int>(cudaErrorInvalidValue);
123 }
124 if (M % nebion::BM != 0 || N % nebion::BN != 0) {
125 return static_cast<int>(cudaErrorInvalidValue);
126 }
127 
128 dim3 block(nebion::BN / nebion::TN, nebion::BM / nebion::TM);
129 dim3 grid(N / nebion::BN, M / nebion::BM);
130 
131 nebion::gemm_f16_kernel<<<grid, block>>>(
132 M, N, K, alpha,
133 static_cast<const __half*>(A),
134 static_cast<const __half*>(B),
135 beta,
136 static_cast<__half*>(C)
137 );
138 
139 cudaError_t err = cudaGetLastError();
140 return static_cast<int>(err);
141}
142 
143static cublasHandle_t g_cublas_handle = nullptr;
144 
145static cublasStatus_t ensure_handle()
146{
147 if (g_cublas_handle == nullptr) {
148 return cublasCreate(&g_cublas_handle);
149 }
150 return CUBLAS_STATUS_SUCCESS;
151}
152 
153static int run_cublas_hgemm(
154 int M, int N, int K,
155 float alpha_f,
156 const void* A,
157 const void* B,
158 float beta_f,
159 void* C,
160 cublasGemmAlgo_t algo)
161{
162 if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
163 
164 const float alpha = alpha_f;
165 const float beta = beta_f;
166 
167 cublasStatus_t status = cublasGemmEx(
168 g_cublas_handle,
169 CUBLAS_OP_N, CUBLAS_OP_N,
170 N, M, K,
171 &alpha,
172 B, CUDA_R_16F, N,
173 A, CUDA_R_16F, K,
174 &beta,
175 C, CUDA_R_16F, N,
176 CUBLAS_COMPUTE_32F,
177 algo
178 );
179 
180 if (status != CUBLAS_STATUS_SUCCESS) {
181 return static_cast<int>(status);
182 }
183 
184 cudaError_t err = cudaGetLastError();
185 return static_cast<int>(err);
186}
187 
188int nebion_cublas_hgemm_tc(
189 int M, int N, int K,
190 float alpha_f,
191 const void* A,
192 const void* B,
193 float beta_f,
194 void* C)
195{
196 if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
197 
198 cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH);
199 
200 const float alpha = alpha_f;
201 const float beta = beta_f;
202 
203 cublasStatus_t status = cublasGemmEx(
204 g_cublas_handle,
205 CUBLAS_OP_N, CUBLAS_OP_N,
206 N, M, K,
207 &alpha,
208 B, CUDA_R_16F, N,
209 A, CUDA_R_16F, K,
210 &beta,
211 C, CUDA_R_16F, N,
212 CUBLAS_COMPUTE_32F_FAST_16F,
213 CUBLAS_GEMM_DEFAULT_TENSOR_OP
214 );
215 if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status);
216 
217 cudaError_t err = cudaGetLastError();
218 return static_cast<int>(err);
219}
220 
221int nebion_cublas_hgemm_cuda_cores(
222 int M, int N, int K,
223 float alpha_f,
224 const void* A,
225 const void* B,
226 float beta_f,
227 void* C)
228{
229 if (ensure_handle() != CUBLAS_STATUS_SUCCESS) return -1;
230 
231 cublasSetMathMode(g_cublas_handle, CUBLAS_PEDANTIC_MATH);
232 
233 const float alpha = alpha_f;
234 const float beta = beta_f;
235 
236 cublasStatus_t status = cublasGemmEx(
237 g_cublas_handle,
238 CUBLAS_OP_N, CUBLAS_OP_N,
239 N, M, K,
240 &alpha,
241 B, CUDA_R_16F, N,
242 A, CUDA_R_16F, K,
243 &beta,
244 C, CUDA_R_16F, N,
245 CUBLAS_COMPUTE_32F,
246 CUBLAS_GEMM_DEFAULT
247 );
248 
249 cublasSetMathMode(g_cublas_handle, CUBLAS_DEFAULT_MATH);
250 
251 if (status != CUBLAS_STATUS_SUCCESS) return static_cast<int>(status);
252 cudaError_t err = cudaGetLastError();
253 return static_cast<int>(err);
254}
255 
256} // extern "C"
CUDA C++·UTF-8● nebion-cuda
GPU / CUDA
CUDAPTX inline ASMWMMA Tensor-CoreFlash Attention KernelsPaged KV Decode
Systems
RustC++Tokio async runtimex86_64 Ring-0WHPX VMMno-std kernel
ML / Research
PyTorchsafe-tensorsKV CompressionQuantization (FP8/INT8/INT4)LightGBM
Backend
Axum + TowerNext.js App RouterNode.jsPostgresDockerClerkREST
SCROLL
// SECTION_04

EXP_LOG.

POSITION_01

Software Engineer

J.P. Morgan Chase & Co.
August 2026 · Upcoming
UPCOMING
POSITION_02

Software Engineer

Alliance Australia Property Pvt. Ltd
May 2025 — April 2026 · Kolkata, India
COMPLETED
dmesg | grep -i "alliance"
[05/2025]INFO
Joined Alliance Australia Property Pvt. Ltd as Software Engineer
[05/2025]INIT
Bootstrapped LLM pipeline for AI-based property valuation reports
[06/2025]PERF
Data ingestion + automation → 60% reduction in manual report time
[07/2025]KERNEL
Architected handwritten x86_64 kernel — detected 4+ critical vulnerabilities
[08/2025]PERF
System security hardened by 40% via kernel-level threat isolation
[09/2025]REVENUE
Software generated 100 leads — company hit $10K MRR milestone
[04/2026]EXIT
Role completed — full-cycle: infra engineer → kernel author → revenue driver
EDUCATION
B.Tech Computer Science
University of Engineering & Management
Kolkata, India
KEY_ACHIEVEMENTS
MRR Generated$10K
Leads via Software100+
Time Reduction60%
Security Improvement40%
LANGUAGES
Rust95%
C++/CUDA90%
Python85%
TypeScript80%
Java65%
SCROLL
// SECTION_05

CONNECT.

Building something at the intersection of hardware and intelligence?
Let's talk.

swarnavo@nebion:~ — contact.sh
$whoami
Swarnavo Mukherjee // GPU Systems · LLM Inference · CUDA
$cat location.txt
Kolkata, India · Open to remote worldwide
$cat availability.txt
Open to: Full-time · Contract · Research Collabs
$echo $EMAIL
swarnavomukherjee03@gmail.com
$
SWARNAVO.SYS v2.0.0