This paper presents a research proposal for an ASIC implementation of deconvolution kernels optimized for sub-10 µs detection latency, combining a customized systolic array, INT4 quantization, hierarchical memory, and streaming dataflow.
Key findings
The proposed ASIC architecture achieves 5-8 µs detection latency at 1 GHz in 7 nm CMOS technology.
Novel systolic array optimized for transposed convolution, with explicit management of overlapping partial sums.
Quantization-aware training enables aggressive INT4 weight representation with <1% accuracy degradation.
Streaming memory hierarchy supports continuous event-based input processing without pipeline stalls.
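To make the findings above concrete, the following is a minimal software sketch, not the proposed hardware design. It shows the scatter-accumulate form of a 1D transposed convolution, where output positions receive overlapping contributions whenever the stride is smaller than the kernel length. It also shows a symmetric per-tensor INT4 weight quantizer and the cycle budget implied by a latency target at a given clock. The function names (`quantize_int4`, `transposed_conv1d`, `cycle_budget`) and the symmetric per-tensor scheme are illustrative assumptions; the source does not specify them.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to the signed INT4 range [-8, 7].

    Assumption: one shared scale for the whole tensor; the proposal's
    actual quantization scheme is not specified in the source.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the INT4 range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def transposed_conv1d(x, kernel, stride):
    """1D transposed convolution (no padding) via scatter-accumulate.

    Each input sample scatters a scaled copy of the kernel into the
    output. When stride < len(kernel), neighbouring copies overlap and
    their contributions are summed at the shared output positions.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0] * out_len
    for i, xi in enumerate(x):
        for k, wk in enumerate(kernel):
            out[i * stride + k] += xi * wk  # overlapping sums accumulate here
    return out


def cycle_budget(latency_us, clock_ghz):
    """Clock cycles available within a latency target."""
    return int(round(latency_us * 1e3 * clock_ghz))


if __name__ == "__main__":
    q, scale = quantize_int4([0.7, -0.3, 0.1])
    print(q)
    print(transposed_conv1d([1, 2, 3], [1, 1, 1], stride=2))
    print(cycle_budget(5, 1.0), cycle_budget(8, 1.0))
```

With a kernel of length 3 and a stride of 2, output positions 2 and 4 each receive contributions from two adjacent input samples. These are the overlapping sums that a hardware array would need to accumulate before writing results back. At 1 GHz, the 5-8 µs latency range corresponds to a budget of 5,000 to 8,000 clock cycles.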
Limitations & open questions
Quantization-induced accuracy degradation, clock distribution challenges, and thermal constraints under sustained operation are identified as key risk factors.