How can I replace the vision encoder and text encoder with a custom pretrained CLIP vision encoder and text encoder?