Decoding partial UTF-8 into NSString
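When fetching a UTF-8-encoded file over the network with the NSURLConnection class, there is a good chance that the delegate's connection:didReceiveData: message will be sent with an NSData that cuts the UTF-8 file off mid-character: UTF-8 is a multibyte encoding scheme, so a single character can be split across two separate NSData objects.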
In other words, if I join all the data I get from connection:didReceiveData:, I will have a valid UTF-8 file, but each individual chunk is not necessarily valid UTF-8.
I don't want to keep the entire downloaded file in memory.
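What I want is: given an NSData, decode whatever you can into an NSString. If the last few bytes of the NSData are an incomplete multibyte sequence, tell me, so I can save them for the next NSData.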
One obvious solution is to repeatedly try decoding with initWithData:encoding:, truncating the last byte each time, until it succeeds. Unfortunately, this can be very wasteful.
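If you want to make sure you don't stop in the middle of a UTF-8 multibyte sequence, look at the end of the byte array and check the top 2 bits of the last byte.
- If the top bit is 0, it is a single-byte ASCII character and you're done.
- If the top bit is 1 and the next-to-top bit is 0, it is a continuation byte and might be the last byte of a sequence, so you need to buffer that byte for later and then look at the byte before it.
- If the top bit is 1 and the next-to-top bit is also 1, it is the first byte of a multibyte sequence, and you determine how many bytes the sequence has by looking for the first 0 bit.
See the multibyte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8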
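The bit checks in the list above can be sketched as a small classifier in plain C, which can be called directly from Objective-C. The function name utf8_classify_byte and its return convention are assumptions made for this illustration, not part of any Apple API.

```c
#include <stddef.h>

/* Hypothetical helper: classify one byte by its top bits.
 * Returns  1..4  total length of the sequence this byte starts,
 *          0     if it is a continuation byte (10xxxxxx),
 *         -1     if it can never start a sequence (11111xxx). */
static int utf8_classify_byte(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII, complete on its own */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation byte */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: starts a 2-byte sequence */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: starts a 3-byte sequence */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: starts a 4-byte sequence */
    return -1;                         /* not a valid lead byte */
}
```

Note the parentheses around each mask: in C and Objective-C, == binds tighter than &, so writing b & 0x80 == 0 would test the wrong thing.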
    // Assumes that receivedData contains both the leftovers and the new data.
    const unsigned char *data = [receivedData bytes];
    NSUInteger byteCount = [receivedData length];
    if (byteCount < 1)
        return nil; // or @""

    unsigned char lastByte = data[byteCount - 1];
    if ((lastByte & 0x80) == 0) {
        // The last byte is ASCII, so the buffer ends on a character boundary.
        NSString *newString = [[NSString alloc] initWithBytes:data
                                                       length:byteCount
                                                     encoding:NSUTF8StringEncoding];
        // verify success (nil means the data was not valid UTF-8)
        // remove bytes from mutable receivedData, or set overflow to empty
        return newString;
    }

    // Now eat all of the trailing continuation bytes (10xxxxxx).
    NSUInteger backCount = 0;
    while (byteCount > 0 && (data[byteCount - 1] & 0xc0) == 0x80) {
        backCount++;
        byteCount--;
    }
    // At this point, either we have exhausted byteCount or data[byteCount - 1]
    // is the initial byte of the sequence. If we exhausted the byte count we are
    // probably in an illegal sequence, as the initial byte should always be in
    // receivedData.
    if (byteCount < 1) {
        // error!
        return nil;
    }
    lastByte = data[byteCount - 1];

    // Compute how many continuation bytes the initial byte calls for, to
    // determine whether we have exactly the right number of bytes to decode.
    NSUInteger requiredBytes = 0;
    if ((lastByte & 0xe0) == 0xc0) {        // 110xxxxx, 2-byte sequence
        requiredBytes = 1;
    } else if ((lastByte & 0xf0) == 0xe0) { // 1110xxxx, 3-byte sequence
        requiredBytes = 2;
    } else if ((lastByte & 0xf8) == 0xf0) { // 11110xxx, 4-byte sequence
        requiredBytes = 3;
    } else {
        // Illegal UTF-8: continuation bytes after an ASCII byte, or an
        // obsolete 5- or 6-byte lead byte.
        return nil;
    }
    if (backCount > requiredBytes) {
        // Illegal UTF-8: more continuation bytes than the initial byte allows.
        return nil;
    }

    // Now we know how many continuation bytes we need and how many (backCount)
    // we have, so either use them or take the initial byte away as well.
    if (requiredBytes == backCount) {
        // We have the right number of bytes.
        byteCount += backCount;
    } else {
        // The sequence is incomplete, so leave the initial byte for the next chunk.
        byteCount -= 1;
    }

    NSString *newString = [[NSString alloc] initWithBytes:data
                                                   length:byteCount
                                                 encoding:NSUTF8StringEncoding];
    // verify success
    // remove byteCount bytes from mutable receivedData, or set overflow to the
    // bytes between byteCount and [receivedData length]
    return newString;
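UTF-8 is a very simple encoding to parse. It was designed so that incomplete sequences are easy to detect and so that, if you start in the middle of a sequence, you can find its beginning.
Search backward from the end for a byte that is either <= 0x7f or >= 0xc0. If it is <= 0x7f, it is complete on its own. If it is between 0xc0 and 0xdf, inclusive, it needs one following byte to be complete. If it is between 0xe0 and 0xef, it needs two following bytes to be complete. If it is >= 0xf0, it needs three following bytes to be complete.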
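As a rough illustration of that backward search, here is a minimal sketch in plain C (callable from Objective-C). The function name utf8_complete_prefix_length is made up for this example. It assumes the buffer starts on a character boundary, and it does not validate the data; it only decides where to split.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of buf can be decoded
 * now. The remaining len - result bytes are the start of an incomplete
 * character and should be carried over to the next chunk. */
static size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len)
{
    size_t i = len;

    /* A character is at most 4 bytes long, so look back at most 4 bytes. */
    while (i > 0 && len - i < 4) {
        unsigned char b = buf[i - 1];
        size_t have = len - i + 1;  /* bytes from this one to the end */
        size_t need;                /* total bytes this character requires */

        if (b <= 0x7F)      need = 1;
        else if (b >= 0xF0) need = 4;  /* 0xf8..0xff are invalid; the decoder will reject them */
        else if (b >= 0xE0) need = 3;
        else if (b >= 0xC0) need = 2;
        else { i--; continue; }        /* 0x80..0xbf: continuation byte, keep looking */

        /* Complete: keep everything. Incomplete: split just before the lead byte. */
        return (have >= need) ? len : i - 1;
    }

    /* No lead byte within the last 4 bytes (or empty input): the data is
     * malformed, so return everything and let the decoder report the error. */
    return len;
}
```

On the Objective-C side, the first result bytes could then be passed to initWithBytes:length:encoding: with NSUTF8StringEncoding, and the rest prepended to the next NSData received.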
I had a similar problem: a partially decoded UTF-8 string.
Before:
    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // Bug: length counts UTF-16 code units, not UTF-8 bytes, so non-ASCII text gets cut short.
    adsInfo->adsTopic = malloc(sizeof(char) * adsTopic.length + 1);
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], adsTopic.length + 1);
After (solved):
    NSString *adsTopic = [components[2] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // Size the buffer by the UTF-8 byte count, not by the NSString length.
    NSUInteger byteCount = [adsTopic lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"number of UTF-8 bytes in the string topic == %lu", (unsigned long)byteCount);
    adsInfo->adsTopic = malloc(byteCount + 1);
    // byteCount + 1 also copies the terminating NUL.
    strncpy(adsInfo->adsTopic, [adsTopic UTF8String], byteCount + 1);
    NSString *text = [NSString stringWithCString:adsInfo->adsTopic encoding:NSUTF8StringEncoding];
    NSLog(@"=== %@", text);
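The underlying issue is that a string's character count and its UTF-8 byte count differ as soon as non-ASCII text is involved. A small plain C sketch of the distinction follows; both helper names are invented for this example, and the counter assumes its input is valid, NUL-terminated UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count code points by skipping continuation bytes
 * (10xxxxxx). Assumes s is valid, NUL-terminated UTF-8. */
static size_t utf8_code_point_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: duplicate a UTF-8 C string. The buffer is sized by
 * the byte count from strlen, plus one byte for the terminating NUL.
 * Returns NULL if allocation fails; the caller frees the result. */
static char *copy_utf8(const char *s)
{
    size_t byteCount = strlen(s);
    char *copy = malloc(byteCount + 1);
    if (copy != NULL)
        memcpy(copy, s, byteCount + 1);
    return copy;
}
```

For "h\xC3\xA9llo" ("héllo"), the counter reports 5 code points while strlen reports 6 bytes, so a buffer sized from the character count would be one byte too short and would cut the string off.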